R Exercise 6: Frequency data (Ch 8)

Let’s apply what we’ve learned in Chapter 8 to R.

Objectives

Calculate the chi-squared test statistic.
Be able to conduct a chi-squared test when you: (1) know chi-squared, and (2) when you don’t know chi-squared.
Estimate the µ frequency of Poisson data.
Calculate the expected probability and frequency of Poisson data.
Be able to conduct a chi-squared hypothesis test under the Poisson model.

New Commands:

R Command	Notes
table(df)	creates a frequency table of your dataframe; df is the name of your dataframe; if your dataframe has more than one column, you need specify each column using df$var
chisq.test(x, p, rescale.p = T)	runs a chi-squared test; x is your observed data, p is your expected probabilities or values, and rescale.p = T rescales probabilities if they do not equal 1. Do not use for Poisson p-value.
pchisq(q, df, lower.tail = F)	use if you know your chi-squared value AND/OR for the correct p-value for Poisson distributions; q is your chi-squared test statistic, df is your degrees of freedom, and lower.tail = F specifies it is a one-tailed test
dpois(x, lambda)	calculates expected probabilities for Poisson distributions; x is a vector of category values (e.g. 0:20), and lambda is the mean expected frequency (µ)

Reminder: Save your script for practice at home! :)

Shortcuts:

Symbol or Command	Keyboard Shortcut
<-	Alt + -
#	Shift + 3
Run one line in script	Ctrl + Enter
Run entire script	Ctrl + Shift + Enter
Open new script	Ctrl + Shift + N

Exercise 6
1. Open RStudio and prepare a new script.

Open a new script. All of your code for today’s exercises, and your notes and comments will go in this script. Write your filename, title, author, date, and description of the script.

# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)

# Description: (describe what this script is for)

2. We will learn how to conduct a chi-squared test under the proportional model.

Pt I: Conducting a chi-squared test
There are two ways to conduct a chi-squared test under the proportional model in RStudio. We will review both methods.

a) Method I: You do NOT know X^2
If you do not know or have yet to calculate X^2, you can use the function: chisq.test( )

You only need to enter in three pieces of information:

x - your observed values (df$var or vector)
p - your expected proportions (df$var or vector)
rescale.p = T - if your proportions do NOT equal 1, R needs to rescale them to 1 for the test to work

b) Method II: You already know X^2
If you already know or calculated X^2 by hand, you can use the function: pchisq( )

You only need to enter in three pieces of information:

q - your observed values (df$var or vector)
df - your expected proportions (df$var or vector)
lower.tail = F - specifies a one-taile test

Note: Chi-squared tests are always one-tailed!

Pt II: Method I with M and Ms
M and M’s are drawn from a bag and the colour of each M and M is recorded.

Read the file into R from the GitHub using the following URL: https://github.com/lczawadzki/biostats/raw/main/data/MandMlist.csv.

     MMlist <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/MandMlist.csv")

This dataframe lists the colours, but doesn’t tell us the frequency directly. To be able to view the frequency of each colour, we need to sum the data into a table using: table(df$var). Type the following below. Now we can see the frequency of colours in this bag of M and Ms.

     MMtable <- table(MMlist$color)
     MMtable

## 
##   blue  brown  green orange    red yellow 
##     10      5     12     11      6     11

Our observed values are in the table. We can assign these to the variable obs using the function c(), which will “contain” all of the observed values.

     obs <- c(10,5,12,11,6,11)
     obs

## [1] 10  5 12 11  6 11

The M&M company states that 24% are blue, 14% brown, 16% green, 20% orange, 13% red, and 13% yellow. These are our expected proportions. We can assign these to the variable exp using the function c().

     exp <- c(0.24,0.14,0.16,0.20,0.13,0.13)
     exp

## [1] 0.24 0.14 0.16 0.20 0.13 0.13

If you are asked for the expected frequency of each M and M, you will multiply the expected proportions by the TOTAL number of MandMs in our sample.

     exp*sum(MMtable)

Now that we have our observed values and expected proportions, we can run a chi-squared test. Remember, the function is chisq.test. All you need are your observed values (x), expected proportions (p), and to specify rescale.p = T.

     chisq.test(obs, p = exp, rescale.p = T)

Understanding your output
Let’s read through the output. The output tells you a number of things about your data:

Line 1 tells you where your observed values are coming from.
Line 2 states X-squared which is your chi-squared test statistic, df which is your degrees of freedom, and your p -value. Remember, the P-value is the probability that a sample from the null model would be as extreme, or more extreme, than your sample. That is, it is the probability of obtaining your data under the null hypothesis or expectation.

So, from our output, we know that:

Our data comes from the variable obs.
Our X-squared test statisic equals 5.1442, our df (degrees of freedom) equals 5, and our P -value = 0.3985.

We can conclude: “p > 0.05, therefore our results are statistically insignificant, and we fail to reject the H0. Our data match our null expectations, and are likely to occur due to chance. We conclude that the frequency of colours of M and Ms in our bag matches the expectations of the M and M company.”

Pt III: Method II with M and Ms

If you already calculated X^2 in R or by hand, you can use the function pchisq. All you need is your chi-squared test statistic (q), your degrees of freedom (df), and to specify that the lower tail is FALSE.

     pchisq(5.1442, df = 5, lower.tail = F)

Understanding your output
The output of pchisq is one number. This is your P -value. You will end up with the same P-value as you did with the output of chisq.test, since the df for both is the same.

So, from our output, we know that our P -value = 0.3985.

It is thought that the World Cup decides the best soccer team in the world. But how much do skill differences determine the outcome? We will look at data on number of goals scored by one or two teams (randomly chosen) in every game of the knockout round of the World Cup from 1986 through 2010 to determine if the number of goals scored is random (follows a Poisson distribution).

Read the file into R from the GitHub using the following URL: https://github.com/lczawadzki/biostats/raw/main/data/WorldCup.csv.

     worldcup <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/WorldCup.csv")
     head(worldcup)

This dataframe lists the number of goals, but doesn’t tell us the frequency directly. To be able to view the frequency of each number of goals in a knockout round, we need to sum the data into a table using: table(df$var). Type the following below.

     cuptable <- table(worldcup$score)
     cuptable

Our observed values are in the table. We can assign these to the variable obs using the function c(), which will “contain” all of the observed values.

     obs <- c(32, 44, 21, 10, 4, 1)
     obs

## [1] 32 44 21 10  4  1

Next, we find our mean expected frequency (µ or lambda). Even though we have frequency data, since each observation is listed in a single column, mean() will calculate the mean correctly.

     mean(worldcup$score) #this is µ or lambda

We need to calculate our expected probabilities (these are how we specify the null expectation). We use the function dpois(), and specify arguments x which is a vector of the category values, and lambda your expected mean value. We will assign this to the variable exp, so that we can plug this directly into our chi-squared function.

     exp <- dpois(x = 0:5, lambda = 1.223214)
     exp

Now that we have our observed and expected proportions, we can run a chi-squared test to get our test statistic. Remember, the function is chisq.test. All you need are your observed values (x), expected proportions (p), and to specify rescale.p = T.

     chisq.test(obs, p = exp, rescale.p = T)

## Warning in chisq.test(obs, p = exp, rescale.p = T): Chi-squared approximation
## may be incorrect

Notice an error has occurred! And p = NA! This error means we have violated an assumption of the chi-squared test. Some categories have expected values that are too small. How can you tell what those expected values are?

     expfreq <- nrow(worldcup)*exp
     expfreq

We see that one category has an expected value that is less than 1 (5 goals), and more than 20% of our expected values are less than 5 (4 goals and 5 goals). We need to combine some categories together to meet the assumptions of the chi-squared test. We can combine categories for 4 and 5, so the expected value is more than 1, and less than 20% of our expected values are less than 5.

Now that we know our categories will be 0, 1, 2, 3, and 4+, we need to create a new vector for our observed data, which we can assign to the variable obs.

     cuptable #check original observed values from table
     4 + 1 #find the sum of category 4 and category 5
     obs <- c(32, 44, 21, 10, 5) #observed values

We also need to create a new vector for our expected proportions, which we can assign to the variable exp We need the expected values to calculate chi-squared later.

     exp #check original expected proportions
     0.027451333+0.006715771 #find the sum of category 4 and category 5
     exp <- c(0.294282820, 0.359970866, 0.220160701, 0.089767884, 0.0341671) #expected values
     exp

Now that we have our new observed and expected values that meet assumptions, we can run a chi-squared test to get our test statistic.

     chisq.test(obs, p = exp, rescale.p = T)

## Warning in chisq.test(obs, p = exp, rescale.p = T): Chi-squared approximation
## may be incorrect

We can ignore the error this time because we know our expected values meet assumptions. It is just warning us that we have one expected value less than 5. This is fine in this case, because we have 5 observations, so 1/5 = only 20% of observations have an expected value of less than 5. We still meet assumptions.

Then we will use pchisq to get our P-value. The reason we do NOT use the P-value from chisq.test is because it does NOT use the correct degrees of freedom for a Poisson model. Remember, that df = number of categories - 1 - number of parameters. Since we calculated one parameter (µ) to make our estimations, the df for a Poisson model is df = number of categories - 1 - 1. For our data, this is df = 5 - 1 - 1 = 3.

     pchisq(1.2647, df = 3, lower.tail = F)

So, from both outputs, we know that:

Our data comes from the variable obs.
Our X-squared test statisic equals 1.2741.
Our P-value with the correct df, equals 0.737532.

We can conclude: "p > 0.05, therefore our results are statistically insignificant, and we fail to reject the H0. We conclude that the number of goals scored in the knockout round of the World Cup follows a Poisson distribution.

Is this supported by the variance? Recall, if the variance = mean then data follow a Poisson distribution.

     var(worldcup$score)

The variance = 1.256, which is close to the mean of 1.223. Additionally, the ratio of variance to mean = 1.02, which is very close to 1. We thus expect the data to follow a Poisson distribution.

Now it’s time to practice on your own.

Let’s look at a sample of birthdays over different days of the week.

Day of week	# of birthdays
Monday	33
Tuesday	41
Wednesday	63
Thursday	63
Friday	47
Saturday	56
Sunday	47
Total	350

a\. Enter the observed data (counts of birthdays across weekdays)   

b\. Assume we expect equal numbers of birthdays on all days of the week. What is your null hypothesis?  

c\. Conduct a chi-squared test to test this hypothesis. Note, if we expect equal numbers of birthdays on all days of the week, we do not need to enter "p" in the chi-squared test.  

d\. Do you reject or fail to reject your H0? What is your conclusion statement?

R Exercise 6: Frequency data (Ch 8)

Dr. Z

2026