Let’s apply what we’ve learned in Chapter 8 to R.
Objectives
New Commands:
| R Command | Notes |
|---|---|
| table(df) | creates a frequency table of your dataframe; df is the name of your dataframe; if your dataframe has more than one column, you need specify each column using df$var |
| chisq.test(x, p, rescale.p = T) | runs a chi-squared test; x is your observed data, p is your expected probabilities or values, and rescale.p = T rescales probabilities if they do not equal 1. Do not use for Poisson p-value. |
| pchisq(q, df, lower.tail = F) | use if you know your chi-squared value AND/OR for the correct p-value for Poisson distributions; q is your chi-squared test statistic, df is your degrees of freedom, and lower.tail = F specifies it is a one-tailed test |
| dpois(x, lambda) | calculates expected probabilities for Poisson distributions; x is a vector of category values (e.g. 0:20), and lambda is the mean expected frequency (µ) |
Reminder: Save your script for practice at home! :)
Shortcuts:
| Symbol or Command | Keyboard Shortcut |
|---|---|
| <- | Alt + - |
| # | Shift + 3 |
| Run one line in script | Ctrl + Enter |
| Run entire script | Ctrl + Shift + Enter |
| Open new script | Ctrl + Shift + N |
Exercise 6
1. Open RStudio and prepare a new script.
# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)
# Description: (describe what this script is for)
2. We will learn how to conduct a chi-squared test under the proportional model.
Pt I: Conducting a chi-squared test
There are two ways to conduct a chi-squared test under the proportional model in RStudio. We will review both methods.
a) Method I: You do NOT know X^2
If you do not know or have yet to calculate X^2, you can use the function: chisq.test( )
You only need to enter in three pieces of information:
b) Method II: You already know X^2
If you already know or calculated X^2 by hand, you can use the function: pchisq( )
You only need to enter in three pieces of information:
Note: Chi-squared tests are always one-tailed!
Pt II: Method I with M and Ms
M and M’s are drawn from a bag and the colour of each M and M is recorded.
MMlist <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/MandMlist.csv")
MMtable <- table(MMlist$color)
MMtable
##
## blue brown green orange red yellow
## 10 5 12 11 6 11
obs <- c(10,5,12,11,6,11)
obs
## [1] 10 5 12 11 6 11
exp <- c(0.24,0.14,0.16,0.20,0.13,0.13)
exp
## [1] 0.24 0.14 0.16 0.20 0.13 0.13
exp*sum(MMtable)
chisq.test(obs, p = exp, rescale.p = T)
Understanding your output
Let’s read through the output. The output tells you a number of things about your data:
So, from our output, we know that:
We can conclude: “p > 0.05, therefore our results are statistically insignificant, and we fail to reject the H0. Our data match our null expectations, and are likely to occur due to chance. We conclude that the frequency of colours of M and Ms in our bag matches the expectations of the M and M company.”
Pt III: Method II with M and Ms
If you already calculated X^2 in R or by hand, you can use the function pchisq. All you need is your chi-squared test statistic (q), your degrees of freedom (df), and to specify that the lower tail is FALSE.
pchisq(5.1442, df = 5, lower.tail = F)
Understanding your output
The output of pchisq is one number. This is your P -value. You will end up with the same P-value as you did with the output of chisq.test, since the df for both is the same.
So, from our output, we know that our P -value = 0.3985.
We can conclude: “p > 0.05, therefore our results are statistically insignificant, and we fail to reject the H0. Our data match our null expectations, and are likely to occur due to chance. We conclude that the frequency of colours of M and Ms in our bag matches the expectations of the M and M company.”
3. We will learn how to conduct a chi-squared test under the Poisson model.
It is thought that the World Cup decides the best soccer team in the world. But how much do skill differences determine the outcome? We will look at data on number of goals scored by one or two teams (randomly chosen) in every game of the knockout round of the World Cup from 1986 through 2010 to determine if the number of goals scored is random (follows a Poisson distribution).
worldcup <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/WorldCup.csv")
head(worldcup)
cuptable <- table(worldcup$score)
cuptable
obs <- c(32, 44, 21, 10, 4, 1)
obs
## [1] 32 44 21 10 4 1
mean(worldcup$score) #this is µ or lambda
exp <- dpois(x = 0:5, lambda = 1.223214)
exp
chisq.test(obs, p = exp, rescale.p = T)
## Warning in chisq.test(obs, p = exp, rescale.p = T): Chi-squared approximation
## may be incorrect
Notice an error has occurred! And p = NA! This error means we have violated an assumption of the chi-squared test. Some categories have expected values that are too small. How can you tell what those expected values are?
expfreq <- nrow(worldcup)*exp
expfreq
We see that one category has an expected value that is less than 1 (5 goals), and more than 20% of our expected values are less than 5 (4 goals and 5 goals). We need to combine some categories together to meet the assumptions of the chi-squared test. We can combine categories for 4 and 5, so the expected value is more than 1, and less than 20% of our expected values are less than 5.
cuptable #check original observed values from table
4 + 1 #find the sum of category 4 and category 5
obs <- c(32, 44, 21, 10, 5) #observed values
exp #check original expected proportions
0.027451333+0.006715771 #find the sum of category 4 and category 5
exp <- c(0.294282820, 0.359970866, 0.220160701, 0.089767884, 0.0341671) #expected values
exp
chisq.test(obs, p = exp, rescale.p = T)
## Warning in chisq.test(obs, p = exp, rescale.p = T): Chi-squared approximation
## may be incorrect
We can ignore the error this time because we know our expected values meet assumptions. It is just warning us that we have one expected value less than 5. This is fine in this case, because we have 5 observations, so 1/5 = only 20% of observations have an expected value of less than 5. We still meet assumptions.
pchisq(1.2647, df = 3, lower.tail = F)
So, from both outputs, we know that:
We can conclude: "p > 0.05, therefore our results are statistically insignificant, and we fail to reject the H0. We conclude that the number of goals scored in the knockout round of the World Cup follows a Poisson distribution.
Is this supported by the variance? Recall, if the variance = mean then data follow a Poisson distribution.
var(worldcup$score)
The variance = 1.256, which is close to the mean of 1.223. Additionally, the ratio of variance to mean = 1.02, which is very close to 1. We thus expect the data to follow a Poisson distribution.
Now it’s time to practice on your own.
| Day of week | # of birthdays |
|---|---|
| Monday | 33 |
| Tuesday | 41 |
| Wednesday | 63 |
| Thursday | 63 |
| Friday | 47 |
| Saturday | 56 |
| Sunday | 47 |
| Total | 350 |
a\. Enter the observed data (counts of birthdays across weekdays)
b\. Assume we expect equal numbers of birthdays on all days of the week. What is your null hypothesis?
c\. Conduct a chi-squared test to test this hypothesis. Note, if we expect equal numbers of birthdays on all days of the week, we do not need to enter "p" in the chi-squared test.
d\. Do you reject or fail to reject your H0? What is your conclusion statement?