R Exercise 12: Comparing more than two means (Ch 15)

Let’s apply what we’ve learned in Chapter 15 to R.

Objectives

Conduct an ANOVA in R
Use nonparametric methods for non-normal data (Kruskal-Wallis test)
Run a post-hoc test (Tukey-Kramer test)

New Commands:

Arguments	Notes
df	df is name of dataframe
x	x is your numerical variable
y	y is your categorical variable
anova_name	a name for your anova, can be anything

R Command	Notes
anova_name <- aov(x ~ y, data = df)	runs an analysis of variance (ANOVA) for your data
summary(anova_name)	prints ANOVA table for your ANOVA
kruskal.test(x ~ y, data = df)	nonparametric alternative to the ANOVA
TukeyHSD(anova_name)	runs post-hoc pairwise comparisons for your data

Reminder: Save your script for practice at home! :)

Shortcuts:

Symbol or Command	Keyboard Shortcut
<-	Alt + -
#	Shift + 3
Run one line in script	Ctrl + Enter
Run entire script	Ctrl + Shift + Enter
Open new script	Ctrl + Shift + N

Exercise 12
1. Open RStudio and prepare a new script.

Open a new script. All of your code for today’s exercises, and your notes and comments will go in this script. Write your filename, title, author, date, and description of the script.

# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)

# Description: (describe what this script is for)

Pt I: ANOVAs
Like t-tests, ANOVAs compare the means of groups. However, where t-tests compare one mean to a null (one-sample, paired), or the means of two groups to each other (two-sample), ANOVAs compare means of more than two groups.

In R, we use the function _aov( ). Once you read in your data, you need the following information:

x - your numerical variable
y - your categorical variable

Let’s practice through example.

Example: Passengers that traveled on the Titanic were able to purchase either first class, second class, or third class tickets. Ticket cost depended on the class, with first class tickets being the most expensive. Does ticket class impact age of individuals that purchased the tickets?

Step 1: Read in the data (available here: https://github.com/lczawadzki/biostats/raw/main/data/titanic_var.csv)

titanic <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/titanic_var.csv")

Step 2: Check your categorical variable passenger_class to make sure you have more than two groups. How many groups do we have?

str(titanic)   ## one way to look at dataframe
unique(titanic$passenger_class)   ##provides unique groups for a variable

From the dataframe structure, we see that passenger_class is a factor w/ 3 levels, meaning it is a categorical variable with 3 groups. Also, we verify this with the unique function and see that there are 3 unique groups of passenger_class. We can use an ANOVA to answer our question.

Step 3: Once you are sure you have 1 numerical variable, and 1 categorical variable with more than two groups, you can use the function aov( ) to run your ANOVA. Be sure to assign aov( ) to an anova_name. Let’s name our anova titanic_age.

titanic_age <- aov(age~passenger_class, data = titanic)

Step 4: Use summary( ) to print your ANOVA table.

summary(titanic_age)

Step 5: Interpret your output and state your conclusion.

Let’s read through the output. The output in this case is our ANOVA table. The table has 2 rows and 6 columns.

Row 1: This row will give you information on the variance among groups.
Row 2: This row will give you information on the variance within groups or our error. R refers to this as our “Residuals”.
Column 1: whether the data refers to groups or error (Residuals)
Column 2: Df = degrees of freedom
Column 3: Sum Sq = sum of squares. Recall, sum of squares is calculated as the squared value of your deviations. These deviations are either (1) between each observation’s group mean and the grand mean (among groups), or (2) between the observation and the group mean (within groups).
Column 4: Mean Sq = Mean Squares Group (\(MS_{\text group}\); row 1) or Mean Squares Error (\(MS_{\text error}\); row 2). Recall, \(MS_{\text group}\) is calculated as \(SS_{\text group}/df_{\text group}\) and \(MS_{\text error}\) is calculated as \(SS_{\text error}/df_{\text error}\).
Column 5: F value = your test statistic, F.
Column 6: Pr(>F) = your p-value. Significance codes for p-values are noted under the table (*** = < 0.01, ** = < 0.01, * = < 0.05, . = < 0.1, nothing = < 1).

Now, let’s look at our output by row:

Row 1: passenger_class

Passenger class has 2 degrees of freedom (df); (k - 1 = 3 - 1 = 2).
Sum of Squares of the group deviations (\(SS_{\text group}\)) = 26,690.
Mean Squares Group (\(MS_{\text group}\)) = 13,345. This is the variation among groups.
The test statistic F = 75.9.
Our P -value < 2e-16.

Row 2: Residuals

The residuals have 630 degrees of freedom (df); (n - 1 = 631 - 1 = 630).
Sum of Squares of the error deviations (\(SS_{\text error}\)) = 110,764.
Mean Squares Error (\(MS_{\text error}\)) = 176. This is the variation within groups.

Now, the point of running an ANOVA is to test if there is a difference between the means of the groups. This is a hypothesis test! Running the test alone is insufficient, we now need to draw our conclusions. What conclusions can you draw from your output?

" P < 0.05, therefore our results are statistically significant, and we reject the H0. We conclude that not all groups have equal means. Our data deviate from our null expectations, and are not due to chance. At least one group mean differs from the others. Age differs among ticket class."

Pt II: Pairwise Comparisons What if you want to know which groups are different? Here, we would conduct a post-hoc test known as the Tukey-Kramer test.

Let’s continue with the Titanic example.

Step 1: Use TukeyHSD( ) to run your post-hoc tests.

TukeyHSD(titanic_age)

Step 2: Interpret your output and state your conclusion.

First, our output gives us information on the test and the data used.
1. Line 1 tells you the ANOVA you ran before the Tukey-Kramer test, referred to as the “Fit”.
2. Line 2 tells you your categorical variable.

Then, the output presents information for each pairwise comparison of our dataset as a pairwise-comparison table. The table has 3 rows and 5 columns.

Rows: Each row gives you information on a pairwise-comparison. If you have 3 groups, you will have 3 pairwise comparisons = 3 rows. If you have 4 groups, you will have 6 pairwise comparisons = 6 rows, and so on.
Column 1: which groups are being compared
Column 2: diff = difference in means between compared groups
Column 3: lwr = lower bound of 95% confidence interval for the difference between group means
Column 4: upr = upper bound of 95% confidence interval for the difference between group means
Column 5: p adj = your p-value, adjusted for the number of tests, hence “p adj”

So, from our output, the first thing we notice is that R ran a “Tukey multiple comparisons of means”. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:

Our ANOVA we ran before running a Tukey test is aov(age~passenger_class, data = titanic).
The categorical variable that we are running pairwise comparisons for is passenger_class.

Now, let’s look at our output by row. Note, we have 3 rows –> 3 groups = 3 pairwise comparisons = 3 rows. Row 1 is the comparison between 2nd class and 1st class tickets, Row 2 is the comparison between 3rd class and 1st class tickets, and Row 3 is the comparison between 3rd class and 2nd class tickets.

Row 1: 2nd-1st
This row compares 2nd class and 1st class tickets.

The difference in mean ages between 2nd class and 1st class ticket holders is -11.37 years.
The lower bound of the 95% confidence interval for mean age difference of these two groups is -14.35 years.
The upper bound of the 95% confidence interval for mean age difference of these two groups is -8.39 years.
Our P -value = 0.00.

Row 2: 3rd-1st
This row compares 3rd class and 1st class tickets.

The difference in mean ages between 3rd class and 1st class ticket holders is -15.15 years.
The lower bound of the 95% confidence interval for mean age difference of these two groups is -18.19 years.
The upper bound of the 95% confidence interval for mean age difference of these two groups is -12.10 years.
Our P -value = 0.00.

Row 3: 3rd-2nd
This row compares 3rd class and 1st class tickets.

The difference in mean ages between 3rd class and 1st class ticket holders is -3.78 years.
The lower bound of the 95% confidence interval for mean age difference of these two groups is -6.87 years.
The upper bound of the 95% confidence interval for mean age difference of these two groups is -0.69 years.
Our P -value = 0.012.

Now, the point of running the Tukey-Kramer test is to find out which groups differ. What conclusions can you draw from your output?

" Group 1 and Group 2: P < 0.05, therefore our results are statistically significant, and we reject the H0. The mean of Group 1 differs from the mean of Group 2."
" Group 1 and Group 3: P < 0.05, therefore our results are statistically significant, and we reject the H0. The mean of Group 1 differs from the mean of Group 2."
" Group 2 and Group 3: P < 0.05, therefore our results are statistically significant, and we reject the H0. The mean of Group 1 differs from the mean of Group 2."

’Mean age differs between 1st class and 2nd class passengers, 1st class and 3rd class passengers, and 2nd class and 3rd class passengers."

Pt III: Nonparametric methods
For argument’s sake, let’s say you plot three histograms, one for each age category, and you find out the data are not normal. What now? You can use the Kruskal-Wallis test, an alternative to the ANOVA, using kruskal.test( ). Try it out below.

kruskal.test(age~passenger_class, data = titanic)

Let’s read through the output. The output tells you a number of things about your data:

Line 1 tells you what data you used for the test.
Line 2 states specifically your test statistic (Kruskal-Wallis chi-squared), the degrees of freedom (df), and your P -value. Remember, the P-value is the probability of obtaining a mean difference as extreme, or more extreme, than your sample mean difference. That is, it is the probability of obtaining your mean difference under the null hypothesis.

So, from our output, the first thing we notice is that R ran a “Kruskal-Wallis rank sum test”. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:

Our data is the age (numerical variable) by__ passenger_class (categorical variable)__.
Our test statistic (Kruskal-Wallis chi-squared) is 116.08, our degrees of freedom (df) are 2, and our P -value < 2.2e-16.

What conclusions can you draw from your output?
" P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data deviate from our null expectations, and are not due to chance. At least one group mean differs from the others. Age differs among ticket class."

Now it’s time to practice on your own.

Common cuckoos (Cuculus canorus), a passerine native to Eurasia, uses a parasitic breeding strategy to avoid raising their own young. Instead of laying eggs in their own eggs, cuckoos lay their eggs in the nests of other species, for example, Eurasian Reed Warblers (Acrocephalus scirpaceus). Studies have shown that cuckoos have evolved to lay eggs that are similar in colour to the eggs of the host species, but what about egg size? Do cuckoos lay eggs similar in size to the eggs of their hosts? Data on length of cuckoo eggs laid in nests of varying host species can be found here: https://github.com/lczawadzki/biostats/raw/main/data/cuckooeggs.csv.

a. How many groups of host species are there in this dataset? b. Report your null hypothesis. c. Use ANOVA to test for a difference between host species in the mean size of the cuckoo eggs in their nests. What is your conclusion (statistical decision, and final conclusion in language relevant to the problem)? d. Assuming that your ANOVA rejected the null hypothesis, use a Tukey-Kramer test to decide which pairs of host species are significantly different from each other in cuckoo egg mean length. What is your conclusion?
Animals infected with pathogens often experience a disturbance in their circadian rhythm (i.e. their endogenous daily cycle). Shirasu-Hiza et al. (2007) tested whether the mechanism of circadian timing itself could have an effect on disease. They tested this idea by sampling three groups of fruit flies – one “normal”, one with a mutation in the timing gene tim01, and one group that was heterozygous for the tim01 mutation. Flies were exposed to a dangerous bacteria, Streptococcus pnuemoniae, and the number of days that they lived after exposure were recorded. Data on this study are here: https://github.com/lczawadzki/biostats/raw/main/data/circadian_fly.csv.

Plot a histogram of each of the three groups. Do these data match the assumptions of an ANOVA?

Use a Kruskal-Wallis test to ask whether lifespan differs between the three groups of flies.

a\. Plot a separate histogram for each fly group. Do these data match assumptions of an ANOVA?  
b\. Use a Kruskal-Wallis test to ask whether lifespan differs between the three groups of flies. What is your conclusion (statistical decision, and final conclusion in language relevant to the problem)?

R Exercise 12: Comparing more than two means (Ch 15)

Dr. Z

2026