Let’s apply what we’ve learned in Chapter 12 to R.
Objectives
New (and old) Commands:
| Arguments | Notes |
|---|---|
| df$var | df is name of dataframe, var is the column of interest |
| R Command | Notes |
|---|---|
| hist(df$var) | Makes a histogram of a variable |
| qqnorm(df$var) | Makes QQ plot of a variable |
| qqline(df$var) | Adds a line to your QQ plot for your specific variable |
| shapiro.test(df$var) | Shapiro-Wilk test for normality; if P > 0.05, we do not reject the null hypothesis that the data follow a normal distribution |
| log(df$var) | Natural log transformation (ln) of a variable |
| asin(sqrt(df$var)) | arcsine transformation of a variable |
| binom.test(x, n, p = 0.5) | Sign test for non-normal data (paired/one-sample alternative), where x is the number of successes, n is the number of trials, and p = 0.5 since we expect half the data to fall above and half below the null constant |
| wilcox.test(x, y) | Mann-Whitney U test for non-normal data (two-sample alternative), where x is the variable from the first group/treatment, and y is the variable from the second group/treatment |
Reminder: Save your script for practice at home! :)
Shortcuts:
| Symbol or Command | Keyboard Shortcut |
|---|---|
| <- | Alt + - |
| # | Shift + 3 |
| Run one line in script | Ctrl + Enter |
| Run entire script | Ctrl + Shift + Enter |
| Open new script | Ctrl + Shift + N |
Exercise 11
1. Open RStudio and prepare a new script.
Open a new script. All of your code for today’s exercises, along with your notes and comments, will go in this script. Write your filename, title, author, date, and description of the script.
# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)
# Description: (describe what this script is for)
Pt I: Assessing Normality
t-tests rely on the assumption that data are normally distributed. These tests are fairly robust to minor deviations from the normal (especially with large sample sizes), so our data only need to be approximately normally distributed. However, if data deviate strongly from the normal distribution, we cannot use a t-test, and need to use an alternative option. Therefore, we need to check whether our data are normally distributed before we proceed with hypothesis testing. If they are normal, we can continue with our t-test. If they are not normal, we’ll use another method.
a) Graphical Methods
Visually, we can assess normality with histograms and QQ plots. Let’s try both.
a1) Histograms
Let’s start with data from the Titanic. This dataset contains information on the ticket class, age, sex, and survival of all passengers aboard the Titanic. Data are available here: https://github.com/lczawadzki/biostats/raw/main/data/titanic_var.csv.
titanic <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/titanic_var.csv")
hist(titanic$age)
Our histogram shows some asymmetry, but it has a single, clear mode near the center of the distribution. Non-normal data, on the other hand, would show an obvious strong skew or strong bimodality. Therefore, the age variable is approximately normal and does not violate the assumption. We would be able to use a parametric test with this dataset.
Let’s practice with another dataset, this time on biomass ratio in 32 marine reserves from Halpern (2003). Data are available here: https://github.com/lczawadzki/biostats/raw/main/data/marinereserve.csv.
biomass <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/marinereserve.csv")
hist(biomass$biomassRatio)
Compared to the Titanic data, this histogram is very clearly right-skewed. From our plot, we can see that this dataset does NOT follow a normal distribution. We would not be able to use a parametric test, and would need to turn to another option (i.e., transformation or a nonparametric test).
a2) QQ Plots
Now, let’s visualize normality with QQ plots using the same data.
qqnorm(titanic$age) ##creates QQ plot
qqline(titanic$age) ##adds line to QQ plot
Looking at our QQ plot, notice we have some wiggle around the straight line, but nothing substantial. Non-normal data would have substantial curvature over a large range of values or big jumps in the distribution. Therefore, the age of passengers aboard the Titanic follows a normal distribution. We would be able to use a parametric test with this dataset.
qqnorm(biomass$biomassRatio)
qqline(biomass$biomassRatio)
Compared to the Titanic data, notice this QQ plot has substantial curvature over a large range of values, particularly on the right-hand side. This is expected since our histogram showed that the data were clearly right-skewed. Biomass ratio does NOT follow a normal distribution. We would not be able to use a parametric test, and would need to turn to another option (i.e. transformation or a nonparametric test).
b) Statistical Methods (Shapiro-Wilk test)
We can also assess normality statistically. Let’s try this with the data on biomass ratio in 32 marine reserves.
shapiro.test(biomass$biomassRatio)
The P-value is < 0.05, therefore, we reject the H0 of normality. Our data are not normal.
Pt II: Data Transformations
If data are not normal, we can first try to transform the data to achieve normality. The most common method for biological data is a log transformation (natural log; ln). Other methods include the arcsine transformation.
a) Log transformation
We use log transformations when data are ratios or products of variables, when the frequency distribution is skewed to the right, when variance increases as the mean gets larger (in comparisons across groups), or when the data span several orders of magnitude.
Let’s practice with the biomass dataset. Let’s transform our data, and then check for normality with a histogram and QQ plot.
logbiomass <- log(biomass$biomassRatio)
hist(logbiomass)
Our histogram isn’t perfect, but, like the Titanic data, it has a single clear mode near the center of the distribution and is not strongly skewed. Therefore, we can say that the log transformation of biomass ratio approximates a normal distribution, and we can run parametric tests with the transformed data.
qqnorm(logbiomass)
qqline(logbiomass)
It’s definitely better. Not perfect, but better. Normal QQ plots will always have some wiggle. We would be able to proceed with this.
Pt III: Nonparametric methods
If data transformations do not make our data follow a normal distribution, there are alternative nonparametric methods we can use to run hypothesis tests with our data. We can use the Sign test in place of a paired/one-sample t-test, and the Mann-Whitney U test in place of the two-sample t-test.
a) Sign test (paired and one-sample alternative)
The sign test ranks all of the data as being above or below a certain null constant. Data are either greater than the constant (+) or less than the constant (-). The + values are counted, as are the - values, and we can then run a binomial test where the + values are our “successes” and the total + and - values are the number of trials.
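This counting can be sketched directly in R. The vector below is made up purely for illustration; with the biomass data from earlier you would replace vals with biomass$biomassRatio:

```r
# Sign test mechanics, shown on a toy vector (made-up values).
# With the biomass data, replace vals with biomass$biomassRatio.
vals <- c(1.4, 2.1, 0.9, 1.8, 3.0, 1.2, 1.6, 2.4)
null_const <- 1                    # the null constant
x <- sum(vals > null_const)        # count of "+" signs = successes
n <- sum(vals != null_const)       # trials (ties with the constant are dropped)
binom.test(x, n, p = 0.5)          # p = 0.5: half expected above, half below
```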
Let’s practice with our biomass dataset. Scientists hypothesized that the biomass ratio in marine reserves should be equal to 1, therefore our null constant for this example is 1. 31 values were above 1 (+), and 1 was below 1 (-). That is, we have x = 31 successes out of n = 32 trials. We expect half of our data to fall above 1 and half below 1, so p = 0.5.
binom.test(x = 31, n = 32, p = 0.5)
We see that the P-value < 2.2e-16, so we reject the H0 that the biomass ratio of marine reserves equals 1.
The point of running a Sign test is to test if there is an effect. This is a hypothesis test! Running the test alone is insufficient; we now need to draw our conclusions. What conclusions can you draw from your output? “P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data deviate from the null expectation more than chance alone would explain. We conclude that the biomass ratio in marine reserves is greater than 1; that is, biomass is higher inside reserves than in unprotected areas.”
b) Mann-Whitney U test (two-sample alternative)
The Mann-Whitney U test ranks all of the data from smallest to largest, and sums ranks for all individuals within each group.
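The ranking step can be sketched in R with toy data (the values below are made up for illustration); wilcox.test() does all of this internally:

```r
# Rank-sum mechanics behind the Mann-Whitney U test (toy data).
group1 <- c(2.1, 3.5, 4.0)
group2 <- c(1.2, 2.8, 5.1)
ranks <- rank(c(group1, group2))       # rank all values, smallest to largest
R1 <- sum(ranks[seq_along(group1)])    # rank sum for group 1
R2 <- sum(ranks[-seq_along(group1)])   # rank sum for group 2
# wilcox.test() computes the U statistic from these rank sums:
wilcox.test(group1, group2)
```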
Let’s practice with data on sage crickets. Johnson et al. (1999) wanted to test whether females that had been starved for at least two days were faster to mate than those that were fed during the same period.
crickets <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/crickets.csv")
hist(crickets$starved)
hist(crickets$fed)
We can clearly see that each group has a strong right skew. We cannot use parametric methods. Data also do not become normal when log-transformed, so we need to use a nonparametric test.
wilcox.test(crickets$starved, crickets$fed)
Let’s read through the output. The first thing we notice is that R ran a “Wilcoxon rank sum test”, which is R’s name for the Mann-Whitney U test, so we ran exactly the test we wanted. The rest of the output reports the data used, the test statistic (W), the P-value, and the alternative hypothesis.
The point of running a Mann-Whitney U test is to test if there is an effect. This is a hypothesis test! Running the test alone is insufficient; we now need to draw our conclusions. What conclusions can you draw from your output? “P > 0.05, therefore our results are NOT statistically significant, and we fail to reject the H0. Our data are consistent with the null expectation. We conclude that there is no difference in time to mating between starved and fed females; i.e., the shape of each distribution is the same.”
Steps Recap:
1. Is this a paired or unpaired design?
2. Are our data normal? Yes - step 3. No - step 4.
3. Run a parametric test based on design type. Paired - paired t-test. Unpaired - two-sample t-test (check for equal variances first!).
4. Does transforming the data make it normal? Yes - parametric test with transformed data. No - step 5.
5. Run a non-parametric test based on design type. Paired - Sign test. Unpaired - Mann-Whitney U test.
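Step 3 mentions checking for equal variances before a two-sample t-test; in R, this F-test is run with var.test(). A minimal sketch with simulated data (not from any dataset above):

```r
# Equal-variance check for step 3, sketched with simulated data.
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 2)
g2 <- rnorm(20, mean = 12, sd = 2)
res <- var.test(g1, g2)   # F-test of equal variances
res                       # if P > 0.05, we do not reject equal variances
```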
Now it’s time to practice on your own.
It can be difficult to judge whether a QQ plot looks normal without visual experience. The best way to build that experience is to plot many QQ plots of truly normal samples, to see the variety of shapes that are still considered “normal”.
a. Generate a list of 10 random numbers from a normal distribution with mean 15 and standard deviation 3, using the following command: norm <- rnorm(n = 10, mean = 15, sd = 3)
b. Use hist() to plot a histogram of these numbers from part a.
c. Use qqnorm() to plot a QQ plot from the numbers in part a.
d. Repeat (a) through (c) at least a dozen times. For each, look at the histograms and QQ plots. Think about the ways in which these look different from your expectations of a normal distribution graph. Remember, all of these samples come from a truly normal population, so these are simulated examples of what real normal data look like!
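Steps (a) through (c) can be repeated quickly with a loop; this is just one possible sketch:

```r
# Repeat the simulate-and-plot cycle a dozen times (one possible sketch).
for (i in 1:12) {
  norm <- rnorm(n = 10, mean = 15, sd = 3)    # a truly normal sample
  hist(norm, main = paste("Sample", i))
  qqnorm(norm, main = paste("QQ plot, sample", i))
  qqline(norm)
}
```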
In 1898, after a severe winter storm in Providence, Rhode Island, hundreds of house sparrows (Passer domesticus) were found immobilized on the ground. 136 were brought to the Anatomical Laboratory of Brown University, where Hermon Bumpus (1898) studied them. 76 recovered, while 64 died. Bumpus identified the sex and age of each individual, noted whether they survived or died, and took a series of morphological measurements, including weight, skull width, and femur length. He used this data to ask questions about natural selection, and whether natural selection acted on any of the measured traits in these birds. Data can be found here: https://github.com/lczawadzki/biostats/raw/main/data/bumpus.csv. Did natural selection act on body weight of house sparrows (i.e. do birds that survived differ in mean body weight from those that died)?
a. Is this a paired or two-sample design?
b. Do the distributions for each group (survived vs. dead) look approximately normal? Note: Our data are grouped into a numerical variable column (weight_g) and a categorical variable column (survival). To plot separate histograms for survived and died, use the following notation: hist(df$numvar[which(df$catvar == "category")])
c. Are the variances between groups equal? Run an F-test.
d. Based on (a) (b) and (c), which test should you run?
e. Report your null hypothesis, run your test in R, and state your conclusion (statistical decision and in language relevant to the problem).
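The subsetting pattern in (b) can be tried out on simulated data before applying it to the Bumpus dataset. The column names weight and group below are made up for illustration only:

```r
# The hist(df$numvar[which(df$catvar == "category")]) pattern on simulated data.
# Column names weight and group are made up for illustration only.
set.seed(2)
df <- data.frame(weight = rnorm(40, mean = 25, sd = 3),
                 group  = rep(c("survived", "died"), each = 20))
hist(df$weight[which(df$group == "survived")])
hist(df$weight[which(df$group == "died")])
```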
Swimmers often get ear infections from water remaining trapped in the ear canal. 14 swimmers were given a medication that claimed to reduce ear infections. The number of ear infections they developed before starting the medication, and the number of ear infections they developed after starting the medication, were recorded. Data can be found here: https://github.com/lczawadzki/biostats/raw/main/data/swimmers.csv. Does this medication reduce ear infections in swimmers?
a. Are the data normally distributed? Note: this is a paired study design, so the DIFFERENCES between groups should be normally distributed, NOT the groups themselves.
b. Based on your histogram, which test should you run?
c. Report your null hypothesis, run your test in R, and state your conclusion (statistical decision and in language relevant to the problem).