Let’s apply what we’ve learned in Chapter 16 to R.
Objectives
New Commands:
| Arguments | Notes |
|---|---|
| df | df is name of dataframe |
| x | x is one numerical variable |
| y | y is the other numerical variable |
| anova_name | a name for your anova, can be anything |
| R Command | Notes |
|---|---|
| cor.test(df$x, df$y) | tests a correlation for your data |
| cor.test(df$x, df$y, method = “spearman”) | nonparametric alternative to correlation test (Spearman’s rank correlation) |
| plot(df$x, df$y) | plots a scatterplot for your two numerical variables |
Reminder: Save your script for practice at home! :)
Shortcuts:
| Symbol or Command | Keyboard Shortcut |
|---|---|
| <- | Alt + - |
| # | Shift + 3 |
| Run one line in script | Ctrl + Enter |
| Run entire script | Ctrl + Shift + Enter |
| Open new script | Ctrl + Shift + N |
Exercise 12
1. Open RStudio and prepare a new script.
Open a new script. All of your code for today’s exercises, and your notes and comments will go in this script. Write your filename, title, author, date, and description of the script.
# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)
# Description: (describe what this script is for)
Pt I: Calculating the correlation coefficient and testing for significance
Correlations quantify the “scatter” between two numerical variables. We quantify this with the correlation coefficient, r. Variables are strongly correlated if r is close to 1 (Low “scatter”), and weakly correlated if r is close to 0 (high “scatter”).
In R, we use the function cor.test( ). Once you read in your data, you need the following information:
Let’s practice through example.
Example: Let’s use the dataset from Chapter 2 on ornamentation in guppies. Brooks et al. (2000) tested the inheritance of “attractive” traits from fathers to sons. For 36 father-son pairs, they quantified the ornamentation of the fathers (colour and brightness), and the attractiveness of their sons (rate of female visits). Is a father’s ornamentation associated with their son’s attractiveness?
Step 1: Read in the data (available here: https://github.com/lczawadzki/biostats/raw/main/data/guppyFatherSon.csv)
guppy <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/guppyFatherSon.csv")
Step 2: Run a categorical test with the function cor.test( ). How strong is the correlation between ornamentation in fathers and attractiveness in sons? Is the relationship statistically significant?
cor.test(guppy$fatherOrnamentation, guppy$sonAttractiveness)
Step 3: Interpret your output and state your conclusion.
Let’s read through the output. The output tells you a number of things about your data:
So, from our output, the first thing we notice is that R ran a “Pearson’s product-moment correlation”, which is the correlation test in R. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:
The point of running a correlation test is two-fold: (1) to quantify the relationship with a correlation coefficient, and (2) to test if there is a linear association between the two variables. What conclusions can you draw from your output?
(1) “The correlation coefficient, r = 0.614. There is a strong positive correlation between father ornamentation and son attractiveness in guppies.”
(2) “P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data do NOT meet our null expectations. We conclude that there is a significant association between father ornamentation and son attractiveness.”
Pt II: Nonparametric methods - Spearman’s rank
If our data are not normally distributed, and data transformations do not make our data follow a normal distribution, we can use the Spearman’s rank correlation, which is the alternative nonparametric method to the correlation test.
Let’s practice through example.
Example: In human populations, urbanization of settlements causes disease transmission to increase. As disease transmission increases, so does the evolution of resistance to diseases. A mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) studied the frequency of this resistant allele in towns and villages in Europe and Asia compared to the duration of these settlements to ask whether there was an association between frequency of alleles and settlement duration.
Step 1: Read in the data (available here: https://github.com/lczawadzki/biostats/raw/main/data/TBResistance.csv)
tb <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/TBResistance.csv")
Step 2: Check that both numerical variables are normal using a histogram and QQ plot.
hist(tb$alleleFrequency)
qqnorm(tb$alleleFrequency)
qqline(tb$alleleFrequency)
hist(tb$duration)
qqnorm(tb$duration)
qqline(tb$duration)
Both histograms show obvious skew, and the QQ plots have significant curvature off of the line. The data are not normally distributed.
Step 3: Since the data are NOT normal, we need to run a Spearman’s rank correlation test instead of the parametric test. Run this test to ask: how strong is the correlation between frequency of the resistant allele and duration of settlement? Is the relationship statistically significant? You can ignore the warning.
cor.test(tb$alleleFrequency, tb$duration, method = "spearman")
## Warning in cor.test.default(tb$alleleFrequency, tb$duration, method =
## "spearman"): Cannot compute exact p-value with ties
Step 4: Interpret your output and state your conclusion.
Let’s read through the output.
The output tells you a number of things about your data:
So, from our output, the first thing we notice is that R ran a “Spearman’s rank correlation rho”, which is the correlation test in R. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:
Again, the point of running a correlation test is two-fold: (1) to quantify the relationship with a correlation coefficient, and (2) to test if there is a linear association between the two variables. What conclusions can you draw from your output?
(1) “The correlation coefficient, \(r_s\) = 0.715. There is a strong positive correlation between frequency of the resistant allele and duration of settlement.”
(2) “P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data do NOT meet our null expectations. We conclude that there is a significant association between between frequency of the resistant allele and duration of settlement.”
Pt III: Plotting scatterplots
We can visualize these relationships with a scatterplot! Remember, the best graph for two numerical variables is a scatterplot.
To do this, we use the function plot( ), and we need two pieces of information:
Let’s make one for both examples:
1. Guppy dataset
plot(guppy$sonAttractiveness, guppy$fatherOrnamentation)
2. TB dataset
plot(tb$duration, tb$alleleFrequency)
We can change the shape of each datapoint with the argument “pch”, and the color of each datapoint with the argument, “col”. Let’s try this.
plot(guppy$sonAttractiveness, guppy$fatherOrnamentation,
pch = 20, col = "red")
plot(tb$duration, tb$alleleFrequency,
pch = 20, col = "red")
Now it’s time to practice on your own.
Larger animals tend to have larger brains. But is the increase in brain size proportional to the increase in body size? A set of data on body and brain size of 62 mammal species was collated by Allison and Cicchetti (1976), and these data are in the data set “mammals.csv”. The file contains columns giving the species name, the average body mass (in kg) and average brain size (in g) for each species. Data on this study are here: https://github.com/lczawadzki/biostats/raw/main/data/mammals.csv.
a. Plot brain size against body size. Is the relationship linear?
b. Based on part (a), which test should we run? Run this test. Is there statistical evidence that brain size is correlated with body size?
We can make the data in Question 1 bivariate normal with a log transformation. To use log-transformed data in a hypothesis test, we can write the variables in as log(df$x) rather than just df$x. Using the log-transformed data, complete the following:
a. Run the appropriate test in R to ask if there is statistical evidence that brain size is correlated with body size.
b. What is the correlation coefficient? What does this tell us about the strength of the correlation?