Let’s apply what we’ve learned in Chapter 16 to R.

Objectives

New Commands:

Arguments Notes
df df is the name of your dataframe
x x is one numerical variable
y y is the other numerical variable
R Command Notes
cor.test(df$x, df$y) tests a correlation for your data
cor.test(df$x, df$y, method = "spearman") nonparametric alternative to the correlation test (Spearman's rank correlation)
plot(df$x, df$y) plots a scatterplot for your two numerical variables


Reminder: Save your script for practice at home! :)

Shortcuts:

Symbol or Command Keyboard Shortcut
<- Alt + -
# Shift + 3
Run one line in script Ctrl + Enter
Run entire script Ctrl + Shift + Enter
Open new script Ctrl + Shift + N


Exercise 12
1. Open RStudio and prepare a new script.

Open a new script. All of your code for today's exercises, along with your notes and comments, will go in this script. Write your filename, title, author, date, and description of the script.

# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)

# Description: (describe what this script is for)


Pt I: Calculating the correlation coefficient and testing for significance
Correlations quantify the "scatter" between two numerical variables. We quantify this with the correlation coefficient, r. Variables are strongly correlated if r is close to 1 or -1 (low "scatter"), and weakly correlated if r is close to 0 (high "scatter").
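As a quick illustration with made-up vectors (not the chapter's data), the helper function cor( ) returns r directly:

```r
# Made-up data with a strong positive linear relationship (low scatter)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
cor(x, y)  # very close to 1

# Made-up data with no real relationship to x (high scatter)
set.seed(1)
z <- rnorm(6)
cor(x, z)  # much closer to 0
```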

In R, we use the function cor.test( ). Once you read in your data, you need the following information:

  1. x - your first numerical variable
  2. y - your second numerical variable

Let’s practice through example.

Example: Let’s use the dataset from Chapter 2 on ornamentation in guppies. Brooks et al. (2000) tested the inheritance of “attractive” traits from fathers to sons. For 36 father-son pairs, they quantified the ornamentation of the fathers (colour and brightness), and the attractiveness of their sons (rate of female visits). Is a father’s ornamentation associated with their son’s attractiveness?

Step 1: Read in the data (available here: https://github.com/lczawadzki/biostats/raw/main/data/guppyFatherSon.csv)

guppy <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/guppyFatherSon.csv")
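Optionally, you can run a few quick checks on the guppy dataframe you just created to confirm the file read in correctly (head( ), str( ), and nrow( ) are base R inspection functions):

```r
# Quick checks on the dataframe created in Step 1
head(guppy)   # first six rows
str(guppy)    # column names and types
nrow(guppy)   # should be 36 father-son pairs
```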

Step 2: Run a correlation test with the function cor.test( ). How strong is the correlation between ornamentation in fathers and attractiveness in sons? Is the relationship statistically significant?

cor.test(guppy$fatherOrnamentation, guppy$sonAttractiveness)

Step 3: Interpret your output and state your conclusion.

Let’s read through the output. The output tells you a number of things about your data:

  1. Line 1 tells you what data you used for the test.
  2. Line 2 states specifically your test statistic (t), your degrees of freedom (df), and your P-value. Remember, the P-value is the probability of obtaining a correlation as extreme as, or more extreme than, your sample correlation. That is, it is the probability of obtaining your sample correlation under the null hypothesis of no correlation.
  3. Line 3 tells you the alternative hypothesis.
  4. Line 4 tells you the 95% confidence interval of the correlation coefficient, which gives the values between which you are 95% confident that the true population correlation lies.
  5. Line 5 tells you information about your sample estimates, specifically, the correlation coefficient, r, of your sample data.

So, from our output, the first thing we notice is that R ran a “Pearson’s product-moment correlation”, which is the correlation test in R. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:

  1. Our data comes from the dataframe guppy and our variables are fatherOrnamentation and sonAttractiveness.
  2. Our test statistic (t) is 4.5371, our degrees of freedom (df) are 34, and our P-value = 6.784e-05.
  3. Our alternative hypothesis is that the true correlation is not equal to 0.
  4. We are 95% confident that the true correlation coefficient lies between 0.358 and 0.784.
  5. Our sample correlation coefficient, r, (the estimate) equals 0.614.

The point of running a correlation test is two-fold: (1) to quantify the relationship with a correlation coefficient, and (2) to test if there is a linear association between the two variables. What conclusions can you draw from your output?
(1) “The correlation coefficient, r = 0.614. There is a strong positive correlation between father ornamentation and son attractiveness in guppies.”
(2) “P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data do NOT meet our null expectations. We conclude that there is a significant association between father ornamentation and son attractiveness.”
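Incidentally, cor.test( ) returns a list, so you can extract these numbers instead of retyping them from the printed output. Here is a sketch using simulated data (swap in guppy$fatherOrnamentation and guppy$sonAttractiveness to recover the values above):

```r
set.seed(42)
x <- rnorm(36)
y <- x + rnorm(36)   # built to be positively correlated

result <- cor.test(x, y)
result$estimate      # sample correlation coefficient, r
result$statistic     # test statistic, t
result$parameter     # degrees of freedom (n - 2 = 34)
result$p.value       # P-value
result$conf.int      # 95% confidence interval for the true correlation
```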

Pt II: Nonparametric methods - Spearman’s rank
If our data are not normally distributed, and data transformations do not make them follow a normal distribution, we can use Spearman's rank correlation, the nonparametric alternative to the correlation test.

Let’s practice through example.

Example: In human populations, urbanization of settlements causes disease transmission to increase. As disease transmission increases, so does the evolution of resistance to diseases. A mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) studied the frequency of this resistant allele in towns and villages in Europe and Asia compared to the duration of these settlements to ask whether there was an association between frequency of alleles and settlement duration.

Step 1: Read in the data (available here: https://github.com/lczawadzki/biostats/raw/main/data/TBResistance.csv)

tb <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/TBResistance.csv")

Step 2: Check that both numerical variables are normal using a histogram and QQ plot.

hist(tb$alleleFrequency)
qqnorm(tb$alleleFrequency)
qqline(tb$alleleFrequency)

hist(tb$duration)
qqnorm(tb$duration)
qqline(tb$duration)

Both histograms show obvious skew, and both QQ plots curve substantially away from the line. The data are not normally distributed.
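(Not required for this exercise, but base R also offers a numerical normality check, shapiro.test( ), which tests the null hypothesis that a sample comes from a normal distribution. Applied to tb$alleleFrequency and tb$duration it should agree with the plots; here is the idea with made-up skewed data:)

```r
# Shapiro-Wilk test: H0 = data are normally distributed
set.seed(1)
skewed <- rexp(100)   # made-up, clearly right-skewed data
shapiro.test(skewed)  # small P-value -> reject normality
```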

Step 3: Since the data are NOT normal, we need to run a Spearman’s rank correlation test instead of the parametric test. Run this test to ask: how strong is the correlation between frequency of the resistant allele and duration of settlement? Is the relationship statistically significant? You can ignore the warning.

cor.test(tb$alleleFrequency, tb$duration, method = "spearman")
## Warning in cor.test.default(tb$alleleFrequency, tb$duration, method =
## "spearman"): Cannot compute exact p-value with ties

Step 4: Interpret your output and state your conclusion.

Let’s read through the output.
The output tells you a number of things about your data:

  1. Line 1 tells you what data you used for the test.
  2. Line 2 states specifically your test statistic (S) and your P-value. Remember, the P-value is the probability of obtaining a rank correlation as extreme as, or more extreme than, your sample rank correlation. That is, it is the probability of obtaining your sample correlation under the null hypothesis of no correlation.
  3. Line 3 tells you the alternative hypothesis.
  4. Line 4 tells you information about your sample estimates, specifically, the correlation coefficient, \(r_s\), of your sample data.

So, from our output, the first thing we notice is that R ran a “Spearman’s rank correlation rho”, which is the nonparametric correlation test in R. This is exactly what we wanted to run, so we’re good so far. Now, let’s look at each line:

  1. Our data comes from the dataframe tb and our variables are alleleFrequency and duration.
  2. Our test statistic (S) is 232.64, and our P-value = 0.001258.
  3. Our alternative hypothesis is that the true rho is not equal to 0.
  4. Our sample correlation coefficient, \(r_s\), (the estimate) equals 0.715.

Again, the point of running a correlation test is two-fold: (1) to quantify the relationship with a correlation coefficient, and (2) to test if there is a linear association between the two variables. What conclusions can you draw from your output?
(1) “The correlation coefficient, \(r_s\) = 0.715. There is a strong positive correlation between frequency of the resistant allele and duration of settlement.”
(2) “P < 0.05, therefore our results are statistically significant, and we reject the H0. Our data do NOT meet our null expectations. We conclude that there is a significant association between frequency of the resistant allele and duration of settlement.”

Pt III: Plotting scatterplots
We can visualize these relationships with a scatterplot! Remember, the best graph for two numerical variables is a scatterplot.

To do this, we use the function plot( ), and we need two pieces of information:

  1. x - your first numerical variable
  2. y - your second numerical variable

Let’s make one for both examples:
1. Guppy dataset

plot(guppy$sonAttractiveness, guppy$fatherOrnamentation)

2. TB dataset

plot(tb$duration, tb$alleleFrequency)

We can change the shape of each datapoint with the argument pch, and the color of each datapoint with the argument col. Let’s try this.

plot(guppy$sonAttractiveness, guppy$fatherOrnamentation,
     pch = 20, col = "red")
plot(tb$duration, tb$alleleFrequency,
     pch = 20, col = "red")
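plot( ) also accepts the arguments xlab, ylab, and main for axis labels and a title, which make the figure much easier to read. For example, for the guppy scatterplot (the label text here is our own wording, based on the study description):

```r
plot(guppy$sonAttractiveness, guppy$fatherOrnamentation,
     pch = 20, col = "red",
     xlab = "Son attractiveness (rate of female visits)",
     ylab = "Father ornamentation",
     main = "Guppy father-son pairs")
```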


Now it’s time to practice on your own.

  1. Larger animals tend to have larger brains. But is the increase in brain size proportional to the increase in body size? A set of data on body and brain size of 62 mammal species was collated by Allison and Cicchetti (1976), and these data are in the data set “mammals.csv”. The file contains columns giving the species name, the average body mass (in kg) and average brain size (in g) for each species. Data on this study are here: https://github.com/lczawadzki/biostats/raw/main/data/mammals.csv.

    a. Plot brain size against body size. Is the relationship linear?
    b. Based on part (a), which test should we run? Run this test. Is there statistical evidence that brain size is correlated with body size?

  2. We can make the data in Question 1 bivariate normal with a log transformation. To use log-transformed data in a hypothesis test, we can write the variables in as log(df$x) rather than just df$x. Using the log-transformed data, complete the following:

    a. Run the appropriate test in R to ask if there is statistical evidence that brain size is correlated with body size.
    b. What is the correlation coefficient? What does this tell us about the strength of the correlation?
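As a syntax illustration only (with made-up data, so it doesn't give away the exercise answer), log( ) can be applied inline inside cor.test( ):

```r
# Made-up positive, right-skewed variables (hypothetical, NOT the mammals data)
set.seed(7)
body  <- rexp(30) * 100
brain <- body^0.75 * exp(rnorm(30, sd = 0.2))

# Apply the log transformation inline, without creating new columns
cor.test(log(body), log(brain))
```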