Let’s apply what we’ve learned in Chapter 11 to R.

Objectives

New Commands:

R Command: Notes
t.test(df$var, mu = ): runs a one-sample t-test to test whether sample data are consistent with a hypothesized population mean, where df is the name of your dataframe, var is the name of your column, and mu is the expected mean if the H0 is true
qchisq(p = , df = ): gives the critical value from the chi-square distribution, where p is 1-alpha/2 for the lower bound of your confidence interval and alpha/2 for the upper bound of your confidence interval, and df is your degrees of freedom. DON’T CONFUSE THIS WITH pchisq, which returns a probability rather than a critical value


Reminder: Save your script for practice at home! :)

Shortcuts:

Symbol or Command        Keyboard Shortcut
<-                       Alt + -
#                        Shift + 3
Run one line in script   Ctrl + Enter
Run entire script        Ctrl + Shift + Enter
Open new script          Ctrl + Shift + N


Exercise 9
1. Open RStudio and prepare a new script.

Open a new script. All of your code for today’s exercises, and your notes and comments will go in this script. Write your filename, title, author, date, and description of the script.

# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)

# Description: (describe what this script is for)


2. We are going to dive back into numerical data! When conducting research, scientists often have an expectation of a population mean, and want to ask if their sample meets these expectations. To do this, we need to conduct a hypothesis test. In this case, we will conduct a one-sample t-test.

In R, we use the function t.test(). Once you have read in your data, you need two pieces of information:

  1. df$var - the name of your dataframe, and the column of interest
  2. mu = - the population mean proposed under the null hypothesis (H0)

Let’s practice with an example.

Example 1: We are taught that the average human body temperature is 98.6ºF. A researcher obtained body-temperature measurements for 25 randomly selected healthy individuals. Test the hypothesis that mean human body temperature is equal to 98.6º.

Pt I: Reading in your data
Read the file into R from GitHub using the following URL: https://github.com/lczawadzki/biostats/raw/main/data/bodytemp.csv.

     bodytemp <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/bodytemp.csv")

Pt II: Running the t-test
From our data, we know that our dataframe is “bodytemp”, our column of interest is “tempF”, and our population mean proposed under the null hypothesis is 98.6ºF. Using this information, we can run t.test.

t.test(bodytemp$tempF, mu = 98.6)

Pt III: Understanding your output

Let’s read through the output. The output tells you a number of things about your data:

  1. Line 1 tells you what data you used for the test.
  2. Line 2 states your test statistic (t), the degrees of freedom (df), and your P-value. Remember, the P-value is the probability of obtaining a sample mean as extreme as, or more extreme than, yours if the null hypothesis is true. It is not the probability that the null hypothesis is true.
  3. Line 3 tells you the 95% confidence interval of the mean, which is the range of values within which you are 95% confident that the true population mean lies.
  4. Line 4 tells you information about your sample estimates, specifically, the mean of your sample data.
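Each piece of the printed output can also be pulled out of the result by name. The sketch below assumes a small made-up vector of temperatures (x is hypothetical, not the bodytemp data) and stores the test in an object called result:

```r
# Made-up temperatures for illustration only (NOT the bodytemp data)
x <- c(98.1, 98.7, 98.4, 99.0, 98.3, 98.6, 97.9, 98.8, 98.5, 98.2)

# Store the test result instead of only printing it
result <- t.test(x, mu = 98.6)

result$statistic   # test statistic (t)
result$parameter   # degrees of freedom (df)
result$p.value     # P-value
result$conf.int    # 95% confidence interval of the mean
result$estimate    # sample mean
```

Storing the result this way lets you reuse the numbers in later calculations rather than retyping them from the console.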

So, from our output, we know that:

  1. Our data comes from the dataframe bodytemp and the variable tempF.
  2. Our test statistic (t) is -0.56, our degrees of freedom (df) are 24, and our P-value = 0.58.
  3. We are 95% confident that the true population mean lies between 98.2ºF and 98.8ºF.
  4. Our sample mean (the estimate) equals 98.5ºF.
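To see where these numbers come from, the same quantities can be computed by hand. This is a minimal sketch using made-up temperatures (x and mu0 are hypothetical names, and the values are not the bodytemp data); it should agree with what t.test() reports for the same vector.

```r
# Made-up temperatures for illustration only (NOT the bodytemp data)
x <- c(98.1, 98.7, 98.4, 99.0, 98.3, 98.6, 97.9, 98.8, 98.5, 98.2)
mu0 <- 98.6                      # mean proposed under H0

n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
t_manual <- (mean(x) - mu0) / se # test statistic
deg_free <- n - 1                # degrees of freedom

# Two-sided P-value: area in both tails beyond |t|
p_manual <- 2 * pt(-abs(t_manual), df = deg_free)

# 95% confidence interval of the mean
ci_manual <- mean(x) + c(-1, 1) * qt(0.975, df = deg_free) * se

print(c(t = t_manual, df = deg_free, p = p_manual))
print(ci_manual)
```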

Pt IV: Interpreting your results
The point of running a one-sample t-test is to test whether the population mean differs from the value proposed under the null hypothesis. This is a hypothesis test! Running the test alone is insufficient; we now need to draw our conclusions. What conclusions can you draw from your output? “P > 0.05, therefore our results are NOT statistically significant, and we fail to reject the H0. Our data are consistent with our null expectations: a difference this large between the sample mean and 98.6ºF could easily occur due to chance. We conclude that there is no evidence that mean body temperature differs from 98.6ºF. Additionally, we are 95% confident that the true population mean lies between 98.2ºF and 98.8ºF.”
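The decision rule used here can be written as a simple comparison. This sketch assumes the conventional significance level of 0.05 and uses the P-value reported in the output above; the object names are hypothetical.

```r
alpha <- 0.05     # significance level (assumed: the conventional 0.05)
p_value <- 0.58   # P-value reported in the t-test output above

# Reject H0 only when the P-value falls below alpha
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```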

Pt V: Calculating the 95% confidence interval of the variance
Sometimes we are more interested in the variance than the mean. We look at this through confidence intervals. To calculate the confidence interval of the variance, we use the following equations:
- lower bound = df * variance/chi-square(1-alpha/2, df)
- upper bound = df * variance/chi-square(alpha/2, df)

To find the chi-square critical values needed for our lower and upper bounds, we can use qchisq( ). For qchisq( ) we need to know our significance level (alpha) and our degrees of freedom.
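Because qchisq( ) is easy to mix up with pchisq( ), here is a short sketch of how the two relate: qchisq( ) takes a probability and returns a chi-square value, while pchisq( ) takes a chi-square value and returns the probability below it. The object names (deg_free, crit, prob) are hypothetical.

```r
deg_free <- 24

# qchisq: probability in, chi-square value out
crit <- qchisq(p = 0.975, df = deg_free)

# pchisq: chi-square value in, probability (area to the left) out
prob <- pchisq(q = crit, df = deg_free)

print(crit)   # the value with 97.5% of the distribution below it
print(prob)   # returns the 0.975 we started with
```

Feeding the result of one function into the other gets you back where you started, which is a quick way to check that you have picked the right one.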

Let’s calculate the confidence interval for our temperature data.

  1. First we calculate the variance.
var(bodytemp$tempF)
  2. Next, we determine our significance level and degrees of freedom, so we can find the chi-square critical values. We want the 95% confidence interval, so our alpha level is 0.05, and we have 25 observations, so our degrees of freedom are 25 - 1 = 24.

  3. Now we can find the chi-square critical values for our lower and upper bounds. With an alpha level of 0.05, our lower bound uses p = 1 - 0.05/2 = 0.975, and our upper bound uses p = 0.05/2 = 0.025. The calculations are below.

lowerq <- 1-(0.05/2)
upperq <- 0.05/2

qchisq(p = lowerq, df = 24)   #chi-square value for the lower bound
qchisq(p = upperq, df = 24)   #chi-square value for the upper bound
  4. Use your chi-square values to calculate the confidence interval. Remember, the formula is: df*variance/chisq
#variance = 0.4594
#lower bound chi square = 39.36408
#upper bound chi square = 12.40115

24*0.4594/39.36408 #lower bound
24*0.4594/12.40115 #upper bound

Finally, report your confidence interval. We are 95% confident that the true population variance lies between 0.28 and 0.89.
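The four steps above can be wrapped into one small function. This is only a sketch: var_ci is a hypothetical helper written for this example, not a built-in R function, and like the formula itself it assumes the data come from a normally distributed population.

```r
# Hypothetical helper (not built into R): confidence interval of a variance
# s2 = sample variance, n = number of observations, alpha = significance level
var_ci <- function(s2, n, alpha = 0.05) {
  deg_free <- n - 1
  lower <- deg_free * s2 / qchisq(p = 1 - alpha / 2, df = deg_free)
  upper <- deg_free * s2 / qchisq(p = alpha / 2, df = deg_free)
  c(lower = lower, upper = upper)
}

# Variance and sample size from the body temperature example
ci <- var_ci(s2 = 0.4594, n = 25)
print(round(ci, 2))
```

With your own data you could pass var(df$var) and length(df$var) instead of typing the numbers in.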

Now it’s time to practice on your own.

  1. Let’s revisit our dataset on the human genome: https://github.com/lczawadzki/biostats/raw/main/data/genes_sample2.csv. A geneticist takes a sample of 500 genes from the human genome. It is known that the mean human gene length is 2622 nucleotides. Are the results from the geneticist’s sample consistent with the population mean?

    a. Report your null hypothesis, run your test in R, and state your conclusion (statistical and in language relevant to the problem).

  2. Let’s revisit our dataset on seal oxygen levels: https://github.com/lczawadzki/biostats/raw/main/data/sealoxygen.csv. This dataset contains information on metabolic costs of feeding and non-feeding dives (in ml O2/kg) for ten Weddell seals. It is hypothesized that the metabolic cost of a feeding dive should be 95 ml O2/kg. Test this hypothesis with your sample.

    a. Report your null hypothesis, run your test in R, and state your conclusion (statistical and in language relevant to the problem).
    b. Calculate the standard error. Report this with the sample mean as a statement. Recall: standard error = sd/sqrt(n).
    c. Report the 95% confidence interval of the mean as a statement.
    d. Calculate and report the 95% confidence interval of the variance.
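As a reminder for part b, the standard error formula translates directly into R. The sketch below uses made-up numbers (x is hypothetical, not the seal data), so you will need to swap in your own dataframe and column.

```r
# Made-up measurements for illustration only (NOT the seal data)
x <- c(90, 102, 95, 88, 110, 97, 105, 93, 99, 101)

n <- length(x)          # sample size
se <- sd(x) / sqrt(n)   # standard error = sd/sqrt(n)

cat("mean =", round(mean(x), 1), "and SE =", round(se, 1), "\n")
```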