R Exercise 3: Displaying and describing data (Ch 2, 3)

Let’s apply what we’ve learned in Chapters 2 and 3 to R.

Objectives

Review reading in data.
Be able to create basic graphs: histograms and box plots.
Know how to interpret histograms and boxplots.
Be able to calculate summary statistics for location (central tendency).
Be able to calculate summary statistics for spread.

Reminder: Save your script for practice at home! :)

Shortcuts:

Symbol or Command	Keyboard Shortcut
<-	Alt + -
#	Shift + 3
Run one line in script	Ctrl + Enter
Run entire script	Ctrl + Shift + Enter
Open new script	Ctrl + Shift + N

Exercise 3
1. Open RStudio and prepare a new script.

Open a new script. All of your code for today’s exercises, along with your notes and comments, will go in this script. At the top, write the filename, title, author, date, and a description of the script.

# Filename: (what you will save the script as)
# Title: (give script a title)
# Author: (write your full name here)
# Date: Month Day Year (write the actual date here)

# Description: (describe what this script is for)

2. Let’s work with the file “student-height.csv” from the GitHub repository.

Pt I: Reading in the file
Read the file into R, either directly from GitHub using the following URL: https://github.com/lczawadzki/biostats/raw/main/data/student-height.csv, or by downloading the data and using its pathname (or setting your working directory). Remember to assign it to a name.

     studht <- read.csv("https://github.com/lczawadzki/biostats/raw/main/data/student-height.csv", header = TRUE)
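If you download the file instead, here is a rough sketch of the two local-file options. It is self-contained, so it first writes a tiny made-up CSV to a temporary folder as a stand-in for your download; the file name demo-height.csv and the object names are hypothetical, and on your own computer you would use your real download folder.

```r
# Stand-in for a downloaded file: write a tiny made-up CSV to a temporary folder.
# (On your computer you would skip this step and use your real download folder.)
demo_dir  <- tempdir()
demo_file <- file.path(demo_dir, "demo-height.csv")   # hypothetical file name
write.csv(data.frame(height = c(60, 62, 65)), demo_file, row.names = FALSE)

# Option 1: give read.csv() the full pathname of the file
demo1 <- read.csv(demo_file, header = TRUE)

# Option 2: set the working directory, then use just the file name
old_wd <- setwd(demo_dir)          # setwd() returns the previous directory
demo2  <- read.csv("demo-height.csv", header = TRUE)
setwd(old_wd)                      # go back to where we started

# Both options should read in the same data
identical(demo1, demo2)
```

Either option works; the full pathname is handy for a one-off, while setting the working directory saves typing if you read several files from the same folder.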

Pt II: Displaying data (view format, make histogram)
a. Look at the structure of the dataframe using the function str( ). This tells you information about the dataframe, including how many observations, how many variables, and what those variables are.

     str(studht)

## 'data.frame':    81 obs. of  1 variable:
##  $ height: int  57 60 61 61 61 61 62 62 62 62 ...

b. Use the function head( ) to look at the first few lines of the dataframe.

     head(studht)
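A few related functions can answer the same questions one at a time. This is a small sketch on a made-up dataframe called toy (not the class data); the same calls should work on studht.

```r
# Made-up stand-in for a dataframe like studht
toy <- data.frame(height = c(57, 60, 61, 61, 62, 63, 65, 66, 68, 70, 72, 74))

nrow(toy)         # number of observations (rows)
ncol(toy)         # number of variables (columns)
names(toy)        # variable names
head(toy, n = 3)  # first 3 rows instead of the default 6
tail(toy, n = 3)  # last 3 rows
```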

c. Our data consist of one numerical variable. To plot one numerical variable, we use a histogram. To plot histograms in R, we can use the function hist( ). However, we cannot just type in the dataframe as the argument; we need to specify both the dataframe and the variable, using the notation: dataframe$variable.

     hist(studht$height)

d. We can change the labels by adding the argument xlab = "" for the x-axis label, and main = "" for the title of the graph. Test this out.

     hist(studht$height, xlab = "Student Height (in)", main = "Frequency of Student Height")
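The number of bars can change how a histogram looks, so it can be worth experimenting. Here is a sketch using made-up heights (toy_height is not the class data): the breaks argument suggests roughly how many bins to use, and plot = FALSE returns the bin edges and counts without drawing anything.

```r
# Made-up heights, in inches
toy_height <- c(57, 60, 61, 61, 62, 62, 62, 63, 65, 66, 66, 68, 70, 74)

# breaks = a single number is only a suggestion; R picks "nice" bin edges near it
hist(toy_height, breaks = 5,
     xlab = "Student Height (in)", main = "Toy Heights, About 5 Bins")

# Get the numbers behind the histogram without drawing it
h <- hist(toy_height, plot = FALSE)
h$breaks   # the bin edges R chose
h$counts   # how many observations fall in each bin
```

The counts always add up to the number of observations, which is a quick way to check that nothing was left out.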

e. Now, look at your histogram:
- What is the shape of our distribution (symmetric, asymmetric)?
- How many peaks does it have? (Each peak is a mode.)

Pt III: Describing data (summary statistics)
a. Let’s calculate our summary statistics, using the following functions. Remember, we need to specify which variable we are looking at using the notation: dataframe$variable.

     mean(studht$height)
     median(studht$height)
     sd(studht$height)  #standard deviation
     var(studht$height) #variance
     IQR(studht$height) #interquartile range

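Note: base R does not have a built-in function for the statistical mode (the function mode( ) exists, but it reports how a variable is stored, not its most common value). To find the mode, we can always look at the peak of our histogram!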
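If you want the most common value as a number rather than reading it off the graph, one possible approach is to count each value with table( ) and pick the largest count. This is a sketch on made-up values; the object names are hypothetical.

```r
# Made-up heights, in inches
toy_height <- c(61, 62, 62, 63, 63, 63, 64, 66)

# Count how many times each value appears
counts <- table(toy_height)
counts

# Keep the value(s) with the highest count; ties would return more than one
most_common <- names(counts)[counts == max(counts)]
as.numeric(most_common)   # 63 appears three times, more than any other value
```

Keep in mind this finds the single most frequent value, while the peak of a histogram depends on how the values are grouped into bins, so the two will not always agree exactly.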

b. We also do not have a function for the coefficient of variation. We will need to calculate this directly, using our knowledge of the formula, and the functions for standard deviation (sd) and mean (mean). The formula is: CV = (sd/mean)*100%. Remember, we need to specify which variable we are looking at using the notation: dataframe$variable.

     (sd(studht$height)/mean(studht$height))*100
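If you expect to calculate the coefficient of variation more than once, you could wrap the formula in your own small function. The name cv below is a hypothetical helper (it is not built into R), and the values are made up so you can check the arithmetic by hand.

```r
# Hypothetical helper: coefficient of variation, as a percentage
cv <- function(x) {
  (sd(x) / mean(x)) * 100
}

# Made-up values with a mean of 5
toy <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(toy)   # 5
sd(toy)     # sample standard deviation, about 2.14
cv(toy)     # about 42.8 (percent)
```

With the class data, the call would be cv(studht$height), which should match the one-line calculation above.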

c. A faster way to get several of these statistics in one go is to use the summary( ) function. This will give you the minimum value, first quartile, median, mean, third quartile, and maximum value. If you use summary( ), you will still need to calculate the standard deviation (sd), variance (var), and interquartile range (IQR) separately!

     summary(studht)

##      height     
##  Min.   :57.00  
##  1st Qu.:63.00  
##  Median :66.00  
##  Mean   :66.11  
##  3rd Qu.:69.00  
##  Max.   :74.00
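The interquartile range is just the third quartile minus the first quartile, so for the output above it works out to 69 - 63 = 6. Here is a sketch on made-up values showing that subtracting the quartiles gives the same answer as IQR( ); toy and q are hypothetical names.

```r
# Made-up heights, in inches
toy <- c(57, 60, 61, 63, 66, 69, 70, 72, 74)

summary(toy)                       # min, quartiles, median, mean, max

# Pull out the first and third quartiles
q <- quantile(toy, c(0.25, 0.75))
q

# Third quartile minus first quartile...
unname(q[2] - q[1])

# ...matches the built-in function
IQR(toy)
```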

d. We can also make a box plot of our data to examine the median, first quartile, third quartile, and interquartile range. The function for a box plot is boxplot( ). You can add labels in the same way you did with your histogram, except this time we’ll want to label the y-axis using ylab = "". Because our dataframe has only one variable, we can pass the whole dataframe as the argument.

     boxplot(studht, ylab = "Student Height (in)", main = "Box Plot of Student Heights")

e. Now, examine your boxplot:
- What does the black line represent?
- What do the edges of your box represent?
- What does the range from one end of the box to the other represent?
- What do the whiskers show?
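Once you have answered these questions, one way to check your reading of a box plot is to ask R for the numbers it used to draw it. This is a sketch on made-up values with one unusually large observation; b is a hypothetical name. Note that boxplot( ) uses "hinges" for the box edges, which can differ slightly from the quartiles reported by summary( ).

```r
# Made-up heights, in inches, with one unusually large value
toy <- c(57, 60, 61, 63, 66, 69, 70, 72, 95)

# plot = FALSE returns the numbers without drawing the plot
b <- boxplot(toy, plot = FALSE)

# Five numbers, bottom to top: lower whisker end, lower edge of box,
# median line, upper edge of box, upper whisker end
b$stats

# Points drawn individually beyond the whiskers (possible outliers)
b$out
```

By default the whiskers stop at the most extreme observation within 1.5 box lengths of the box, which is why the largest value here is drawn as a separate point.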

Let’s practice with another file.

Read in the file “bird-richness.csv” from GitHub using the following URL: https://github.com/lczawadzki/biostats/raw/main/data/bird-richness.csv (or by downloading the data). Assign it the name “birds”.

a. View the data using str() and head().

b. Suppose we are interested in the number of bird species found at a particular location (the Richness column). This is one numerical variable, so we would make a histogram. Plot a histogram, using the notation: dataframe$variable.

c. Plot a boxplot of species richness using the notation: dataframe$variable.

d. What if we wanted to compare bird species richness in different habitats? [Possible habitats in this file are “forest” and “grassland”.] In your boxplot, you will need to use the notation y~x, where y is the variable on the y-axis, and x is the variable on the x-axis. Plot the following.

     boxplot(birds$Richness ~ birds$habitat)

e. What pattern do you see? What is the effect of habitat on species richness? How do the medians vary? The IQR?
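To put numbers on what you see in a grouped box plot, you can calculate a statistic separately for each group with tapply( ). This is a sketch on a made-up dataframe, toy_birds, whose column names simply mirror the ones used above; the values are invented and are not from bird-richness.csv.

```r
# Made-up data: 5 forest sites and 5 grassland sites
toy_birds <- data.frame(
  Richness = c(12, 15, 14, 18, 20, 5, 7, 6, 9, 8),
  habitat  = c(rep("forest", 5), rep("grassland", 5))
)

# Same y ~ x idea, written with the data argument instead of dataframe$variable
boxplot(Richness ~ habitat, data = toy_birds,
        xlab = "Habitat", ylab = "Species Richness")

# tapply(values, groups, function) applies the function within each group
med    <- tapply(toy_birds$Richness, toy_birds$habitat, median)
spread <- tapply(toy_birds$Richness, toy_birds$habitat, IQR)
med      # median richness in each habitat
spread   # interquartile range in each habitat
```

For the class data, the same pattern would be tapply(birds$Richness, birds$habitat, median).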

Dr. Z

2026