Topics


1 Installing and Updating R, RStudio, and R packages

Detailed installation instructions can be found on this webpage.

1.1 Installing R

1.2 Installing RStudio

1.3 Updating R

  • Windows:
    • In RStudio run:

      install.packages("installr")
      installr::updateR()
  • Apple macOS:
    • Remove the old R from Applications and reinstall using the appropriate installer.

1.4 Updating RStudio

  • Check via Help > Check for Updates. If an update is available, uninstall RStudio and then install the new version.

1.5 Updating R Packages

  • In RStudio run:

    options(pkgType = "binary")
    update.packages(ask = FALSE)

2 Introduction to R and RStudio

RStudio Panes and Interface

Console vs R scripts vs Rmarkdown (.Rmd) files

Code vs Comments (vs Text in Rmarkdown)


3 Basic R Functionality

3.1 Arithmetic and Calculations

R can perform basic arithmetic operations. Here are some examples:

1 + 2   # Addition
## [1] 3
5 - 3   # Subtraction
## [1] 2
2 * 3   # Multiplication
## [1] 6
1 / 2   # Division
## [1] 0.5
# Exponentiation
2^3
## [1] 8
# Square roots
sqrt(4)
## [1] 2
9^(1/2)
## [1] 3
# Standard functions such as log(), exp(), log10() also exist

Remember, order of operations matters! Use parentheses to ensure the correct order.

# This will give different results
(1 + 2) * 3   # Parentheses first
## [1] 9
1 + 2 * 3     # Multiplication first
## [1] 7
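
One precedence rule that often surprises people is that exponentiation is evaluated before a leading minus sign. A minimal sketch (the names a and b are only for illustration):

```r
# Exponentiation is evaluated before the unary minus
a <- -2^2      # Same as -(2^2)
a
## [1] -4

b <- (-2)^2    # Parentheses force the negation first
b
## [1] 4
```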

3.2 Getting help in R

?sqrt
help(sqrt)

3.3 Logical values

# Logical
TRUE
## [1] TRUE
FALSE
## [1] FALSE
!TRUE             # not operator
## [1] FALSE
2 == 3            # equal to operator
## [1] FALSE
2 != 2            # not equal to operator
## [1] FALSE
!(2 == 2)
## [1] FALSE
2 > 3
## [1] FALSE
2 <= 3
## [1] TRUE
(2 > 1) & (2 < 3) # and operator
## [1] TRUE
(2 > 1) | (2 < 3) # or operator
## [1] TRUE
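
A caveat when using == with decimal numbers: they are stored approximately, so an exact comparison can fail even when the arithmetic looks correct. The sketch below (the name tenths is only for illustration) shows the problem and one common workaround, all.equal(), which compares within a small tolerance:

```r
# Decimal fractions are stored approximately, so exact equality can fail
tenths <- 0.1 + 0.2
tenths == 0.3
## [1] FALSE

# all.equal() compares with a small numerical tolerance
isTRUE(all.equal(tenths, 0.3))
## [1] TRUE
```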

3.4 Variables and Assignment

Variables are used to store values.

Use <- (or =) to assign a value to a name.

You can later retrieve the value by typing the variable name.

x <- 5          # Assign 5 to x
y <- 10         # Assign 10 to y
total <- x + y  # Add x and y
total           # Output 15
## [1] 15
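
A variable stores the value it was given at the time of assignment, not the formula that produced it. The short sketch below repeats the example above and then changes x to show that total only changes once it is recomputed:

```r
x <- 5
y <- 10
total <- x + y   # total stores the value 15, not the formula x + y

x <- 100         # Re-assigning x overwrites its old value
total            # total is unchanged until you recompute it
## [1] 15

total <- x + y   # Recompute with the new x
total
## [1] 110
```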

4 Vectors

Vectors are one-dimensional arrays that hold numeric, character, or logical data; all elements of a vector have the same type. You can create vectors using the combine function, c().

# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)

# Creating a character vector
characters <- c("apple", "banana", "cherry")

# Creating a logical vector
logicals <- c(TRUE, FALSE, TRUE)

# Accessing elements of a vector
numbers[1]       # First element
## [1] 1
characters[2:3]  # Second and third elements
## [1] "banana" "cherry"
logicals[c(1, 3)] # First and third elements
## [1] TRUE TRUE
# Vectorised operations
numbers * 2       # Multiply each element by 2
## [1]  2  4  6  8 10
numbers + 10      # Add 10 to each element
## [1] 11 12 13 14 15
numbers > 3       # Logical comparison
## [1] FALSE FALSE FALSE  TRUE  TRUE
# Member of a vector
"cherry" %in% characters
## [1] TRUE
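
Because a vector holds a single type, mixing types inside c() silently converts every element to one common type. The sketch below illustrates this, together with a few helpers that are often useful with vectors (the name mixed is only for illustration):

```r
# Mixing types in c() converts everything to one common type
mixed <- c(1, "apple", TRUE)
mixed
## [1] "1"     "apple" "TRUE"
class(mixed)
## [1] "character"

numbers <- c(1, 2, 3, 4, 5)
length(numbers)      # Number of elements
## [1] 5
numbers[-1]          # Negative index: everything except the first element
## [1] 2 3 4 5
sum(numbers > 3)     # TRUE counts as 1, so this counts the elements above 3
## [1] 2
```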

5 Installing packages

In the console, run:

install.packages("palmerpenguins")
install.packages("tidyverse")

Alternatively, use the Install button in the Packages tab in RStudio.
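
A package only needs to be installed once per computer, whereas library() is called in every new session. One possible way to check which packages are still missing, without installing anything, is sketched below; the names pkgs, installed, and missing_pkgs are illustrative and not part of the course material:

```r
# Packages used in this session
pkgs <- c("palmerpenguins", "tidyverse")

# requireNamespace() returns TRUE if a package is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs   # Empty if everything is already installed

# Uncomment to install only what is missing
# install.packages(missing_pkgs)
```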


6 Introduction to tidyverse

This is a brief overview of the tidyverse, a set of packages for data science in R.

# Load the tidyverse collection of packages
library(tidyverse)

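The tidyverse is a collection of packages which includes:

  • ggplot2 (plotting)
  • dplyr (data manipulation)
  • tidyr (reshaping data)
  • readr (reading data files)
  • tibble (modern data frames)

and more…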

The good news is: you don’t need to remember which package a function comes from. When you load the tidyverse, all of these packages are loaded at once.

You can find helpful cheatsheets here.


7 palmerpenguins: A tidyverse data example

A demonstration using the penguins dataset, provided by the palmerpenguins package.

The penguins dataset contains 344 rows and 8 variables:

library(palmerpenguins)
library(tidyverse)

# Structure of dataset
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Number of rows and columns - also nrow() and ncol()
dim(penguins)
## [1] 344   8
# Look at the first six rows of the dataset
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
# Summary of dataset
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
# A scatterplot exploring penguin body mass vs flipper length
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "darkblue", alpha = 0.6) +
  labs(title = "Penguins: Flipper Length vs Body Mass",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  geom_smooth(method = "lm", color = "red", se = FALSE)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

# A boxplot to explore the association between body mass and species
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  labs(title = "Penguins: Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).


8 Descriptive statistics

The pipe operator |> takes what is on its left and uses it as the first argument of the function on its right.

take_this |>
  and_then_do_this()
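
As a small illustration, the sketch below computes the same quantity twice, once with nested function calls and once with the pipe. It assumes R 4.1 or later, where the native pipe |> is available; the names values, nested, and piped are only for illustration:

```r
values <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- sum(sqrt(values))
nested
## [1] 6

# Same computation with the pipe: read from top to bottom
piped <- values |>
  sqrt() |>
  sum()
piped
## [1] 6
```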

To compute the number of observations (n), mean (M), and standard deviation (SD) for a variable, you can use the following code:

# Descriptive statistics for body_mass_g
penguins |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##       n     M    SD
##   <int> <dbl> <dbl>
## 1   344 4202.  802.

Without summarise(), you would call the base R functions one at a time and tell R each time where the variable is located, using the $ operator:

nrow(penguins)
## [1] 344
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
sd(penguins$body_mass_g, na.rm = TRUE)
## [1] 801.9545

To get the same statistics for a variable grouped by another variable (e.g., species), you can use the group_by() function:

# Descriptive statistics for body mass by species
penguins |>
  group_by(species) |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   species       n     M    SD
##   <fct>     <int> <dbl> <dbl>
## 1 Adelie      152 3701.  459.
## 2 Chinstrap    68 3733.  384.
## 3 Gentoo      124 5076.  504.
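
Note that n() counts every row, including rows where the variable is missing, while mean() and sd() with na.rm = TRUE use only the non-missing values. If the number of values actually used is needed, one option is sum(!is.na(...)). The sketch below uses a small made-up tibble (toy, group, and mass are invented for illustration and are not the penguins data):

```r
library(dplyr)

# Toy data (made up for illustration): one missing mass in group "a"
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  mass  = c(10, 20, NA, 30, 50)
)

toy_summary <- toy |>
  group_by(group) |>
  summarise(
    n       = n(),                 # All rows, including the one with NA
    n_valid = sum(!is.na(mass)),   # Rows actually used by mean() and sd()
    M       = mean(mass, na.rm = TRUE),
    SD      = sd(mass, na.rm = TRUE)
  )
toy_summary
```

For group "a", n is 3 but n_valid is 2, which is the sample size that belongs with the reported mean and standard deviation.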

9 Data subsetting

penguins[1, ]           # First row

penguins[, 1]           # First column - by number
penguins[ , "species"]  # First column - by name
penguins$species        # First column - by name, returned as a vector

penguins[c(1,2,3), ]    # First three rows

penguins[, c(1,2,3)]    # First three columns - by number
penguins[, c("species", "island", "bill_length_mm")] # First three columns - by name

penguins[c(1,2,3), c(1,2,3)] # First three rows and first three columns
penguins[1:3, 1:3]      # Same result, using the : shortcut for a range of consecutive values
penguins |>
  slice(1:3)            # First three rows, tidyverse style

# Subset the data for only Adelie species
adelie_penguins <- penguins |> 
  filter(species == "Adelie")

# Remove specific rows of the penguins dataset
!is.na(penguins$body_mass_g)    # returns TRUE/FALSE for each row

# Filter out rows with NA values in body_mass_g
penguins_no_na <- penguins |> 
  filter( !is.na(body_mass_g) )

# Only keep columns species, body_mass_g
penguins_subset <- penguins_no_na |> 
  select(species, body_mass_g)
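
The TRUE/FALSE vector returned by a condition such as !is.na() is what does the actual selecting: rows with TRUE are kept and rows with FALSE are dropped. A base R sketch on a small made-up data frame (toy_df and its values are invented for illustration):

```r
# Toy data frame (made up for illustration)
toy_df <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3750, NA, 3250, 3800)
)

# Combine two conditions with the and operator
keep <- !is.na(toy_df$mass) & toy_df$species == "Adelie"
keep
## [1]  TRUE FALSE  TRUE FALSE

toy_df[keep, ]   # Base R: keep the rows where keep is TRUE

# dplyr equivalent: toy_df |> filter(!is.na(mass), species == "Adelie")
```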

10 Creating new columns

You can create new columns in a data frame using the mutate() function from the dplyr package, which is part of the tidyverse.

The syntax works as follows:

data_name |>
  mutate(new_column_name = expression)

For example:

# Create a new column for body mass in kg
penguins <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)

# Create a new column for bill length and depth in cm
penguins <- penguins |>
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10
  )

If you want to overwrite an existing column, you can do so by using the same name in the mutate() function.
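
A minimal sketch of overwriting a column, using a made-up tibble (toy_mass and its values are invented for illustration):

```r
library(dplyr)

# Toy data (made up for illustration), body mass in grams
toy_mass <- tibble(body_mass = c(3750, 3800, 3250))

# Re-using the name body_mass replaces the column instead of adding a new one
toy_mass <- toy_mass |>
  mutate(body_mass = body_mass / 1000)   # Now in kilograms

toy_mass$body_mass
## [1] 3.75 3.80 3.25
```

Overwriting discards the original values, and running the same code a second time would divide by 1000 again, so a new column name is often the safer choice.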


11 Factors

Sometimes categorical variables are not stored as characters/factors. They could be stored as numbers which represent categories.

For example, species is a categorical variable with three levels: Adélie, Chinstrap, and Gentoo. Suppose we have a dataset where species is stored as numbers (1, 2, 3) instead of characters. This can lead to confusion when analyzing the data, as the numerical representation does not convey the actual categories. Furthermore, R will compute the mean of those codes without complaint, even though the result is meaningless.

To convert a numeric variable to a factor, you can use the factor() function.

# Toy dataset with species as number
data_example <- tibble(species = c(1, 2, 3, 1, 2, 3))
data_example
## # A tibble: 6 × 1
##   species
##     <dbl>
## 1       1
## 2       2
## 3       3
## 4       1
## 5       2
## 6       3
mean(data_example$species) # Nonsensical mean: the numbers are category codes, not quantities.
## [1] 2
# Convert species to a factor
data_example <- data_example |>
  mutate(species = factor(species))

mean(data_example$species) # R now warns that it cannot compute a mean and returns NA
## Warning in mean.default(data_example$species): argument is not numeric or
## logical: returning NA
## [1] NA
# Convert species to a factor and choose order of levels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1))
  )

# Convert species to a factor and use better labels - check the data codebook for the labels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1),
                     labels = c("Gentoo", "Chinstrap", "Adelie"))
  )

# What are the factor levels?
levels(data_example$species)
## [1] "Gentoo"    "Chinstrap" "Adelie"
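
The order of the levels matters in practice: it determines the order in which categories appear in tables and plots. It also means that the integer codes stored inside a factor are positions in the level list, not the original numbers. A base R sketch (the names codes and species_f are only for illustration):

```r
codes <- c(1, 2, 3, 1, 2, 3)
species_f <- factor(codes,
                    levels = c(3, 2, 1),
                    labels = c("Gentoo", "Chinstrap", "Adelie"))

# table() reports the counts in the order of the levels
table(species_f)

# as.integer() returns the position of each value's level, not the original code
as.integer(species_f)
## [1] 3 2 1 3 2 1
```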

12 Reading data into R

Consider this dataset

A file path is a string that specifies the location of a file or directory in your computer’s file system. It tells your system how to navigate through folders to locate the file. There are two main types:

Absolute path: specifies the complete location of the file, starting from the root folder.

Relative path: specifies the location relative to the current working directory (for an R Markdown file, by default the folder that contains the file).

Paths are essential for accessing or referencing files correctly in your code.

# Read from local file
penguins_data <- read_csv("path/to/penguins.csv")

# Read from URL
penguins_data <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
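
To see how relative and absolute paths relate, the base R sketch below builds both for a hypothetical file penguins.csv inside a hypothetical folder data; neither is assumed to exist on your computer, which is why file.exists() is used as a check:

```r
# Absolute path of the current working directory
wd <- getwd()

# Hypothetical relative path: a file penguins.csv inside a folder data
rel_path <- file.path("data", "penguins.csv")
rel_path
## [1] "data/penguins.csv"

# The absolute path that this relative path refers to
abs_path <- file.path(wd, rel_path)
abs_path

# Check whether the file exists before trying to read it
file.exists(rel_path)
```

If file.exists() returns FALSE, the usual causes are a working directory different from the one expected or a typo in the folder or file name.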

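The following functions are also available to read other types of files, for example:

  • read_tsv() and read_delim() from readr (tab-separated and other delimited text files)
  • read_excel() from readxl (Excel files)
  • read_sav() and read_dta() from haven (SPSS and Stata files)

readxl and haven are installed with the tidyverse but need their own library() call.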


13 Summary & Questions

This introductory session builds the foundation in R and the tidyverse for the forthcoming four days, which will focus on the linear model.