Introduction to R and RStudio

Topics

1 Installing and Updating R, RStudio, and R packages
2 Introduction to R and RStudio
3 Basic R Functionality
4 Vectors
5 Installing packages
6 Introduction to tidyverse
7 palmerpenguins: A tidyverse data example
8 Descriptive statistics
9 Data subsetting
10 Creating new columns
11 Factors
12 Reading data into R
13 Summary & Questions

1 Installing and Updating R, RStudio, and R packages

Detailed installation instructions can be found on this webpage.

1.1 Installing R

Windows:
- Uninstall any previous R and Rtools installations.
- Download installer: https://cran.r-project.org/bin/windows/base/R-4.4.3-win.exe.
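- For RTools: the installer at https://cran.r-project.org/bin/windows/Rtools/rtools44/files/rtools44-aarch64-6459-6401.exe is the 64-bit ARM (aarch64) build; for 64-bit Intel/AMD PCs, use the standard installer listed at https://cran.r-project.org/bin/windows/Rtools/rtools44/rtools.html. For older PCs, see available 32-bit versions.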
Apple macOS:
- Remove any previous R and XQuartz installations from Applications.
- Download installer for your processor:
  - If Apple M1-M4: https://cran.r-project.org/bin/macosx/big-sur-arm64/base/R-4.4.3-arm64.pkg.
  - If Intel-based: https://cran.r-project.org/bin/macosx/big-sur-x86_64/base/R-4.4.3-x86_64.pkg.
- Install XQuartz: https://www.xquartz.org/.
Chromebooks:
- R cannot be installed; use Posit Cloud (https://posit.cloud/).

1.2 Installing RStudio

Windows & macOS:
- Download RStudio Desktop from https://posit.co/download/rstudio-desktop/ and follow the installation instructions.
Chromebooks:
- RStudio is unavailable; use Posit Cloud (https://posit.cloud/).

1.3 Updating R

Windows:

In RStudio run:

install.packages("installr")
installr::updateR()

Apple macOS:
- Remove the old R from Applications and reinstall using the appropriate installer.

1.4 Updating RStudio

Check via Help > Check for updates. If an update is available, uninstall RStudio, and then install RStudio again.

1.5 Updating R Packages

In RStudio run:

options(pkgType = "binary")
update.packages(ask = FALSE)
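
Before or after updating, it can help to check which versions are currently in use. A minimal sketch using base R only; "stats" is just an example of a package that ships with R, and any installed package name could be used instead:

```r
# Version of R that is currently running
R.version.string

# Version of an installed package (here "stats", which ships with R)
packageVersion("stats")
```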

2 Introduction to R and RStudio

RStudio Panes and Interface

Console: Execute commands interactively. To execute code, press Enter.
Editor: Write, modify, and save your R code. Press Ctrl+Enter or Cmd+Enter to send code to the Console for execution.
Environment: View your current objects and variables.
Files, Plots, Packages, and Help: Browse your files, view your plots, install or update packages (more on this later), and read help pages.

Console vs R scripts vs RMarkdown (.Rmd) files

Console: The Console is used for running commands interactively. It is great for quick calculations or testing small pieces of code, but it does not save your work. Once you close RStudio, the commands in the Console are lost unless explicitly saved elsewhere.
R Scripts (.R files): R scripts are plain text files where you can write and save R code. They are ideal for creating reusable code and documenting your workflow. You can execute code from an R script by sending it to the Console (e.g., using Ctrl+Enter or Cmd+Enter).
RMarkdown (.Rmd files): RMarkdown files combine code, text, and output in a single document. They are used for creating dynamic reports, presentations, or documents that include both analysis and narrative. You can execute code chunks within an RMarkdown file and render the document into formats like HTML, PDF, or Word.

Code vs Comments (vs Text in RMarkdown)

Code: Code is the actual R commands that are executed to perform tasks. In R scripts, code is written directly in the file. In RMarkdown, code is written inside code chunks (e.g., ```{r} ... ```).
Text in RMarkdown: In RMarkdown, text outside of code chunks is treated as narrative or explanatory text. It is written in Markdown syntax and is used to provide context, explanations, or documentation alongside the code and its output.
Comments: Comments are lines of text in R scripts (or in RMarkdown code chunks) that are not executed as code. They are used to explain or document the code. In R, comments start with #. For example:
```
# This is a comment
x <- 5  # Assign 5 to x
```

3 Basic R Functionality

3.1 Arithmetic and Calculations

R can perform basic arithmetic operations. Here are some examples:

1 + 2   # Addition

## [1] 3

5 - 3   # Subtraction

## [1] 2

2 * 3   # Multiplication

## [1] 6

1 / 2   # Division

## [1] 0.5

# Exponentiation
2^3

## [1] 8

# Square roots
sqrt(4)

## [1] 2

9^(1/2)

## [1] 3

# Standard functions such as log(), exp(), log10() also exist
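
As an illustration of the functions named in the comment above, here is a short sketch using base R only:

```r
log(100)           # natural logarithm (base e)
log10(100)         # base-10 logarithm
log(8, base = 2)   # logarithm with a chosen base
exp(1)             # e raised to the power 1
log(exp(2))        # log() and exp() undo each other
```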

Remember, order of operations matters! Use parentheses to ensure the correct order.

# This will give different results
(1 + 2) * 3   # Parentheses first

## [1] 9

1 + 2 * 3     # Multiplication first

## [1] 7
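
One case that often surprises newcomers: exponentiation is evaluated before a leading minus sign. A brief sketch (the names neg and pos are only illustrative):

```r
neg <- -2^2     # read as -(2^2), so the result is negative
pos <- (-2)^2   # parentheses make the base negative first

neg
pos
```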

3.2 Getting help in R

?sqrt
help(sqrt)
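
If you do not know the exact function name, a few other base R helpers can be useful. A small sketch; the search terms are only examples:

```r
??"standard deviation"   # search all help pages for a phrase
args(sd)                 # show the arguments a function accepts
apropos("sqrt")          # list objects whose names contain "sqrt"
```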

3.3 Logical values

# Logical
TRUE

## [1] TRUE

FALSE

## [1] FALSE

!TRUE             # not operator

## [1] FALSE

2 == 3            # equal to operator

## [1] FALSE

2 != 2            # not equal to operator

## [1] FALSE

!(2 == 2)

## [1] FALSE

2 > 3

## [1] FALSE

2 <= 3

## [1] TRUE

(2 > 1) & (2 < 3) # and operator

## [1] TRUE

(2 > 1) | (2 < 3) # or operator

## [1] TRUE
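
A caution when using == with decimal numbers: computers store them with tiny rounding errors, so an exact comparison can fail unexpectedly. A minimal sketch using base R's all.equal(), which compares within a small tolerance:

```r
0.1 + 0.2 == 0.3                    # FALSE: the sum carries a tiny rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```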

3.4 Variables and Assignment

Variables are used to store values.

Use <- or = to assign a value to a name.

You can later retrieve the value by calling the variable name.

x <- 5          # Assign 5 to x
y <- 10         # Assign 10 to y
total <- x + y  # Add x and y
total           # Output 15

## [1] 15

4 Vectors

Vectors are one-dimensional arrays that can hold numeric, character, or logical data. You can create vectors using the combine function, c().

# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)

# Creating a character vector
characters <- c("apple", "banana", "cherry")

# Creating a logical vector
logicals <- c(TRUE, FALSE, TRUE)

# Accessing elements of a vector
numbers[1]       # First element

## [1] 1

characters[2:3]  # Second and third elements

## [1] "banana" "cherry"

logicals[c(1, 3)] # First and third elements

## [1] TRUE TRUE

# Vectorised operations
numbers * 2       # Multiply each element by 2

## [1]  2  4  6  8 10

numbers + 10      # Add 10 to each element

## [1] 11 12 13 14 15

numbers > 3       # Logical comparison

## [1] FALSE FALSE FALSE  TRUE  TRUE

# Member of a vector
"cherry" %in% characters

## [1] TRUE
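
A logical comparison can also be used inside the square brackets to pick out elements, and vectors that mix types are converted to a single type. A short sketch building on the numbers vector above (mixed is an illustrative name):

```r
numbers <- c(1, 2, 3, 4, 5)

numbers[numbers > 3]   # keep only the elements greater than 3
length(numbers)        # number of elements
sum(numbers > 3)       # TRUE counts as 1, so this counts the elements > 3
mean(numbers)          # average of the elements

# Mixing types in one vector: everything is converted to character
mixed <- c(1, "apple", TRUE)
class(mixed)
```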

5 Installing packages

In the Console:

install.packages("palmerpenguins")
install.packages("tidyverse")

Alternatively, use the Packages tab in RStudio.
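
Packages only need to be installed once, so re-running install.packages() every session is unnecessary. One possible sketch that installs only what is missing; the names needed and to_install are illustrative, and the repos URL (the CRAN cloud mirror) is an assumption you can replace with your preferred mirror:

```r
# Packages used in this session
needed <- c("palmerpenguins", "tidyverse")

# Keep only those that are not installed yet
to_install <- needed[!(needed %in% rownames(installed.packages()))]

# Install the missing ones, if any
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```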

6 Introduction to tidyverse

This is a brief overview of the tidyverse, a set of packages for data science in R.

# Load the tidyverse collection of packages
library(tidyverse)

tidyverse is a collection of packages which includes:

ggplot2 for data visualisation
dplyr for data manipulation
tidyr for data tidying
readr for data import
tibble for data frames
stringr for string manipulation
forcats for categorical variables

and more…

The good news is: you don’t need to remember which package a function comes from. When you load the tidyverse, all of these packages are loaded at once.
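
If you do want to be explicit about where a function comes from, for instance when two packages define a function with the same name, the package::function() notation can be used. A small sketch; mtcars is a dataset that ships with R and is used here only for illustration:

```r
library(tidyverse)

# Names of all packages that belong to the tidyverse
tidyverse_packages()

# Explicit package::function() call: filter() from dplyr
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)
```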

You can find helpful cheatsheets here.

7 palmerpenguins: A tidyverse data example

A demonstration using the penguins dataset, provided by the palmerpenguins package.

The penguins dataset contains 344 rows and 8 variables:

species - a factor denoting penguin species (Adélie, Chinstrap and Gentoo)
island - a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)
bill_length_mm - a number denoting bill length (millimeters)
bill_depth_mm - a number denoting bill depth (millimeters)
flipper_length_mm - an integer denoting flipper length (millimeters)
body_mass_g - an integer denoting body mass (grams)
sex - a factor denoting penguin sex (female, male)
year - an integer denoting the study year (2007, 2008, or 2009)

library(palmerpenguins)
library(tidyverse)

# Structure of dataset
str(penguins)

## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

# Number of rows and columns - also nrow() and ncol()
dim(penguins)

## [1] 344   8

# Look at the first six rows of the dataset
head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

# Summary of dataset
summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

# A scatterplot exploring penguin body mass vs flipper length
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "darkblue", alpha = 0.6) +
  labs(title = "Penguins: Flipper Length vs Body Mass",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  geom_smooth(method = "lm", color = "red", se = FALSE)

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

# A boxplot to explore the association between body mass and species
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  labs(title = "Penguins: Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)")

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
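
The two plots can be combined into one view by mapping species to colour. The following is a sketch rather than part of the original analysis; it drops the rows with missing measurements first so that the warnings above do not appear (penguins_complete and p are illustrative names):

```r
library(palmerpenguins)
library(tidyverse)

# Drop rows with missing values in the two plotted variables
penguins_complete <- penguins |>
  drop_na(flipper_length_mm, body_mass_g)

# Colour points by species and fit one regression line per species
p <- ggplot(penguins_complete,
            aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Penguins: Flipper Length vs Body Mass by Species",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)",
       color = "Species")
p
```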

8 Descriptive statistics

Pipe operator (|>): takes what’s on the left and uses it as the first argument of the function on the right.

take_this |>
  and_then_do_this()
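
To make this concrete, here is a minimal sketch with ordinary numbers. It assumes R 4.1 or later, where the native pipe |> was introduced:

```r
# These two lines do the same thing
sqrt(16)
16 |> sqrt()

# Extra arguments go after the piped value:
# round(3.14159, 2) can be written as
3.14159 |> round(2)

# Pipes can be chained; read the chain from left to right
c(1, 4, 9) |> sqrt() |> sum()
```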

To compute the number of observations (n), mean (M), and standard deviation (SD) for a variable, you can use the following code:

# Descriptive statistics for body_mass_g
penguins |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )

## # A tibble: 1 × 3
##       n     M    SD
##   <int> <dbl> <dbl>
## 1   344 4202.  802.

Without the pipe operator and summarise(), you have to tell R each time which data frame the variable belongs to, using $:

nrow(penguins)

## [1] 344

mean(penguins$body_mass_g, na.rm = TRUE)

## [1] 4201.754

sd(penguins$body_mass_g, na.rm = TRUE)

## [1] 801.9545

To get the same statistics for a variable grouped by another variable (e.g., species), you can use the group_by() function:

# Descriptive statistics for body mass by species
penguins |>
  group_by(species) |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )

## # A tibble: 3 × 4
##   species       n     M    SD
##   <fct>     <int> <dbl> <dbl>
## 1 Adelie      152 3701.  459.
## 2 Chinstrap    68 3733.  384.
## 3 Gentoo      124 5076.  504.
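
Note that n() counts all rows, including those where body_mass_g is missing, whereas M and SD (with na.rm = TRUE) use only the non-missing values. If you want the n that matches M and SD, one option is to count the non-missing values directly. A sketch; n_rows, n_valid and body_mass_summary are illustrative names:

```r
library(palmerpenguins)
library(tidyverse)

body_mass_summary <- penguins |>
  group_by(species) |>
  summarise(
    n_rows  = n(),                       # all rows, including missing values
    n_valid = sum(!is.na(body_mass_g)),  # rows that actually enter M and SD
    M  = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )
body_mass_summary
```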

9 Data subsetting

penguins[1, ]           # First row

penguins[, 1]           # First column - by number
penguins[, "species"]   # First column - by name
penguins$species        # First column - by name (returns a vector, not a tibble)

penguins[c(1,2,3), ]    # First three rows

penguins[, c(1,2,3)]    # First three columns - by number
penguins[, c("species", "island", "bill_length_mm")] # First three columns - by name

penguins[c(1,2,3), c(1,2,3)] # First three rows and columns
penguins[1:3, 1:3]           # Same result, using the : shortcut for a range of consecutive values

# First three rows - using slice() from dplyr
penguins |>
  slice(1:3)

# Subset the data for only Adelie species
adelie_penguins <- penguins |> 
  filter(species == "Adelie")

# Check which rows have a non-missing body_mass_g
!is.na(penguins$body_mass_g)    # returns TRUE/FALSE for each row

# Keep only rows where body_mass_g is not NA
penguins_no_na <- penguins |> 
  filter( !is.na(body_mass_g) )

# Only keep columns species, body_mass_g
penguins_subset <- penguins_no_na |> 
  select(species, body_mass_g)
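
The steps above can also be chained into a single pipeline. A sketch, where the 4000 g cut-off and the name heavy_adelie are arbitrary choices for illustration:

```r
library(palmerpenguins)
library(tidyverse)

heavy_adelie <- penguins |>
  filter(species == "Adelie", body_mass_g > 4000) |>  # a comma means "and"
  select(species, island, body_mass_g) |>             # keep three columns
  arrange(desc(body_mass_g))                          # heaviest first
heavy_adelie
```

Rows where body_mass_g is NA are dropped by filter() here, because a comparison with NA is not TRUE.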

10 Creating new columns

You can create new columns in a data frame using the mutate() function from the dplyr package, which is part of the tidyverse.

The syntax works as follows:

data_name |>
  mutate(new_column_name = expression)

For example:

# Create a new column for body mass in kg
penguins <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)

# Create a new column for bill length and depth in cm
penguins <- penguins |>
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10
  )

If you want to overwrite an existing column, you can do so by using the same name in the mutate() function.
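
A minimal sketch of overwriting a column. The result is stored under a new, illustrative name (penguins_rounded) so that the original data stay untouched:

```r
library(palmerpenguins)
library(tidyverse)

# Overwrite bill_length_mm with its rounded value; no new column is added
penguins_rounded <- penguins |>
  mutate(bill_length_mm = round(bill_length_mm))

ncol(penguins)           # number of columns before
ncol(penguins_rounded)   # same number of columns after
```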

11 Factors

Sometimes categorical variables are not stored as characters/factors. They could be stored as numbers which represent categories.

For example, species is a categorical variable with three levels: Adélie, Chinstrap, and Gentoo. Suppose we have a dataset where species is stored as numbers (1, 2, 3) instead of characters. This can lead to confusion when analyzing the data, as the numerical representation does not convey the actual categories. Furthermore, R will compute a mean of those codes without complaint, even though the result is meaningless.

To convert a numeric variable to a factor, you can use the factor() function.

# Toy dataset with species as number
data_example <- tibble(species = c(1, 2, 3, 1, 2, 3))
data_example

## # A tibble: 6 × 1
##   species
##     <dbl>
## 1       1
## 2       2
## 3       3
## 4       1
## 5       2
## 6       3

mean(data_example$species) # Nonsensical: R computes a mean, but it is meaningless for category codes

## [1] 2

# Convert species to a factor
data_example <- data_example |>
  mutate(species = factor(species))

mean(data_example$species) # For a factor, R gives a warning and returns NA

## Warning in mean.default(data_example$species): argument is not numeric or
## logical: returning NA

## [1] NA

# Convert species to a factor and choose order of levels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1))
  )

# Convert species to a factor and use better labels - check the data codebook for the labels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1),
                     labels = c("Gentoo", "Chinstrap", "Adelie"))
  )

# What are the factor levels?
levels(data_example$species)

## [1] "Gentoo"    "Chinstrap" "Adelie"
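
When relabelling codes, it is easy to pair a number with the wrong label. One way to check the mapping, sketched here with hypothetical names (data_check, species_code), is to keep the original codes in their own column and count the combinations:

```r
library(tidyverse)

# Keep the original codes next to the labelled factor
data_check <- tibble(species_code = c(1, 2, 3, 1, 2, 3)) |>
  mutate(
    species = factor(species_code,
                     levels = c(3, 2, 1),
                     labels = c("Gentoo", "Chinstrap", "Adelie"))
  )

# Each code should pair with exactly one label
code_label_pairs <- data_check |>
  count(species_code, species)
code_label_pairs
```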

12 Reading data into R

Consider this dataset

A file path is a string that specifies the location of a file or directory in your computer’s file system. It tells your system how to navigate through folders to locate the file. There are two main types:

Absolute Path: Specifies the complete location of the file, starting from the root folder.

Relative Path: Specifies the location relative to the current folder. In R, this is the working directory, which you can check with getwd().

Paths are essential for accessing or referencing files correctly in your code.
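
A short sketch of working with paths in base R. The folder name "data" and the object name csv_path are hypothetical and only serve as an illustration:

```r
# The folder R currently uses as the starting point for relative paths
getwd()

# Build a relative path from its parts
csv_path <- file.path("data", "penguins.csv")
csv_path

# Check whether the file exists before trying to read it
file.exists(csv_path)

# Show the corresponding absolute path
normalizePath(csv_path, mustWork = FALSE)
```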

# Read from local file
penguins_data <- read_csv("path/to/penguins.csv")

# Read from URL
penguins_data <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

The following functions are available for reading different types of files:

read_csv() for CSV files
read_tsv() for tab-separated files
read_excel() for Excel files (run library(readxl) first; readxl is not loaded by library(tidyverse))
After library(haven):
- read_dta() for Stata files
- read_sav() for SPSS files
- read_sas() for SAS files
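
One thing to watch for after reading a CSV file: factors are stored as plain text, so they come back as character columns. A sketch that writes the penguins data to a temporary file, reads it back, and converts the categorical columns again (tmp_file and penguins_csv are illustrative names):

```r
library(palmerpenguins)
library(tidyverse)

# Write the data to a temporary CSV file, then read it back
tmp_file <- tempfile(fileext = ".csv")
write_csv(penguins, tmp_file)
penguins_csv <- read_csv(tmp_file, show_col_types = FALSE)

class(penguins_csv$species)   # character after reading

# Convert the categorical columns back to factors
penguins_csv <- penguins_csv |>
  mutate(across(c(species, island, sex), factor))

class(penguins_csv$species)   # factor after conversion
```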

13 Summary & Questions

This introductory session builds the foundation in R and the tidyverse for the forthcoming four days, which will focus on the linear model.

R & RStudio installation and interface.
Basics of R with emphasis on tidyverse.
A practical data example with palmerpenguins.
Any questions?