Detailed installation instructions can be found on this webpage.
To update R (the installr package works on Windows only), in RStudio run:
install.packages("installr")
installr::updateR()
To update your installed packages, in RStudio run:
options(pkgType = "binary")
update.packages(ask = FALSE)
RStudio Panes and Interface
Ctrl+Enter or Cmd+Enter to send code to the Console for execution.
Console vs R scripts vs Rmarkdown (.Rmd) files
Ctrl+Enter or Cmd+Enter).
Code vs Comments (vs Text in Rmarkdown)
Code: Code is the actual R commands that are executed to perform tasks. In R scripts, code is written directly in the file. In RMarkdown, code is written inside code chunks (e.g., ```{r} ... ```).
Text in RMarkdown: In RMarkdown, text outside of code chunks is treated as narrative or explanatory text. It is written in Markdown syntax and is used to provide context, explanations, or documentation alongside the code and its output.
Comments: Comments are lines of text in R scripts (or in RMarkdown code chunks) that are not executed as code. They are used to explain or document the code. In R, comments start with #. For example:
# This is a comment
x <- 5 # Assign 5 to x
R can perform basic arithmetic operations. Here are some examples:
1 + 2 # Addition
## [1] 3
5 - 3 # Subtraction
## [1] 2
2 * 3 # Multiplication
## [1] 6
1 / 2 # Division
## [1] 0.5
# Exponentiation
2^3
## [1] 8
# Square roots
sqrt(4)
## [1] 2
9^(1/2)
## [1] 3
# Standard functions such as log(), exp(), log10() also exist
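As a brief illustration of the functions named in the last comment, the sketch below calls each one on values chosen so that the results are easy to check by hand. The variable names are illustrative only and do not come from the original material.

```r
# log() is the natural logarithm (base e) unless a base is supplied
natural_log <- log(1)            # 0, because e^0 = 1
exponential <- exp(0)            # 1
common_log  <- log10(1000)       # 3, because 10^3 = 1000
binary_log  <- log(8, base = 2)  # 3, because 2^3 = 8

# exp() and log() undo each other (up to floating-point rounding)
round_trip <- exp(log(5))
```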
Remember, order of operations matters! Use parentheses to ensure the correct order.
# This will give different results
(1 + 2) * 3 # Parentheses first
## [1] 9
1 + 2 * 3 # Multiplication first
## [1] 7
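Two further precedence cases that often surprise newcomers are sketched below: exponentiation binds more tightly than the unary minus, and chained exponents are evaluated from right to left. The variable names are illustrative only.

```r
# Exponentiation is applied before the minus sign
negated_square   <- -2^2     # read as -(2^2), so -4
squared_negative <- (-2)^2   # parentheses force the negation first, so 4

# Chained exponents are evaluated right to left
right_to_left <- 2^3^2       # read as 2^(3^2) = 2^9 = 512
left_to_right <- (2^3)^2     # 8^2 = 64
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.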
?sqrt
help(sqrt)
# Logical
TRUE
## [1] TRUE
FALSE
## [1] FALSE
!TRUE # not operator
## [1] FALSE
2 == 3 # equal to operator
## [1] FALSE
2 != 2 # not equal to operator
## [1] FALSE
!(2 == 2)
## [1] FALSE
2 > 3
## [1] FALSE
2 <= 3
## [1] TRUE
(2 > 1) & (2 < 3) # and operator
## [1] TRUE
(2 > 1) | (2 < 3) # or operator
## [1] TRUE
Variables are used to store values.
Use <- (or =) to assign a value to a name.
You can later retrieve the value by calling the variable name.
x <- 5 # Assign 5 to x
y <- 10 # Assign 10 to y
total <- x + y # Add x and y
total # Output 15
## [1] 15
Vectors are one-dimensional arrays that can hold numeric, character, or logical data. You can create vectors using the combine function, c().
# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)
# Creating a character vector
characters <- c("apple", "banana", "cherry")
# Creating a logical vector
logicals <- c(TRUE, FALSE, TRUE)
# Accessing elements of a vector
numbers[1] # First element
## [1] 1
characters[2:3] # Second and third elements
## [1] "banana" "cherry"
logicals[c(1, 3)] # First and third elements
## [1] TRUE TRUE
# Vectorised operations
numbers * 2 # Multiply each element by 2
## [1] 2 4 6 8 10
numbers + 10 # Add 10 to each element
## [1] 11 12 13 14 15
numbers > 3 # Logical comparison
## [1] FALSE FALSE FALSE TRUE TRUE
# Member of a vector
"cherry" %in% characters
## [1] TRUE
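The logical comparison shown above becomes more useful when combined with indexing and summary functions. The following is a minimal sketch that reuses the numbers vector; the other variable names are illustrative only.

```r
numbers <- c(1, 2, 3, 4, 5)

# Summary functions take the whole vector and return a single value
vector_length <- length(numbers)  # 5 elements
vector_total  <- sum(numbers)     # 15
vector_mean   <- mean(numbers)    # 3

# A logical vector inside [ ] keeps only the elements where it is TRUE
large_numbers <- numbers[numbers > 3]  # 4 5

# TRUE counts as 1 and FALSE as 0, so sum() counts how many elements match
count_large <- sum(numbers > 3)        # 2
```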
In console:
install.packages("palmerpenguins")
install.packages("tidyverse")
Using the Packages tab in RStudio.
This is a brief overview of tidyverse, the set of packages for data science in R.
# Load the tidyverse collection of packages
library(tidyverse)
tidyverse is a collection of packages which includes:
ggplot2 for data visualisation
dplyr for data manipulation
tidyr for data tidying
readr for data import
tibble for data frames
stringr for string manipulation
forcats for categorical variables
and more…
The good news is: you don’t need to remember which package a function comes from. When you load the tidyverse, all of these packages are loaded at once.
You can find helpful cheatsheets here.
A demonstration using the penguins dataset, provided by the palmerpenguins package.
The penguins dataset contains 344 rows and 8 variables:
species - a factor denoting penguin species (Adélie, Chinstrap and Gentoo)
island - a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)
bill_length_mm - a number denoting bill length (millimeters)
bill_depth_mm - a number denoting bill depth (millimeters)
flipper_length_mm - an integer denoting flipper length (millimeters)
body_mass_g - an integer denoting body mass (grams)
sex - a factor denoting penguin sex (female, male)
year - an integer denoting the study year (2007, 2008, or 2009)
library(palmerpenguins)
library(tidyverse)
# Structure of dataset
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Number of rows and columns - also nrow() and ncol()
dim(penguins)
## [1] 344 8
# Look at the first six rows of the dataset
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
# Summary of dataset
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
##                                  Mean   :43.92   Mean   :17.15
##                                  3rd Qu.:48.50   3rd Qu.:18.70
##                                  Max.   :59.60   Max.   :21.50
##                                  NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex           year
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
##  Median :197.0     Median :4050   NA's  : 11   Median :2008
##  Mean   :200.9     Mean   :4202                Mean   :2008
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
##  Max.   :231.0     Max.   :6300                Max.   :2009
##  NA's   :2         NA's   :2
# A scatterplot exploring penguin body mass vs flipper length
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "darkblue", alpha = 0.6) +
  labs(title = "Penguins: Flipper Length vs Body Mass",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  geom_smooth(method = "lm", color = "red", se = FALSE)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# A boxplot to explore the association between body mass and species
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  labs(title = "Penguins: Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Pipe operator: take what’s on the left and use it as the first argument of the function on the right.
take_this |>
  and_then_do_this()
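To make the pattern concrete, here is a minimal sketch using only base R functions. It assumes R 4.1.0 or later, where the native pipe |> was introduced; the values and variable names are illustrative only.

```r
values <- c(4, 9, 16)

# These two calls are equivalent: the pipe passes `values` as the first argument
nested_call <- sqrt(values)
piped_call  <- values |> sqrt()

# Further arguments go inside the parentheses as usual: same as round(3.14159, 2)
rounded <- 3.14159 |> round(2)

# Chained steps read top to bottom instead of inside out: same as sum(sqrt(values))
total_of_roots <- values |>
  sqrt() |>
  sum()
```

The chained version computes 2 + 3 + 4, the same result as the nested call sum(sqrt(values)), but the steps appear in the order in which they are carried out.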
To compute the number of observations (n), mean (M), and standard deviation (SD) for a variable, you can use the following code:
# Descriptive statistics for body_mass_g
penguins |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##       n     M    SD
##   <int> <dbl> <dbl>
## 1   344 4202.  802.
Without the pipe operator and summarise(), you have to name the data frame in every call:
nrow(penguins)
## [1] 344
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
sd(penguins$body_mass_g, na.rm = TRUE)
## [1] 801.9545
To get the same statistics for a variable grouped by another variable (e.g., species), you can use the group_by() function:
# Descriptive statistics for body mass by species
penguins |>
  group_by(species) |>
  summarise(
    n = n(),
    M = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   species       n     M    SD
##   <fct>     <int> <dbl> <dbl>
## 1 Adelie      152 3701.  459.
## 2 Chinstrap    68 3733.  384.
## 3 Gentoo      124 5076.  504.
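One detail worth checking in such tables is that n() counts rows, including rows where the variable is missing, whereas the mean and standard deviation computed with na.rm = TRUE use only the non-missing values. The sketch below shows this on a small made-up data frame. The data, the column names and the extra n_valid column are assumptions for illustration, and only dplyr is loaded instead of the whole tidyverse.

```r
library(dplyr)

# Hypothetical toy data: group "b" has one missing value
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, NA, 20)
)

toy_summary <- toy |>
  group_by(group) |>
  summarise(
    n       = n(),                  # number of rows in the group
    n_valid = sum(!is.na(value)),   # number of non-missing values
    M       = mean(value, na.rm = TRUE),
    SD      = sd(value, na.rm = TRUE)
  )

toy_summary
```

In group "b", n is 3 but the mean is based on only 2 values, which is why reporting the count of non-missing values alongside n can be helpful.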
penguins[1, ]          # First row
penguins[, 1]          # First column - by number
penguins[, "species"]  # First column - by name
penguins$species       # First column - by name, returned as a vector
penguins[c(1,2,3), ]   # First three rows
penguins[, c(1,2,3)]   # First three columns - by number
penguins[, c("species", "island", "bill_length_mm")] # First three columns - by name
penguins[c(1,2,3), c(1,2,3)] # First three rows and first three columns
penguins[1:3, 1:3]     # Same, using the : shortcut for a range of consecutive values
# First three rows using dplyr
penguins |>
  slice(1:3)
# Subset the data for only Adelie species
adelie_penguins <- penguins |>
  filter(species == "Adelie")
# Remove specific rows of the penguins dataset
!is.na(penguins$body_mass_g) # returns TRUE/FALSE for each row
# Filter out rows with NA values in body_mass_g
penguins_no_na <- penguins |>
  filter(!is.na(body_mass_g))
# Only keep columns species, body_mass_g
penguins_subset <- penguins_no_na |>
  select(species, body_mass_g)
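For comparison, the same kind of subsetting can be written with the square-bracket indexing shown earlier. The sketch below uses a small made-up data frame (an assumption for illustration, not the penguins data) and base R only. It also shows one difference to be aware of: a comparison involving NA returns NA, and bracket indexing turns such rows into all-NA rows, whereas filter() drops them.

```r
# Hypothetical toy data with one missing body mass
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  body_mass_g = c(3750, 5000, NA, 3800)
)

# Comparable to filter(species == "Adelie")
adelie_rows <- toy[toy$species == "Adelie", ]

# Comparable to filter(!is.na(body_mass_g))
complete_rows <- toy[!is.na(toy$body_mass_g), ]

# The condition is TRUE, TRUE, NA, TRUE: the NA produces an extra all-NA row
heavy_with_na_row <- toy[toy$body_mass_g > 3700, ]

# which() keeps only the positions that are TRUE, so the NA row disappears
heavy_rows <- toy[which(toy$body_mass_g > 3700), ]
```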
You can create new columns in a data frame using the mutate() function from the dplyr package, which is part of the tidyverse.
The syntax works as follows:
data_name |>
  mutate(new_column_name = expression)
For example:
# Create a new column for body mass in kg
penguins <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)
# Create a new column for bill length and depth in cm
penguins <- penguins |>
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10
  )
If you want to overwrite an existing column, you can do so by using the same name in the mutate() function.
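A minimal sketch of overwriting, using a small made-up data frame so that the penguins data stay untouched. The data and column names are assumptions for illustration, and only dplyr is loaded.

```r
library(dplyr)

# Hypothetical toy data: body mass recorded in grams
toy <- data.frame(
  id        = 1:3,
  body_mass = c(3750, 3800, 3250)
)

# Reusing the name body_mass replaces the column instead of adding a new one
toy <- toy |>
  mutate(body_mass = body_mass / 1000)  # now in kilograms

toy
```

The data frame still has two columns afterwards. Because the original values are gone once the result is assigned back to the same object, creating a new column is the safer choice when the original may still be needed.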
Sometimes categorical variables are not stored as characters or factors. They may instead be stored as numbers that represent categories.
For example, species is a categorical variable with three levels: Adélie, Chinstrap, and Gentoo. Suppose we have a dataset where species is stored as numbers (1, 2, 3) instead of characters. This can lead to confusion when analyzing the data, as the numerical representation does not convey the actual categories. Furthermore, R treats those codes as ordinary numbers, so it will compute a mean of them even though that mean is meaningless.
To convert a numeric variable to a factor, you can use the factor() function.
# Toy dataset with species as number
data_example <- tibble(species = c(1, 2, 3, 1, 2, 3))
data_example
## # A tibble: 6 × 1
## species
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 1
## 5 2
## 6 3
mean(data_example$species) # Nonsensical result: R computes it, but a categorical variable has no meaningful mean
## [1] 2
# Convert species to a factor
data_example <- data_example |>
  mutate(species = factor(species))
mean(data_example$species) # R now warns and returns NA
## Warning in mean.default(data_example$species): argument is not numeric or
## logical: returning NA
## [1] NA
# Convert species to a factor and choose order of levels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1))
  )
# Convert species to a factor and use better labels - check the data codebook for the labels
data_example <- data_example |>
  mutate(
    species = factor(species,
                     levels = c(3, 2, 1),
                     labels = c("Gentoo", "Chinstrap", "Adelie"))
  )
# What are the factor levels?
levels(data_example$species)
## [1] "Gentoo" "Chinstrap" "Adelie"
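To see what the conversion changes, the sketch below builds the same factor directly from a vector of codes with base R and inspects it. The variable names are illustrative only.

```r
codes <- c(1, 2, 3, 1, 2, 3)

species <- factor(codes,
                  levels = c(3, 2, 1),
                  labels = c("Gentoo", "Chinstrap", "Adelie"))

# Frequency table, listed in the order of the levels
species_counts <- table(species)

# as.integer() returns each value's position among the levels,
# not the original code: code 1 ("Adelie") is now the third level
level_positions <- as.integer(species)  # 3 2 1 3 2 1

# as.character() returns the labels
species_labels <- as.character(species)
```

Because the integer positions depend on the chosen level order, they should not be mistaken for the original numeric codes.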
Consider this dataset
A file path is a string that specifies the location of a file or directory in your computer’s file system. It tells your system how to navigate through folders to locate the file. There are two main types:
Absolute Path: Specifies the complete directory list from the root folder.
Relative Path: Specifies the location relative to the current file or folder.
Paths are essential for accessing or referencing files correctly in your code.
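The difference between the two kinds of path can be sketched with base R functions. The folder name data and the file name penguins.csv are hypothetical and serve only as an illustration.

```r
# A relative path: interpreted relative to the current working directory
relative_path <- file.path("data", "penguins.csv")  # hypothetical location

# The working directory R is currently using
working_dir <- getwd()

# An absolute path: the working directory followed by the relative path
absolute_path <- file.path(working_dir, relative_path)

# Check whether a file actually exists at that location before reading it
file_found <- file.exists(absolute_path)
```

file.path() joins the pieces with a forward slash, which R accepts on Windows as well as on macOS and Linux.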
# Read from local file
penguins_data <- read_csv("path/to/penguins.csv")
# Read from URL
penguins_data <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
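To try read_csv() without depending on a particular file or an internet connection, one option is to write a small data frame to a temporary file and read it back. This is a sketch under stated assumptions: the toy data are made up, only readr is loaded, and the show_col_types argument assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical toy data
toy <- data.frame(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# Write the data to a temporary CSV file, then read it back
toy_path <- tempfile(fileext = ".csv")
write_csv(toy, toy_path)

# show_col_types = FALSE suppresses the column specification message
toy_read <- read_csv(toy_path, show_col_types = FALSE)

toy_read
```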
The following functions are also available to read other types of files:
read_csv() for CSV files
read_tsv() for tab-separated files
read_excel() for Excel files (from the readxl package, which library(tidyverse) does not load: run library(readxl) first)
library(haven) and then
read_dta() for Stata files
read_sav() for SPSS files
read_sas() for SAS files
This introductory session builds the foundation in R and the tidyverse for the forthcoming four days, which will focus on the linear model.