Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!

LEARNING OBJECTIVES

  • LO1: Understand the appropriate visualisation for categorical data.
  • LO2: Understand methods to calculate the spread of categorical data.
  • LO3: Understand methods to calculate central tendency for categorical data.

Before we get started on the statistics, we’re going to briefly introduce a crucial bit of R code.

IMPORTANT READING: A different style of R code. Click the plus to expand →

Data Exploration

Once we have collected some data, one of the first things we want to do is explore it - and we can do this through describing (or summarising) and visualising variables.

We are already familiar with the function summary(), which provides high-level information about our data, showing us things such as the minimum and maximum and mean of continuous variables, or the numbers of entries falling into each possible response level for a categorical variable:

summary(starwars2)
##      name               height       hair_color         eye_color        
##  Length:75          Min.   : 79.0   Length:75          Length:75         
##  Class :character   1st Qu.:167.5   Class :character   Class :character  
##  Mode  :character   Median :180.0   Mode  :character   Mode  :character  
##                     Mean   :176.1                                        
##                     3rd Qu.:191.0                                        
##                     Max.   :264.0                                        
##   homeworld           species         
##  Length:75          Length:75         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

What we are doing here is providing numeric descriptions of the distributions of values in each variable.

Distribution

The distribution of a variable shows how often different values occur. In this lab, we’re going to focus on describing and visualising distributions of categorical data.

A graph of a variable’s distribution shows us where the values are centred, how the values vary, and gives some information about where a typical value might fall. It can also alert you to the presence of outliers (unexpected observations).

Unordered Categorical (Nominal) Data

For variables with a discrete set of response options, we can easily measure “how often” values occur in terms of their frequency.

Frequency distribution

A frequency distribution is an overview of all distinct values in some variable and the number of times they occur.

Suppose that we have surveyed the people working in a psychology department, asking them what sub-discipline of psychological research they most strongly identify as working within. (If you would like to work along with the reading, the data is available at https://uoepsy.github.io/data/psych_survey.csv.)

Variable Name | Description
participant   | Subject identifier
area          | Respondent’s sub-discipline of psychology


First, we read our data into R and store it in an object called “psych_disciplines”:

psych_disciplines <- read_csv("https://uoepsy.github.io/data/psych_survey.csv")
psych_disciplines
## # A tibble: 74 x 2
##    participant   area                  
##    <chr>         <chr>                 
##  1 respondent_1  Differential          
##  2 respondent_2  Social                
##  3 respondent_3  Differential          
##  4 respondent_4  Social                
##  5 respondent_5  Differential          
##  6 respondent_6  Differential          
##  7 respondent_7  Language              
##  8 respondent_8  Language              
##  9 respondent_9  Cognitive Neuroscience
## 10 respondent_10 Language              
## # … with 64 more rows

We can get the frequencies of different response levels of the discipline variable by using the following code:

# start with the psych_disciplines dataframe 
# %>%
# count() the values in the "area" variable 
psych_disciplines %>%
    count(area)
## # A tibble: 5 x 2
##   area                       n
##   <chr>                  <int>
## 1 Cognitive Neuroscience    24
## 2 Developmental             10
## 3 Differential              20
## 4 Language                   9
## 5 Social                    11
Extra detail on how this works. Click the plus to expand →

Frequency table

To describe a distribution like this, we can simply provide the frequency table.
Let’s store it as an object in R:

# make a new object called "freq_table", and assign it:
# the counts of values of "area" variable in 
# the psych_discipline dataframe.
freq_table <- 
    psych_disciplines %>%
    count(area)

# show the object called "freq_table"
freq_table
## # A tibble: 5 x 2
##   area                       n
##   <chr>                  <int>
## 1 Cognitive Neuroscience    24
## 2 Developmental             10
## 3 Differential              20
## 4 Language                   9
## 5 Social                    11

For a report, we might want to make the table a little more easily readable.
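One way to do this (a sketch, assuming the knitr package is available, as it is in most RStudio installations) is to give the columns clearer names with rename() and pass the result to knitr::kable():

# give the columns reader-friendly names, then format the result
# as a simple table (assumes the knitr package is installed)
freq_table %>%
    rename(
        `Sub-discipline` = area,
        `Frequency` = n
    ) %>%
    knitr::kable()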

Central tendency

Often, we might want to summarise our data into a single value, reflecting the point at (or around) which most of the values tend to cluster. This is known as a measure of central tendency. For numeric data, we can use measures such as the mean, which you will likely have heard of. For nominal data (unordered categorical data), however, our only option is to use the mode.

Mode

The most frequent value (the value that occurs the greatest number of times).

In our case, the mode is the “Cognitive Neuroscience” category.
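Base R has no built-in function for the statistical mode (confusingly, R’s mode() function refers to the storage type of an object instead), but we can pull it straight out of our frequency table. A minimal sketch using dplyr’s filter():

# keep only the row(s) of freq_table where the count n is at its maximum
freq_table %>%
    filter(n == max(n))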

Relative frequencies

We might alternatively want to show the percentage of respondents in each category, rather than the raw frequencies.
The percentages show the relative frequency distribution.

Relative frequency distribution

A relative frequency distribution shows the proportion of times each value occurs
(contrast this with the frequency distribution, which shows the number of times).
Relative frequencies can be written as fractions, percentages, or decimals.

In the object “freq_table”, we have a variable called n, which contains the frequencies (the number in each category).
The total of this column is equal to the total number of respondents:

# sum all the values in the "n" variable in the "freq_table" object
sum(freq_table$n)
## [1] 74

And therefore, each value in freq_table$n, divided by the total, is equal to the proportion in each category:
( Tip: Proportions are percentages/100. So 0.4 is another way of expressing 40%)

# take the values in the "n" variable from the "freq_table" object, 
# and divide by the sum of all the values in the "n" variable in "freq_table"
freq_table$n/sum(freq_table$n)
## [1] 0.3243243 0.1351351 0.2702703 0.1216216 0.1486486

We can then simply add the proportions as a new column to our table of frequencies by assigning the values we just calculated to a new variable:

# the variable "prop" in the "freq_table" object is now assigned 
# the values we calculated above (the proportions)
freq_table$prop <- freq_table$n/sum(freq_table$n)

# print the "freq_table" object
freq_table
## # A tibble: 5 x 3
##   area                       n  prop
##   <chr>                  <int> <dbl>
## 1 Cognitive Neuroscience    24 0.324
## 2 Developmental             10 0.135
## 3 Differential              20 0.270
## 4 Language                   9 0.122
## 5 Social                    11 0.149


However, we can also do this within a sequence of pipes (%>%). To do so, we use a new function called mutate().

mutate()

The mutate() function is used to add new variables to a dataframe, or to modify existing ones.

# take the data
# %>%
# mutate it, such that there is a variable called "newvariable", which
# has the values of a variable called "oldvariable" multiplied by two.
data %>%
  mutate(
    newvariable = oldvariable * 2
  )

Note: Inside mutate(), we don’t have to keep using the dollar sign $, as we have already told it what data to look for variables in.

To ensure that our additions/modifications of variables are stored in R’s environment (rather than simply printed out), we need to reassign the name of our dataframe:

data <- 
  data %>%
  mutate(
    newvariable = oldvariable * 2
  )

We can actually add this step to our earlier code:

# make a new object called "freq_table", and assign it:
# the counts of values of "area" variable in 
# the psych_discipline dataframe.
# from there, 'mutate' such that there is a variable called "prop" which
# has the values of the "n" variable divided by the sum of the "n" variable.
freq_table <- 
  psych_disciplines %>%
  count(area) %>%
  mutate(
    prop = n/sum(n)
  )

# show the object called "freq_table"
freq_table
## # A tibble: 5 x 3
##   area                       n  prop
##   <chr>                  <int> <dbl>
## 1 Cognitive Neuroscience    24 0.324
## 2 Developmental             10 0.135
## 3 Differential              20 0.270
## 4 Language                   9 0.122
## 5 Social                    11 0.149

Visualising

“By visualizing information, we turn it into a landscape that you can explore with your eyes. A sort of information map. And when you’re lost in information, an information map is kind of useful.” – David McCandless

We’re now going to take our first steps into the world of data visualisation. R is an incredibly capable language for creating visualisations of almost any kind. It is used by many media companies (e.g., the BBC), and can produce 3D visualisations, animations, interactive graphs, and more.

We are going to use the most popular R package for visualisation, ggplot2. It is part of the tidyverse, so if we have an Rmarkdown document and have loaded the tidyverse packages at the start (by using library(tidyverse)), then ggplot2 will be loaded too.

Recall our frequency distribution table:

# show the object called "freq_table"
freq_table
## # A tibble: 5 x 3
##   area                       n  prop
##   <chr>                  <int> <dbl>
## 1 Cognitive Neuroscience    24 0.324
## 2 Developmental             10 0.135
## 3 Differential              20 0.270
## 4 Language                   9 0.122
## 5 Social                    11 0.149

We can plot these values as a bar chart:

ggplot(data = freq_table, aes(x = area, y = n)) +
    geom_col()

Figure 1: Artwork by @allison_horst

ggplot components

Note the key components of the ggplot code.

  • data = where we provide the name of the dataframe.
  • aes = where we provide the aesthetics. These are things which we map from the data to the graph. For instance, the x-axis, or if we wanted to colour the columns/bars according to some aspect of the data.

Then we add (using +) some geometry. These are the shapes (in our case, the columns/bars), which will be put in the correct place according to what we specified in aes().

  • + geom_col() Adds columns to the plot.
Optional - Different aes() and geoms, and labels. Click the plus to expand →
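As a small taste of the kinds of refinements covered in the optional box above, here is a sketch adding clearer axis labels to our bar chart with ggplot2’s labs():

# the same bar chart as before, with more descriptive axis labels
ggplot(data = freq_table, aes(x = area, y = n)) +
    geom_col() +
    labs(x = "Sub-discipline", y = "Frequency")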

Ordered Categorical (Ordinal) Data

Recall that ordinal data is categorical data which has a natural ordering of the possible responses. One of the most common examples of ordinal data which you will encounter in psychology is the Likert Scale. You will probably have come across these before, perhaps when completing online surveys or questionnaires.

Likert Scale

A five- or seven-point scale on which an individual expresses how much they agree or disagree with a particular statement.

With Likert data, there is a set of discrete response options (it is categorical data). The response options can be ranked, making it ordered categorical ( strongly disagree < disagree < neither < agree < strongly agree ). Importantly, the distance between responses is not measurable.
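In R, we could store responses on such a scale as an ordered factor, so that R knows the ranking of the levels. A sketch with a few made-up responses:

# store Likert responses as an ordered factor: the levels argument
# gives the ordering from lowest to highest
agreement <- factor(
    c("agree", "neither", "strongly agree", "disagree"),
    levels = c("strongly disagree", "disagree", "neither", "agree", "strongly agree"),
    ordered = TRUE
)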

Frequency table

Let’s suppose that as well as collecting information on the sub-discipline of psychology they identified with, we also asked our respondents to rate their level of happiness from 1 to 5, as well as their job satisfaction from 1 to 5.

Variable Name | Description
participant   | Subject identifier
happiness     | Respondent’s level of happiness from 1 to 5
job_sat       | Respondent’s level of job satisfaction from 1 to 5

psych_survey <- read_csv("https://uoepsy.github.io/data/psych_survey2.csv")
psych_survey
## # A tibble: 74 x 3
##    participant   happiness job_sat
##    <chr>             <dbl>   <dbl>
##  1 respondent_1          3       3
##  2 respondent_2          3       4
##  3 respondent_3          2       5
##  4 respondent_4          4       5
##  5 respondent_5          3       5
##  6 respondent_6          4       4
##  7 respondent_7          4       2
##  8 respondent_8          5       5
##  9 respondent_9          1       5
## 10 respondent_10         3       4
## # … with 64 more rows

For these questions (variables happiness and job_sat), we could do the same thing as we did above for unordered categorical data, and summarise this into frequencies:

# take the "psych_survey" dataframe %>%
# count() the values in the "happiness" variable 
psych_survey %>%
    count(happiness)
## # A tibble: 5 x 2
##   happiness     n
##       <dbl> <int>
## 1         1     6
## 2         2    13
## 3         3    27
## 4         4    21
## 5         5     7
# take the "psych_survey" dataframe %>%
# count() the values in the "job_sat" variable 
psych_survey %>%
    count(job_sat)
## # A tibble: 5 x 2
##   job_sat     n
##     <dbl> <int>
## 1       1     3
## 2       2     6
## 3       3    11
## 4       4    16
## 5       5    38

Central tendency

We could again use the mode - the most common value - to summarise this data. However, because the responses are ordered, it can be more useful to think about the percentage of respondents in and below/above each category. For instance, we might ask which category has 50% of the observations below it and 50% above it. This mid-point is known as the median.

Median

The value for which 50% of observations are lower and 50% are higher. It is the mid-point of a list of ordered values.

To find the median:

  1. rank order the values
  2. find the middle value:
    • If there are \(n\) values, find the value at position \(\frac{n+1}{2}\).
    • If \(n\) is even, \(\frac{n+1}{2}\) will not be a whole number.
      For instance, if \(n = 20\), you are looking for the \(\frac{n+1}{2} = \frac{20+1}{2} = 10.5^{th}\) value.
      • When calculating the median for ordinal data, if the \(\frac{n}{2}^{th}\) and \(\left(\frac{n}{2}+1\right)^{th}\) values are different, report both.
      • When calculating the median for numeric data, report the midpoint of the \(\frac{n}{2}^{th}\) and \(\left(\frac{n}{2}+1\right)^{th}\) values.
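We can see these steps in R with some made-up values (a sketch; sort() puts the values in rank order):

# seven made-up values
vals <- c(4, 2, 5, 3, 3, 1, 4)
sort(vals)  # step 1: rank order the values
## [1] 1 2 3 3 4 4 5
sort(vals)[(7 + 1)/2]  # step 2: with n = 7, take the (7+1)/2 = 4th value
## [1] 3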

In the previous lab we discussed how to tell R explicitly that a variable is of a certain type using functions such as as.factor(), as.numeric(), and so on.
You may notice that we haven’t done this yet with the data we have been working with so far today:

# inside the "psych_survey" dataframe, take ($) the "happiness" variable,
# and tell me what type/class it is
class(psych_survey$happiness)
## [1] "numeric"

This is because there are some benefits to letting R think your data is numeric, even when it is not. It means we can use functions such as median() to quickly find the median:

# inside the "psych_survey" dataframe, take ($) the "happiness" variable,
# and find the median
median(psych_survey$happiness)
## [1] 3

Be careful
While we can make R treat this data as numeric, it is important to remember that it is actually measured on an ordinal scale.

For example, if the median falls between levels, R will tell us that the median is the mid-point:

# for the values 2,1,2,3,4,5, 
# find the median
median(c(2,1,2,3,5,4))
## [1] 2.5

But because our data is ordinal, we know that 2.5 is not a valid response.
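For ordinal data like this, we might instead report both middle categories. A sketch with the same six values:

# rank order the six values, then take the 3rd and 4th (the two middle values)
vals <- c(2, 1, 2, 3, 5, 4)
sort(vals)[c(3, 4)]
## [1] 2 3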

We can also use functions such as min() and max() to find the minimum and maximum values:

# inside the "psych_survey" dataframe, take ($) the "happiness" variable,
# and find the minimum value
min(psych_survey$happiness)
## [1] 1
# and find the maximum value
max(psych_survey$happiness)
## [1] 5

Cumulative percentages, Quartiles

In calculating the median, we are going beyond talking about the relative frequencies (i.e., the percentage in each category), to talking about the cumulative percentage.

Cumulative percentage

Cumulative percentages are another way of expressing a frequency distribution.
They are the successive addition of percentages in each category. For example, the cumulative percentage for the 3rd category is the percentage of respondents in the 1st, 2nd, and 3rd categories:

Category   | Frequency count (n) | Relative frequency (%) | Cumulative frequency | Cumulative percentage
Response 1 | 10 | 13.33333 | 10 | 13.33333
Response 2 | 10 | 13.33333 | 20 | 26.66667
Response 3 | 20 | 26.66667 | 40 | 53.33333
Response 4 | 25 | 33.33333 | 65 | 86.66667
Response 5 | 10 | 13.33333 | 75 | 100.00000

We saw before how we can calculate the proportions/percentages in each category:
( Note: We multiply by 100 here to turn the proportion into a percentage)

# take the "psych_survey" dataframe %>%
# count() the values in the "happiness" variable (creates an "n" column), and
# from there, 'mutate' such that there is a variable called "percent" which
# has the values of the "n" variable divided by the sum of the "n" variable.
psych_survey %>%
  count(happiness) %>%
  mutate(
    percent = n/sum(n)*100
  )
## # A tibble: 5 x 3
##   happiness     n percent
##       <dbl> <int>   <dbl>
## 1         1     6    8.11
## 2         2    13   17.6 
## 3         3    27   36.5 
## 4         4    21   28.4 
## 5         5     7    9.46

We can add another variable containing the cumulative percentage, by using the cumsum() function.

# take the "psych_survey" dataframe %>%
# count() the values in the "happiness" variable (creates an "n" column), and
# from there, 'mutate' such that there is a variable called "percent" which
# has the values of the "n" variable divided by the sum of the "n" variable,
# and also make a variable called "cumulative_percent" which is the 
# successive addition of the values in the "percent" variable
psych_survey %>% 
  count(happiness) %>% 
  mutate(
    percent = n/sum(n)*100,
    cumulative_percent = cumsum(percent)
  )
## # A tibble: 5 x 4
##   happiness     n percent cumulative_percent
##       <dbl> <int>   <dbl>              <dbl>
## 1         1     6    8.11               8.11
## 2         2    13   17.6               25.7 
## 3         3    27   36.5               62.2 
## 4         4    21   28.4               90.5 
## 5         5     7    9.46             100
Optional extra. Click the plus to expand →

While the median splits the data in two (50% either side), you will often see data being split into four equal blocks.
The points which divide the four blocks are known as quartiles.

Quartiles

Quartiles are the points in rank-ordered data below which fall 25%, 50%, and 75% of the data.

  • The first quartile is the first category for which the cumulative percentage is \(\geq 25\%\).
  • The median is the first category for which the cumulative percentage is \(\geq 50\%\).
  • The third quartile is the first category for which the cumulative percentage is \(\geq 75\%\).
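As a sketch, here is how we might find the first quartile of the happiness ratings directly from the cumulative percentages (swapping the threshold for 50 or 75 gives the median and third quartile):

# build the cumulative percentages, keep the categories at or above 25%,
# and take the first of them: that category is the first quartile
psych_survey %>%
  count(happiness) %>%
  mutate(cumulative_percent = cumsum(n/sum(n)*100)) %>%
  filter(cumulative_percent >= 25) %>%
  slice(1)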

Looking at the quartiles gives us an idea of how spread out the data are.
As an example, if we had 10 categories A, B, C, D, E, F, G, H, I, J, and we knew that:

  • \(Q_1\) (the \(1^{st}\) quartile) = G,
  • \(Q_2\) (the \(2^{nd}\) quartile, the median) = H,
  • \(Q_3\) (the \(3^{rd}\) quartile) = H,

This tells us that the first 25% of the data falls in one of the categories from A to G (quite a large range), the second 25% falls in categories G and H (a small range), and the third 25% of the data falls entirely in category H.
So a lot of the data is between G and H, with the data being more sparse in the lower and higher categories.

Looking ahead to numeric data

We will talk about quartiles in numeric data too, where we commonly use the difference between the first and third quartiles as a measure of how spread out the data are. This is known as the inter-quartile range (IQR).
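For numeric data, base R can compute these directly with quantile() and IQR(). A sketch with made-up numeric values:

# eight made-up numeric scores
scores <- c(12, 15, 17, 18, 21, 24, 30, 31)
quantile(scores, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
IQR(scores)  # the inter-quartile range: Q3 - Q1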

Visualising

We can visualise ordered categorical data in the same way we did for unordered.
First we save our frequencies/percentages as a new object:

freq_table2 <- psych_survey %>%
  count(happiness) %>%
  mutate(
    percent = n/sum(n)*100
  )

Then we give that object to our ggplot code, with the appropriate aes() mappings:

# make a ggplot with the object "freq_table2". 
# on the x axis put the possible values in the "happiness" variable,
# on the y axis put the values in the "percent" variable.
# add columns for each entry in the data. 
ggplot(data = freq_table2, aes(x = happiness, y = percent)) + 
  geom_col()


Glossary

  • distribution: How often different possible values in a variable occur.
  • frequency: Number of occurrences (count) in a given response value.
  • relative frequency: Percentage/proportion of occurrences in a given response value.
  • cumulative percentage: Percentage of occurrences in or below a given response value (requires ordered data).
  • mode: Most common value.
  • median: Middle value.

  • %>% Takes the output of whatever is on the left-hand side and gives it as the input of whatever is on the right-hand side.
  • count() Counts the number of occurrences of each unique value in a variable.
  • mutate() Used to add variables to the dataframe, or modify existing variables.
  • min() Returns the minimum value of a variable.
  • max() Returns the maximum value of a variable.
  • median() Returns the median value of a variable.
  • ggplot() Creates a plot. Takes data= and a set of mappings aes() from the data to properties of the plot (e.g., x/y axes, colours).
  • geom_col() Adds columns to a ggplot.

Exercises

Question 1

Open a new Rmarkdown document for this set of exercises.

File > New File > R Markdown...

In your first code-chunk, load the tidyverse packages with the following command:

library(tidyverse)

Make sure you run the chunk.


Question 2

We’re going to use the data on popular passwords which we saw in the previous lab.
The data is available online at https://uoepsy.github.io/data/passworddata.csv.

Read in the data from the link.

Solution


Question 3

Produce a table of frequencies and relative frequencies (percentages/proportions) of the different types of passwords.

Solution


Question 4

What is the mode of password type? And what is the least common type?

Solution


Question 5

Produce a plot of the frequencies of password types:

Solution


Question 6

In the previous exercises using this dataset we worked with the strength_cat variable, and made R treat it as an ordered categorical variable (weak < medium < strong).
The strength variable is also an ordered categorical variable, but contains more levels, measuring password strength on values from 1 to 10.

Produce a table showing the frequencies and cumulative percentages of the different strength levels of passwords in the data.

Solution


Question 7

From looking only at the table you made in the previous question, what is the median strength level?

Solution


Question 8

Check that your answer is correct by passing the strength variable to the median() function. Find also the minimum and maximum values.

Note: Did you make the strength variable a factor in one of the earlier questions? If so, median(pwords$strength) will not work, because median() needs it to be numeric.
If needed, you can convert the variable back to numeric:
pwords$strength <- as.numeric(pwords$strength).
Or simply do so temporarily:
median(as.numeric(pwords$strength))

Solution


Question 9

Create a plot of the percentages of passwords in each strength level.

Solution


Question 10

Think back to the definition of quartiles.
At what point does the fourth quartile fall?

  1. The maximum value
  2. The first category for which the cumulative percentage is \(\geq 100\%\).
  3. Both of the above

Solution


Question 11

Take a look at the two plots below. Why is one more useful than the other?

Solution