Data Wrangling for Surveys & Questionnaires

Questionnaire data often comes to us in ‘wide’ format, which is often how we want it for many of the analytical methods we use with questionnaire data. However, working with data in the wide format comes with some specific challenges that generally arise because we have lots and lots of variables.

Below we will walk through some of the common ways we want to wrangle and clean questionnaire data.

Variable names

Very often, the data might come to us with huge long names that correspond to the question being asked.
For instance:

survdat <- read_csv("https://uoepsy.github.io/data/surveywrangle.csv")
head(survdat)
# A tibble: 6 × 53
    ppt   age height `[I am the life of the party.]` [I feel little concern fo…¹
  <dbl> <dbl>  <dbl> <chr>                           <chr>                      
1     1    22    168 Moderately Accurate             Very Inaccurate            
2     2    29    186 Neither Accurate nor Inaccurate Moderately Inaccurate      
3     3    24     NA Very Inaccurate                 Very Inaccurate            
4     4    26    175 Very Inaccurate                 Moderately Inaccurate      
5     5    28    182 Very Inaccurate                 Very Inaccurate            
6     6    27    159 Moderately Accurate             Very Inaccurate            
# ℹ abbreviated name: ¹​`[I feel little concern for others.]`
# ℹ 48 more variables: `[I am always prepared.]` <chr>,
#   `[I get stressed out easily.]` <chr>, `[I have a rich vocabulary.]` <chr>,
#   `[I don't talk a lot.]` <chr>, `[I am interested in people.]` <chr>,
#   `[I leave my belongings around.]` <chr>,
#   `[I am relaxed most of the time.]` <chr>,
#   `[I have difficulty understanding abstract ideas.]` <chr>, …

Often, we want our variables to have some easier names to work with, e.g. “q1”, “q2” etc.

Because it’s often important to keep track of what each question actually asks, it can be useful to create a little dictionary to map our new variable names to the current names.

In our data, we have variables called ppt, age and height, and then 50 variables corresponding to 50 different questions.
What I plan to do is rename them to ppt,age,height,q1, q2, …, q50.

The paste functions allow us to simply “paste” characters together:
By default, paste() will separate these with a space:

paste("i have a dog called", "dougal")
[1] "i have a dog called dougal"

and paste0() will not separate them by anything:

paste0("i have a dog called", "dougal")
[1] "i have a dog calleddougal"

We can paste the same thing to a set of things:

paste("i have a dog called", c("dougal", "rufus","rosie"))
[1] "i have a dog called dougal" "i have a dog called rufus" 
[3] "i have a dog called rosie" 

Which can be useful for quickly making a set of variable names:

paste0("item_",1:10)
 [1] "item_1"  "item_2"  "item_3"  "item_4"  "item_5"  "item_6"  "item_7" 
 [8] "item_8"  "item_9"  "item_10"
datadict <- 
  tibble(
    # these are my proposed new names
    variable = c("ppt","age","height", 
                 paste0("q", 1:50)),
    # current names
    question = names(survdat)
  )
head(datadict)
# A tibble: 6 × 2
  variable question                           
  <chr>    <chr>                              
1 ppt      ppt                                
2 age      age                                
3 height   height                             
4 q1       [I am the life of the party.]      
5 q2       [I feel little concern for others.]
6 q3       [I am always prepared.]            

If we want to, we can write this to a .csv file so that we can keep it somewhere useful.

write_csv(datadict, file = "survey_dictionary.csv")

Now we have created our record, we can actually change the names of the variables:

names(survdat)[4:53] <- paste0("q", 1:50)
head(survdat)
# A tibble: 6 × 53
    ppt   age height q1    q2    q3    q4    q5    q6    q7    q8    q9    q10  
  <dbl> <dbl>  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1    22    168 Mode… Very… Mode… Mode… Neit… Mode… Mode… Mode… Mode… Neit…
2     2    29    186 Neit… Mode… Mode… Neit… Mode… Mode… Mode… Mode… Neit… Neit…
3     3    24     NA Very… Very… Neit… Mode… Mode… Mode… Mode… Neit… Mode… Very…
4     4    26    175 Very… Mode… Mode… Mode… Very… Mode… Mode… Mode… Mode… Mode…
5     5    28    182 Very… Very… Neit… Mode… Mode… Very… Neit… Neit… Mode… Mode…
6     6    27    159 Mode… Very… Mode… Very… Very… Neit… Mode… Very… Neit… Mode…
# ℹ 40 more variables: q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>, q15 <chr>,
#   q16 <chr>, q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>, q21 <chr>,
#   q22 <chr>, q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>, q27 <chr>,
#   q28 <chr>, q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>, q33 <chr>,
#   q34 <chr>, q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>, q39 <chr>,
#   q40 <chr>, q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, q45 <chr>,
#   q46 <chr>, q47 <chr>, q48 <chr>, q49 <chr>, q50 <chr>

Variable recoding

First things first, all of the questions are scored on a 5 point likert1 scale.

The data has come out of whatever questionnaire software we used and it’s showing the chosen response for each question in words, e.g. “Moderately Accurate”. But we are going to turn them into numbers so that they are easier to work with.

response score
Very Inaccurate 1
Moderately Inaccurate 2
Neither Accurate nor Inaccurate 3
Moderately Accurate 4
Very Accurate 5

You might be shouting at the screen here that by turning these words into numbers we’re about to turn a set of ordered categories into an interval scale. Is “Moderately Inaccurate” really 2 x “Very Inaccurate”? You might be right to have this intuition - it’s a contentious issue! In psychology, in which questionnaire based methods are everywhere, an often used rule of thumb is that likert data with \(\geq 5\) levels can be treated as if they are continuous without unduly influencing results (see Johnson, D.R., & Creech, J.C. (1983). Ordinal measures in multiple indicator models: A simulation study of categorization error).

Let’s push on and see how we can do this sort of recoding.
First we’ll do it for just one variable. As always in R, there are lots of ways to do the same thing. All of these approaches follows an if-else type logic - “if we have ‘moderately inaccurate’, make it ‘2’, and if we have ’very inaccurate, make it 1, and if …”.

ifelse()

This is the most cumbersome way to do this logic as it requires multiple nested ifelse statements.
To ensure that anything that isn’t one of the 5 valid response options gets to be an NA, we have the final “else” specifying that if it’s none of the prespecified options, make it an NA

survdat |>
  mutate(
    q1 = ifelse(q1 == "Very Inaccurate", 1,
                ifelse(q1 == "Moderately Inaccurate", 2, 
                       ifelse( q1 == "Neither Accurate nor Inaccurate", 3, 
                               ifelse(q1 == "Moderately Accurate", 4, 
                                      ifelse(q1 == "Very Accurate", 5, NA)))))
  )
# A tibble: 76 × 53
     ppt   age height    q1 q2         q3    q4    q5    q6    q7    q8    q9   
   <dbl> <dbl>  <dbl> <dbl> <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1     1    22    168     4 Very Inac… Mode… Mode… Neit… Mode… Mode… Mode… Mode…
 2     2    29    186     3 Moderatel… Mode… Neit… Mode… Mode… Mode… Mode… Neit…
 3     3    24     NA     1 Very Inac… Neit… Mode… Mode… Mode… Mode… Neit… Mode…
 4     4    26    175     1 Moderatel… Mode… Mode… Very… Mode… Mode… Mode… Mode…
 5     5    28    182     1 Very Inac… Neit… Mode… Mode… Very… Neit… Neit… Mode…
 6     6    27    159     4 Very Inac… Mode… Very… Very… Neit… Mode… Very… Neit…
 7     7    27    149     2 Very Inac… Mode… Very… Mode… Mode… Mode… Mode… Neit…
 8     8    23    178     2 Neither A… Very… Very… Neit… Neit… Mode… Very… Very…
 9     9    28    176     3 Moderatel… Very… Very… Very… Neit… Very… Neit… Neit…
10    10    26    192     3 Very Inac… Mode… Mode… Mode… Neit… Mode… Mode… Mode…
# ℹ 66 more rows
# ℹ 41 more variables: q10 <chr>, q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>,
#   q15 <chr>, q16 <chr>, q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>,
#   q21 <chr>, q22 <chr>, q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>,
#   q27 <chr>, q28 <chr>, q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>,
#   q33 <chr>, q34 <chr>, q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>,
#   q39 <chr>, q40 <chr>, q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, …

case_when

case_when() follows this same logic as ifelse() but makes for easier reading (and writing). Each line is a statement saying “if [variable] is equal to [value], then return [new value]”. The final entry here TRUE ~ NA is a bit like saying “and if it is anything else, make it an NA”.

survdat |>
  mutate(
    q1 = case_when(
      q1 == "Very Inaccurate" ~ 1,
      q1 == "Moderately Inaccurate" ~ 2,
      q1 == "Neither Accurate nor Inaccurate" ~ 3,
      q1 == "Moderately Accurate" ~ 4,
      q1 == "Very Accurate" ~ 5,
      TRUE ~ NA
    )
  )
# A tibble: 76 × 53
     ppt   age height    q1 q2         q3    q4    q5    q6    q7    q8    q9   
   <dbl> <dbl>  <dbl> <dbl> <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1     1    22    168     4 Very Inac… Mode… Mode… Neit… Mode… Mode… Mode… Mode…
 2     2    29    186     3 Moderatel… Mode… Neit… Mode… Mode… Mode… Mode… Neit…
 3     3    24     NA     1 Very Inac… Neit… Mode… Mode… Mode… Mode… Neit… Mode…
 4     4    26    175     1 Moderatel… Mode… Mode… Very… Mode… Mode… Mode… Mode…
 5     5    28    182     1 Very Inac… Neit… Mode… Mode… Very… Neit… Neit… Mode…
 6     6    27    159     4 Very Inac… Mode… Very… Very… Neit… Mode… Very… Neit…
 7     7    27    149     2 Very Inac… Mode… Very… Mode… Mode… Mode… Mode… Neit…
 8     8    23    178     2 Neither A… Very… Very… Neit… Neit… Mode… Very… Very…
 9     9    28    176     3 Moderatel… Very… Very… Very… Neit… Very… Neit… Neit…
10    10    26    192     3 Very Inac… Mode… Mode… Mode… Neit… Mode… Mode… Mode…
# ℹ 66 more rows
# ℹ 41 more variables: q10 <chr>, q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>,
#   q15 <chr>, q16 <chr>, q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>,
#   q21 <chr>, q22 <chr>, q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>,
#   q27 <chr>, q28 <chr>, q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>,
#   q33 <chr>, q34 <chr>, q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>,
#   q39 <chr>, q40 <chr>, q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, …

case_match()

The ifelse() and case_when() functions are much more generalised - we can say “if x is 1 make it 1, and if y is 2 make it 0, and if z is …”.

However, in the task at hand, all of our “if” questions apply to the value of the same variable. The case_match() function is perfectly designed for this:

survdat |> 
  mutate(
    q1 = case_match(q1, 
                "Very Inaccurate" ~ 1,
                "Moderately Inaccurate" ~ 2,
                "Neither Accurate nor Inaccurate" ~ 3,
                "Moderately Accurate" ~ 4,
                "Very Accurate" ~ 5)
  )
# A tibble: 76 × 53
     ppt   age height    q1 q2         q3    q4    q5    q6    q7    q8    q9   
   <dbl> <dbl>  <dbl> <dbl> <chr>      <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1     1    22    168     4 Very Inac… Mode… Mode… Neit… Mode… Mode… Mode… Mode…
 2     2    29    186     3 Moderatel… Mode… Neit… Mode… Mode… Mode… Mode… Neit…
 3     3    24     NA     1 Very Inac… Neit… Mode… Mode… Mode… Mode… Neit… Mode…
 4     4    26    175     1 Moderatel… Mode… Mode… Very… Mode… Mode… Mode… Mode…
 5     5    28    182     1 Very Inac… Neit… Mode… Mode… Very… Neit… Neit… Mode…
 6     6    27    159     4 Very Inac… Mode… Very… Very… Neit… Mode… Very… Neit…
 7     7    27    149     2 Very Inac… Mode… Very… Mode… Mode… Mode… Mode… Neit…
 8     8    23    178     2 Neither A… Very… Very… Neit… Neit… Mode… Very… Very…
 9     9    28    176     3 Moderatel… Very… Very… Very… Neit… Very… Neit… Neit…
10    10    26    192     3 Very Inac… Mode… Mode… Mode… Neit… Mode… Mode… Mode…
# ℹ 66 more rows
# ℹ 41 more variables: q10 <chr>, q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>,
#   q15 <chr>, q16 <chr>, q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>,
#   q21 <chr>, q22 <chr>, q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>,
#   q27 <chr>, q28 <chr>, q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>,
#   q33 <chr>, q34 <chr>, q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>,
#   q39 <chr>, q40 <chr>, q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, …

If we wanted to collapse levels, the syntax works like so:

survdat |> 
  mutate(
    q1 = case_match(q1, 
                    c("Very Inaccurate","Moderately Inaccurate") ~ 1,
                    "Neither Accurate nor Inaccurate" ~ 2,
                    c("Moderately Accurate","Very Accurate") ~ 3
    )
  )

Applying the same operation across variables

Now that we know how to recode a variable, what if we want to do it for all 50 questions? We don’t want to do it manually, because that would be hugely inefficient:

survdat |> mutate(
  q1 = case_match(q1,
    ...
    ..
    ...
  ),
  q2 = case_match(q2,
    ...
    ..
    ...
  ),
  ..
)

There are various ways to do this, but the two main ones are by utilising tidyverse’s across() function, or by using the Base R apply() function.

across()

The across() function works inside other tidyverse functions like summarise() and mutate(), and allows us to do the same operation across a set of columns.

The syntax is (with either summarise or mutate):

data |>
  mutate(across(variables, operation))

As an example, suppose i wanted to summarise across the first three variables (ppt,age, and height) and get the mean. We can either specify the column numbers, or a set of names, or use the : to choose all variables between two columns.
These all do the same:

survdat |> summarise(across(1:3, mean))
survdat |> summarise(across(c(ppt,age,height), mean))
survdat |> summarise(across(c(ppt:height), mean))
# A tibble: 1 × 3
    ppt   age height
  <dbl> <dbl>  <dbl>
1  38.5  24.3     NA

Note that we get an NA in the mean heights, because there is an NA in the data. So we need to specify that na.rm=TRUE when using the mean() function.
Unfortunately, this won’t work:

survdat |> summarise(across(1:3, mean(na.rm=T)))
Error in `summarise()`:
ℹ In argument: `across(1:3, mean(na.rm = T))`.
Caused by error in `mean.default()`:
! argument "x" is missing, with no default

This is because when we open the brackets of mean(), it looks for the object we want to take the mean of. We can pass this through using ~ and ., like so: (You can kind of read the . as a place-holder for each variable)

survdat |> summarise(across(1:3, ~mean(x=., na.rm=T)))
# A tibble: 1 × 3
    ppt   age height
  <dbl> <dbl>  <dbl>
1  38.5  24.3   169.

Using this syntax, we can apply the recoding we have seen above, but do it for all of the variables from q1 to q50. All at once. like magic!
Note, we’re not summarising here (i.e. we’re not summarising down into a single value, we’re just wanting to mutate the existing variables):

survdat |> 
  mutate(across(q1:q50,
                ~case_match(., 
                "Very Inaccurate" ~ 1,
                "Moderately Inaccurate" ~ 2,
                "Neither Accurate nor Inaccurate" ~ 3,
                "Moderately Accurate" ~ 4,
                "Very Accurate" ~ 5)
  ))
# A tibble: 76 × 53
     ppt   age height    q1    q2    q3    q4    q5    q6    q7    q8    q9
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168     4     1     4     4     3     2     4     4     4
 2     2    29    186     3     2     4     3     4     2     4     2     3
 3     3    24     NA     1     1     3     2     2     4     4     3     4
 4     4    26    175     1     2     4     4     5     2     4     4     2
 5     5    28    182     1     1     3     4     4     5     3     3     2
 6     6    27    159     4     1     2     1     1     3     4     1     3
 7     7    27    149     2     1     4     5     4     4     4     4     3
 8     8    23    178     2     3     5     5     3     3     2     1     1
 9     9    28    176     3     2     5     5     5     3     5     3     3
10    10    26    192     3     1     4     4     4     3     4     4     4
# ℹ 66 more rows
# ℹ 41 more variables: q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>, q14 <dbl>,
#   q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>, q20 <dbl>,
#   q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>, q26 <dbl>,
#   q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>, q32 <dbl>,
#   q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>, q38 <dbl>,
#   q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, q44 <dbl>, …

apply()

Base R (the non-tidyverse stuff) has some really useful functions of the apply() family, that can allow us to do the same thing (apply an operation across a set of columns).

For apply(), we give it an object like a dataframe, and we give it the dimension that we wish to perform the operation over. If we choose 1, it will do it for each row, and if we choose 2, it will do it for each column.
So if we want the mean of the first 3 columns:

apply(survdat[,1:3], MARGIN=2, FUN=mean)
     ppt      age   height 
38.50000 24.28947       NA 

Much like with across(), this doesn’t work because mean() now wants x explicitly given:

apply(survdat[,1:3], MARGIN=2, FUN=mean(na.rm=TRUE))
Error in mean.default(na.rm = TRUE): argument "x" is missing, with no default

But instead of the ~ and . stuff in tidyverse, the apply() wants us to explicitly define a little function, using function().

apply(survdat[,1:3], MARGIN=2, FUN=function(x) mean(x, na.rm=TRUE))
      ppt       age    height 
 38.50000  24.28947 169.36000 

(it’s actually just convention that people use function(x). we could use function(kermit) mean(kermit, na.rm=TRUE) and it would still work!)

So how do we do the recoding of all our questions? We simply apply a function like case_match() to each column!

survdat[,4:53] <- apply(survdat[,4:53], MARGIN=2, FUN=function(x) 
  case_match(x, 
             "Very Inaccurate" ~ 1,
             "Moderately Inaccurate" ~ 2,
             "Neither Accurate nor Inaccurate" ~ 3,
             "Moderately Accurate" ~ 4,
             "Very Accurate" ~ 5)
  )
survdat
# A tibble: 76 × 53
     ppt   age height    q1    q2    q3    q4    q5    q6    q7    q8    q9
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168     4     1     4     4     3     2     4     4     4
 2     2    29    186     3     2     4     3     4     2     4     2     3
 3     3    24     NA     1     1     3     2     2     4     4     3     4
 4     4    26    175     1     2     4     4     5     2     4     4     2
 5     5    28    182     1     1     3     4     4     5     3     3     2
 6     6    27    159     4     1     2     1     1     3     4     1     3
 7     7    27    149     2     1     4     5     4     4     4     4     3
 8     8    23    178     2     3     5     5     3     3     2     1     1
 9     9    28    176     3     2     5     5     5     3     5     3     3
10    10    26    192     3     1     4     4     4     3     4     4     4
# ℹ 66 more rows
# ℹ 41 more variables: q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>, q14 <dbl>,
#   q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>, q20 <dbl>,
#   q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>, q26 <dbl>,
#   q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>, q32 <dbl>,
#   q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>, q38 <dbl>,
#   q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, q44 <dbl>, …

We can also define our custom functions outside of the use of apply()/across(), which can be useful if we have more complex functions.

For example, we can do the same recoding and then standardise each variable, by explicitly defining a function that does that:

recodemysurvey <- function(question){
  score = case_match(question, 
             "Very Inaccurate" ~ 1,
             "Moderately Inaccurate" ~ 2,
             "Neither Accurate nor Inaccurate" ~ 3,
             "Moderately Accurate" ~ 4,
             "Very Accurate" ~ 5)
  score = (score - mean(score, na.rm = TRUE)) / sd(score, na.rm = TRUE)
  return(score)
}

And then using that function, either with:

survdat |>
  mutate(across(q1:q50, recodemysurvey))

or with:

apply(survdat[,4:53], 2, recodemysurvey)

Reverse coding

It’s very common for questionnaires to have a combination of questions for a construct, some of which are positively phrased and some which are negatively phrased. The idea here is to avoid introducing biases such as a bias to agree more than to disagree.

Take, for instance, q1 and q6 in our data, which are both measuring Extraversion:

datadict |>
  filter(variable %in% c("q1","q6"))
# A tibble: 2 × 2
  variable question                     
  <chr>    <chr>                        
1 q1       [I am the life of the party.]
2 q6       [I don't talk a lot.]        

Someone very extraverted might respond to q1 with “Very Accurate”, and q6 with “Very Inaccurate”.

Often when we have a set of questions, we will want to reverse code the negatively phrased ones in order to have them all in the same direction. I.e. once they are recoded into numbers, we want the value 5 to represent highly extraverted on both q1 and q6.

How do we reverse? We want any value of 1 to become 5, any value of 2 to become 4, 3 stays 3, 4 becomes 2, and 5 becomes 1.
We can just subtract the value from 6!

6-c(1,2,3,4,5)
[1] 5 4 3 2 1

So to reverse code q6, we can simply subtract the variable from 6.

survdat |> 
  mutate(q6 = 6-q6)
# A tibble: 76 × 53
     ppt   age height    q1    q2    q3    q4    q5    q6    q7    q8    q9
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168     4     1     4     4     3     4     4     4     4
 2     2    29    186     3     2     4     3     4     4     4     2     3
 3     3    24     NA     1     1     3     2     2     2     4     3     4
 4     4    26    175     1     2     4     4     5     4     4     4     2
 5     5    28    182     1     1     3     4     4     1     3     3     2
 6     6    27    159     4     1     2     1     1     3     4     1     3
 7     7    27    149     2     1     4     5     4     2     4     4     3
 8     8    23    178     2     3     5     5     3     3     2     1     1
 9     9    28    176     3     2     5     5     5     3     5     3     3
10    10    26    192     3     1     4     4     4     3     4     4     4
# ℹ 66 more rows
# ℹ 41 more variables: q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>, q14 <dbl>,
#   q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>, q20 <dbl>,
#   q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>, q26 <dbl>,
#   q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>, q32 <dbl>,
#   q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>, q38 <dbl>,
#   q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, q44 <dbl>, …

But we’re going to want to do this for all the negatively phrased questions in our survey. If we take a look at the inventory page of the IPIP, we can see that these are all the even questions from 2 to 38, and then questions 39, 44, 46, and 49.

That’s a frustrating set!
Let’s find a way to capture those. seq(2,38,2) will give us a sequence from 2 to 38 in steps of 2. So that gives us most of them. We can then just combine that with the other ones, and put “q” in front of each one to give us the variable names:

neg_items <- paste0("q", c(seq(2,38,2),39,44,46,49))
neg_items
 [1] "q2"  "q4"  "q6"  "q8"  "q10" "q12" "q14" "q16" "q18" "q20" "q22" "q24"
[13] "q26" "q28" "q30" "q32" "q34" "q36" "q38" "q39" "q44" "q46" "q49"

using across()

survdat |> 
  mutate(across(all_of(neg_items),~6-.))
# A tibble: 76 × 53
     ppt   age height    q1    q2    q3    q4    q5    q6    q7    q8    q9
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168     4     5     4     2     3     4     4     2     4
 2     2    29    186     3     4     4     3     4     4     4     4     3
 3     3    24     NA     1     5     3     4     2     2     4     3     4
 4     4    26    175     1     4     4     2     5     4     4     2     2
 5     5    28    182     1     5     3     2     4     1     3     3     2
 6     6    27    159     4     5     2     5     1     3     4     5     3
 7     7    27    149     2     5     4     1     4     2     4     2     3
 8     8    23    178     2     3     5     1     3     3     2     5     1
 9     9    28    176     3     4     5     1     5     3     5     3     3
10    10    26    192     3     5     4     2     4     3     4     2     4
# ℹ 66 more rows
# ℹ 41 more variables: q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>, q14 <dbl>,
#   q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>, q20 <dbl>,
#   q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>, q26 <dbl>,
#   q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>, q32 <dbl>,
#   q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>, q38 <dbl>,
#   q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, q44 <dbl>, …

using apply()

survdat[,neg_items] <- apply(survdat[,neg_items], 2, function(x) 6-x)
survdat
# A tibble: 76 × 53
     ppt   age height    q1    q2    q3    q4    q5    q6    q7    q8    q9
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168     4     5     4     2     3     4     4     2     4
 2     2    29    186     3     4     4     3     4     4     4     4     3
 3     3    24     NA     1     5     3     4     2     2     4     3     4
 4     4    26    175     1     4     4     2     5     4     4     2     2
 5     5    28    182     1     5     3     2     4     1     3     3     2
 6     6    27    159     4     5     2     5     1     3     4     5     3
 7     7    27    149     2     5     4     1     4     2     4     2     3
 8     8    23    178     2     3     5     1     3     3     2     5     1
 9     9    28    176     3     4     5     1     5     3     5     3     3
10    10    26    192     3     5     4     2     4     3     4     2     4
# ℹ 66 more rows
# ℹ 41 more variables: q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>, q14 <dbl>,
#   q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>, q20 <dbl>,
#   q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>, q26 <dbl>,
#   q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>, q32 <dbl>,
#   q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>, q38 <dbl>,
#   q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, q44 <dbl>, …

Row scoring

Very often, people want to calculate a scale score. For instance, if people responded to each of 7 questions for a scale of generalised anxiety, then we would want to calculate a ‘generalised anxiety score’ by adding up their responses 5+1+2+4+4+3+5 = 24.

There are handy functions that can do quick row scoring: rowSums() and rowMeans().

For instance, to calculate each person’s (each row of our data is a person) sum of the questions 1 to 5, we can use any of:

# tidyverse
survdat |> select(q1:q5) |> rowSums()
# indexing:
rowSums(survdat[,c("q1","q2","q3","q4","q5")])
# concise indexing
rowSums(survdat[, paste0("q",1:5)])
 [1] 18 18 15 16 15 17 16 14 18 18 17  8 16 13 11 14 16 18 20 17 13 16 14 17 17
[26] 18 19 14 16 12 18 20 18 17 15 15  7 12 15 15 19 17 22 20 20 19 16 14 17 10
[51] 15 18 19 17 10 19 16 12 22 16 12 15 14 19  9  9 17 20 15 17 19 16 18 16 13
[76] 14

From the inventory page of the IPIP, we can see that the questions are ordered such that questions 1, 6, 11, …, all correspond to Extraversion, questions 2, 7, 12, … all correspond to Agreeableness, and so on.

We can use things like seq() to capture the relevant numbers of each question:

# sequence from 1 to 50 in steps of 5
seq(1,50,5) 
 [1]  1  6 11 16 21 26 31 36 41 46
# sequence from 1 to 50 in steps of 5
seq(2,50,5) 
 [1]  2  7 12 17 22 27 32 37 42 47

Which means we can get the scores for each personality trait using something such as:

survdat$extr <- rowSums(survdat[, paste0("q",seq(1,50,5))])
survdat$agree <- rowSums(survdat[, paste0("q",seq(2,50,5))])
survdat$consc <- rowSums(survdat[, paste0("q",seq(3,50,5))])
survdat$emot_s <- rowSums(survdat[, paste0("q",seq(4,50,5))])
survdat$imagin <- rowSums(survdat[, paste0("q",seq(5,50,5))])

And we can quickly examine all our scale scores:

library(psych)
survdat |> 
  select(extr:imagin) |>
  pairs.panels()

More Rowwise Stuff

We can also find out useful information for each row about things like how much missingness there is:

# is.na will turn all NA entries to "TRUE", and all non-missing values to FALSE. 
# then we sum them up across the rows and it will give us how many NAs are in each row, across the 50 questions!  
rowSums( is.na( survdat[, paste0("q",1:50)] ) )
 [1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Occasionally, you might find that you want to do row-wise operations that don’t have easy functions like rowSums. In these cases, the tidyverse function rowwise() can be useful in that it applies an operation to one row at a time:

survdat |> 
  rowwise() |>
  mutate(
    # find the minimum score given for each row
    maxscore = min(across(q1:q50)), 
    # put the column after the height column
    .after = height
  )
# A tibble: 76 × 59
# Rowwise: 
     ppt   age height maxscore    q1    q2    q3    q4    q5    q6    q7    q8
   <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1    22    168        1     4     5     4     2     3     4     4     2
 2     2    29    186        2     3     4     4     3     4     4     4     4
 3     3    24     NA        1     1     5     3     4     2     2     4     3
 4     4    26    175        1     1     4     4     2     5     4     4     2
 5     5    28    182        1     1     5     3     2     4     1     3     3
 6     6    27    159        1     4     5     2     5     1     3     4     5
 7     7    27    149        1     2     5     4     1     4     2     4     2
 8     8    23    178        1     2     3     5     1     3     3     2     5
 9     9    28    176        1     3     4     5     1     5     3     5     3
10    10    26    192        1     3     5     4     2     4     3     4     2
# ℹ 66 more rows
# ℹ 47 more variables: q9 <dbl>, q10 <dbl>, q11 <dbl>, q12 <dbl>, q13 <dbl>,
#   q14 <dbl>, q15 <dbl>, q16 <dbl>, q17 <dbl>, q18 <dbl>, q19 <dbl>,
#   q20 <dbl>, q21 <dbl>, q22 <dbl>, q23 <dbl>, q24 <dbl>, q25 <dbl>,
#   q26 <dbl>, q27 <dbl>, q28 <dbl>, q29 <dbl>, q30 <dbl>, q31 <dbl>,
#   q32 <dbl>, q33 <dbl>, q34 <dbl>, q35 <dbl>, q36 <dbl>, q37 <dbl>,
#   q38 <dbl>, q39 <dbl>, q40 <dbl>, q41 <dbl>, q42 <dbl>, q43 <dbl>, …

Alternatively, apply() to the rescue again!
MARGIN = 1 here applies the operation across rows:

apply(survdat[,paste0("q",1:50)], MARGIN = 1, FUN = min)
 [1]  1  2  1  1  1  1  1  1  1  1 NA  1  1  1 NA  1  1  1  2  1  1  1  1  1  1
[26]  1  1  1  2 NA  1  1  1  1  1  1  1  1  1  1  2  2  2  2  1  1  1 NA  1  1
[51]  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1  2  1  1  1  1  1  1
[76]  1

Footnotes

  1. Technically this is pronounced “LICK-URT” and not “LIE-KURT”. It’s named after Dr Rensis Likert, and that’s how he pronounced his name!↩︎