Questionnaire data often comes to us in ‘wide’ format, which is often how we want it for many of the analytical methods we use with questionnaire data. However, working with data in the wide format comes with some specific challenges that generally arise because we have lots and lots of variables.
Below we will walk through some of the common ways we want to wrangle and clean questionnaire data.
Variable names
Very often, the data might come to us with huge long names that correspond to the question being asked.
For instance:
# A tibble: 6 × 53
ppt age height `[I am the life of the party.]` [I feel little concern fo…¹
<dbl> <dbl> <dbl> <chr> <chr>
1 1 22 168 Moderately Accurate Very Inaccurate
2 2 29 186 Neither Accurate nor Inaccurate Moderately Inaccurate
3 3 24 NA Very Inaccurate Very Inaccurate
4 4 26 175 Very Inaccurate Moderately Inaccurate
5 5 28 182 Very Inaccurate Very Inaccurate
6 6 27 159 Moderately Accurate Very Inaccurate
# ℹ abbreviated name: ¹`[I feel little concern for others.]`
# ℹ 48 more variables: `[I am always prepared.]` <chr>,
# `[I get stressed out easily.]` <chr>, `[I have a rich vocabulary.]` <chr>,
# `[I don't talk a lot.]` <chr>, `[I am interested in people.]` <chr>,
# `[I leave my belongings around.]` <chr>,
# `[I am relaxed most of the time.]` <chr>,
# `[I have difficulty understanding abstract ideas.]` <chr>, …
Often, we want our variables to have some easier names to work with, e.g. “q1”, “q2” etc.
Because it’s often important to keep track of what each question actually asks, it can be useful to create a little dictionary to map our new variable names to the current names.
In our data, we have variables called ppt, age and height, and then 50 variables corresponding to 50 different questions.
What I plan to do is rename them to ppt,age,height,q1, q2, …, q50.
what do paste()/paste0() do?
The paste functions allow us to simply “paste” characters together:
By default, paste() will separate these with a space:
paste("i have a dog called", "dougal")
[1] "i have a dog called dougal"
and paste0() will not separate them by anything:
paste0("i have a dog called", "dougal")
[1] "i have a dog calleddougal"
We can paste the same thing to a set of things:
paste("i have a dog called", c("dougal", "rufus","rosie"))
[1] "i have a dog called dougal" "i have a dog called rufus"
[3] "i have a dog called rosie"
Which can be useful for quickly making a set of variable names:
datadict <-tibble(# these are my proposed new namesvariable =c("ppt","age","height", paste0("q", 1:50)),# current namesquestion =names(survdat) )head(datadict)
# A tibble: 6 × 2
variable question
<chr> <chr>
1 ppt ppt
2 age age
3 height height
4 q1 [I am the life of the party.]
5 q2 [I feel little concern for others.]
6 q3 [I am always prepared.]
If we want to, we can write this to a .csv file so that we can keep it somewhere useful.
First things first, all of the questions are scored on a 5 point likert1 scale.
The data has come out of whatever questionnaire software we used and it’s showing the chosen response for each question in words, e.g. “Moderately Accurate”. But we are going to turn them into numbers so that they are easier to work with.
response
score
Very Inaccurate
1
Moderately Inaccurate
2
Neither Accurate nor Inaccurate
3
Moderately Accurate
4
Very Accurate
5
You might be shouting at the screen here that by turning these words into numbers we’re about to turn a set of ordered categories into an interval scale. Is “Moderately Inaccurate” really 2 x “Very Inaccurate”? You might be right to have this intuition - it’s a contentious issue! In psychology, in which questionnaire based methods are everywhere, an often used rule of thumb is that likert data with \(\geq 5\) levels can be treated as if they are continuous without unduly influencing results (see Johnson, D.R., & Creech, J.C. (1983). Ordinal measures in multiple indicator models: A simulation study of categorization error).
Let’s push on and see how we can do this sort of recoding.
First we’ll do it for just one variable. As always in R, there are lots of ways to do the same thing. All of these approaches follows an if-else type logic - “if we have ‘moderately inaccurate’, make it ‘2’, and if we have ’very inaccurate, make it 1, and if …”.
ifelse()
This is the most cumbersome way to do this logic as it requires multiple nested ifelse statements.
To ensure that anything that isn’t one of the 5 valid response options gets to be an NA, we have the final “else” specifying that if it’s none of the prespecified options, make it an NA
case_when() follows this same logic as ifelse() but makes for easier reading (and writing). Each line is a statement saying “if [variable] is equal to [value], then return [new value]”. The final entry here TRUE ~ NA is a bit like saying “and if it is anything else, make it an NA”.
The ifelse() and case_when() functions are much more generalised - we can say “if x is 1 make it 1, and if y is 2 make it 0, and if z is …”.
However, in the task at hand, all of our “if” questions apply to the value of the same variable. The case_match() function is perfectly designed for this:
survdat |>mutate(q1 =case_match(q1, "Very Inaccurate"~1,"Moderately Inaccurate"~2,"Neither Accurate nor Inaccurate"~3,"Moderately Accurate"~4,"Very Accurate"~5) )
If we wanted to collapse levels, the syntax works like so:
survdat |>mutate(q1 =case_match(q1, c("Very Inaccurate","Moderately Inaccurate") ~1,"Neither Accurate nor Inaccurate"~2,c("Moderately Accurate","Very Accurate") ~3 ) )
Applying the same operation across variables
Now that we know how to recode a variable, what if we want to do it for all 50 questions? We don’t want to do it manually, because that would be hugely inefficient:
There are various ways to do this, but the two main ones are by utilising tidyverse’s across() function, or by using the Base R apply() function.
across()
The across() function works inside other tidyverse functions like summarise() and mutate(), and allows us to do the same operation across a set of columns.
The syntax is (with either summarise or mutate):
data |>mutate(across(variables, operation))
As an example, suppose i wanted to summarise across the first three variables (ppt,age, and height) and get the mean. We can either specify the column numbers, or a set of names, or use the : to choose all variables between two columns.
These all do the same:
# A tibble: 1 × 3
ppt age height
<dbl> <dbl> <dbl>
1 38.5 24.3 NA
Note that we get an NA in the mean heights, because there is an NA in the data. So we need to specify that na.rm=TRUE when using the mean() function.
Unfortunately, this won’t work:
survdat |>summarise(across(1:3, mean(na.rm=T)))
Error in `summarise()`:
ℹ In argument: `across(1:3, mean(na.rm = T))`.
Caused by error in `mean.default()`:
! argument "x" is missing, with no default
This is because when we open the brackets of mean(), it looks for the object we want to take the mean of. We can pass this through using ~ and ., like so: (You can kind of read the . as a place-holder for each variable)
# A tibble: 1 × 3
ppt age height
<dbl> <dbl> <dbl>
1 38.5 24.3 169.
Using this syntax, we can apply the recoding we have seen above, but do it for all of the variables from q1 to q50. All at once. like magic!
Note, we’re not summarising here (i.e. we’re not summarising down into a single value, we’re just wanting to mutate the existing variables):
survdat |>mutate(across(q1:q50,~case_match(., "Very Inaccurate"~1,"Moderately Inaccurate"~2,"Neither Accurate nor Inaccurate"~3,"Moderately Accurate"~4,"Very Accurate"~5) ))
Base R (the non-tidyverse stuff) has some really useful functions of the apply() family, that can allow us to do the same thing (apply an operation across a set of columns).
For apply(), we give it an object like a dataframe, and we give it the dimension that we wish to perform the operation over. If we choose 1, it will do it for each row, and if we choose 2, it will do it for each column.
So if we want the mean of the first 3 columns:
apply(survdat[,1:3], MARGIN=2, FUN=mean)
ppt age height
38.5 24.3 NA
Much like with across(), this doesn’t work because mean() now wants x explicitly given:
It’s very common for questionnaires to have a combination of questions for a construct, some of which are positively phrased and some which are negatively phrased. The idea here is to avoid introducing biases such as a bias to agree more than to disagree.
Take, for instance, q1 and q6 in our data, which are both measuring Extraversion:
datadict |>filter(variable %in%c("q1","q6"))
# A tibble: 2 × 2
variable question
<chr> <chr>
1 q1 [I am the life of the party.]
2 q6 [I don't talk a lot.]
Someone very extraverted might respond to q1 with “Very Accurate”, and q6 with “Very Inaccurate”.
Often when we have a set of questions, we will want to reverse code the negatively phrased ones in order to have them all in the same direction. I.e. once they are recoded into numbers, we want the value 5 to represent highly extraverted on bothq1 and q6.
How do we reverse? We want any value of 1 to become 5, any value of 2 to become 4, 3 stays 3, 4 becomes 2, and 5 becomes 1.
We can just subtract the value from 6!
6-c(1,2,3,4,5)
[1] 5 4 3 2 1
So to reverse code q6, we can simply subtract the variable from 6.
But we’re going to want to do this for all the negatively phrased questions in our survey. If we take a look at the inventory page of the IPIP, we can see that these are all the even questions from 2 to 38, and then questions 39, 44, 46, and 49.
That’s a frustrating set!
Let’s find a way to capture those. seq(2,38,2) will give us a sequence from 2 to 38 in steps of 2. So that gives us most of them. We can then just combine that with the other ones, and put “q” in front of each one to give us the variable names:
Very often, people want to calculate a scale score. For instance, if people responded to each of 7 questions for a scale of generalised anxiety, then we would want to calculate a ‘generalised anxiety score’ by adding up their responses 5+1+2+4+4+3+5 = 24.
There are handy functions that can do quick row scoring: rowSums() and rowMeans().
For instance, to calculate each person’s (each row of our data is a person) sum of the questions 1 to 5, we can use any of:
From the inventory page of the IPIP, we can see that the questions are ordered such that questions 1, 6, 11, …, all correspond to Extraversion, questions 2, 7, 12, … all correspond to Agreeableness, and so on.
We can use things like seq() to capture the relevant numbers of each question:
# sequence from 1 to 50 in steps of 5seq(1,50,5)
[1] 1 6 11 16 21 26 31 36 41 46
# sequence from 1 to 50 in steps of 5seq(2,50,5)
[1] 2 7 12 17 22 27 32 37 42 47
Which means we can get the scores for each personality trait using something such as:
We can also find out useful information for each row about things like how much missingness there is:
# is.na will turn all NA entries to "TRUE", and all non-missing values to FALSE. # then we sum them up across the rows and it will give us how many NAs are in each row, across the 50 questions! rowSums( is.na( survdat[, paste0("q",1:50)] ) )
Occasionally, you might find that you want to do row-wise operations that don’t have easy functions like rowSums. In these cases, the tidyverse function rowwise() can be useful in that it applies an operation to one row at a time:
survdat |>rowwise() |>mutate(# find the minimum score given for each rowmaxscore =min(across(q1:q50)), # put the column after the height column.after = height )