Introduction to psychometric testing

Data Analysis for Psychology in R 3

Dr John Martindale

Psychology, PPLS

University of Edinburgh

Course Overview

multilevel modelling
working with group structured data
regression refresher
introducing multilevel models
more complex groupings
centering, assumptions, and diagnostics
recap
factor analysis
working with multi-item measures
what is a psychometric test?
using composite scores to simplify data (PCA)
uncovering underlying constructs (EFA)
more EFA
recap

What is measurement?

“The process of assigning numbers to represent properties” (Campbell, 1920)

“The assignment of numbers to objects or events according to rules” (Stevens, 1947)

  • Measurement is the foundation of science

  • Scientists use measurement tools to produce quantitative data

  • Up to now you may have taken the composition of your data sets for granted. Key question in this block:

    • What do our numbers represent?

What is psychological measurement?

  • Many psychological phenomena cannot be observed directly: thoughts, feelings, behaviours etc.

  • Phenomena are based in natural language that people discuss every day (e.g., aggression, intelligence)

  • Scientific definitions often diverge from non-scientific definitions

    • Definitions are fuzzy and lack consensus
  • Result: Confusion and complexity in measurement

What are constructs?

  • Constructs: useful abstractions about the world that are derived from natural observations


  • Simplify the world and provide a shared language for scientific study

    • Can be used to study the same phenomena across diverse contexts

    • Example: What does Leadership look like in hunter-gatherer societies, in the military and in the music industry?

Example of complexity: Life satisfaction

Are older people more satisfied with life? We have data on 112 people from 12 different dwellings (cities/towns) in Scotland: their ages and some measure of life satisfaction.

library(tidyverse)  # provides read_csv() and the pipe
d3 <- read_csv("https://uoepsy.github.io/data/lmm_lifesatscot.csv")
head(d3)
# A tibble: 6 × 4
    age lifesat dwelling size 
  <dbl>   <dbl> <chr>    <chr>
1    40      31 Aberdeen >100k
2    45      56 Glasgow  >100k
3    40      51 Glasgow  >100k
4    40      55 Dundee   >100k
5    40      41 Dundee   >100k
6    55      69 Perth    <100k


  • Did anyone stop to think - What is lifesat (i.e., life satisfaction)?

  • Discussion: define life satisfaction.

Constructs, measures and observations

Impact of differences in perspectives

  • Different operationalisations make it difficult to consolidate findings:

    • Jingle fallacy - Using same name to denote different things
    • Jangle fallacy - Using different names to denote same thing
  • “Nobody wants to use somebody else’s toothbrush” (Elson et al., 2023)

Psychometrics

  • Scientific discipline concerned with the construction of psychological measurements

  • Connects observable phenomena (e.g., item responses) to theoretical attributes (e.g., life satisfaction)

    • Theoretical constructs are defined by their domains of observable behaviours


  • Psychometricians study the conceptual and statistical foundations of constructs, the measures that operationalise them, and the models used to represent them

  • Applications across many sciences (e.g., psychology, behavioural genetics, neuroscience, political science, medicine)

Types of psychometric tests

  • Tests of typical performance

    • What participants do on a regular basis
    • Examples: Interests, values, personality traits, political beliefs
    • Real-world example: “Which Harry Potter house are you in?”
  • Tests of maximal performance

    • What participants can do when exerting maximum effort
    • Examples: Aptitude tests, exams, IQ tests
    • Real-world example: Duolingo, Wordle, revision apps


  • For the most part, the same statistical models are used to evaluate both

Applications of psychometric tests

  • Education

    • Aptitude / ability tests (i.e., standard school tests)
    • Vocational tests
  • Business

    • Selection (e.g., personality, skills)
    • Development (e.g., interests, leadership)
    • Performance (e.g., well-being, engagement)
  • Health

    • Mental health symptoms e.g., anxiety
    • Clinical diagnoses e.g., personality disorders
  • Key takeaway: People make life-changing decisions using psychometric evidence every day

Criteria for good psychometrics

  • Psychometric tests have many important applications, so they must:

    • Assess what they are supposed to assess
    • Be consistent and reliable
    • Produce interpretable scores
    • Be relevant for specific populations
    • Differentiate between people in a fair way


  • In this course we will cover the first three and how psychologists evaluate them; the last two are context-dependent

Diagrammatic conventions

  • In this section of the course we distinguish between variables that are:

    • Square = Observed / measured variable

    • Circle = Latent / unobserved

    • Two-headed arrow = Covariance

    • Single headed arrow = Regression path

Representational not actual measurement

  • We cannot take our ruler and measure life satisfaction

  • Create tests and hope responses tell us something about the construct we are interested in


  • Important: Our data are only ever item responses, not the construct itself


  • Psychometrics is pseudo-representational: useful representation of target construct rather than ‘ground truth’ of universe

Measurement error

All measurement is befuddled by error

McNemar (1946, p.294)

  • Every measurement we take contains some error; the goal is to minimise it

  • Error can be:

    • Random = Unpredictable. Inconsistent values due to something specific to the measurement occasion

    • Systematic = Predictable. Consistent alteration of the observed score due to something constant about the measurement tool

  • Can you think of any examples?

Unit of analysis: correlations and covariance

  • Unit of analysis is covariance

    • Variance = Deviation around the mean of a single variable
    • Covariance = Representation of how two variables change together
    • Correlation = Standardised version of covariance (a short sketch follows this list)


  • We are trying to explain patterns in the correlation matrix

    • i.e. among a set of items
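
A minimal sketch of these three quantities in base R, using two made-up vectors of item scores (the numbers are purely illustrative):

x <- c(3, 4, 5, 3, 1, 4)         # toy scores on one item
y <- c(3, 4, 5, 4, 1, 3)         # toy scores on another item
var(x)                           # variance: spread around the mean of x
cov(x, y)                        # covariance: how x and y change together
cov(x, y) / (sd(x) * sd(y))      # correlation: covariance standardised by the SDs
cor(x, y)                        # the same value straight from cor()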

Can you see any patterns of inter-relations in the correlation matrix below?


lsat_data <- read_csv("data/lifesat.csv")
round(cor(lsat_data),2)
        lfsat_1 lfsat_2 lfsat_3 lfsat_4 lfsat_5 lfsat_6
lfsat_1    1.00    0.72    0.71    0.00    0.03    0.01
lfsat_2    0.72    1.00    0.73   -0.01    0.01   -0.03
lfsat_3    0.71    0.73    1.00   -0.01    0.01   -0.01
lfsat_4    0.00   -0.01   -0.01    1.00    0.72    0.71
lfsat_5    0.03    0.01    0.01    0.72    1.00    0.75
lfsat_6    0.01   -0.03   -0.01    0.71    0.75    1.00

Scale scores

Classical test theory (CTT)

  • Classical test theory describes scores on any measure as a combination of signal (i.e., true score) and noise (i.e., error):

\[ \text{Observed score} = \text{True score} + \text{Error} \]

  • Our test measures some ability or trait, and in the world there is a “true” score on this test for each individual

  • The observed score is unlikely to perfectly reflect the participant’s true value on the construct

CTT diagram

  • True score = Variance in the score explained by the target construct

  • Error = Variance in the score explained by other things (i.e., random or systematic)

  • Observed score = What we actually record in the dataset

  • Goal of testing is to minimise error in observed scores

Scoring in CTT

  • Items are summed or averaged (i.e., the mean) to create a score for the target construct

  • Example of how to create mean scores in R:

lsat_data <- lsat_data %>%
  rowwise() %>%   # compute means within each participant (row)
  mutate(
    lfsat_mean1 = mean(c(lfsat_1, lfsat_2, lfsat_3)),  # scale 1: items 1-3
    lfsat_mean2 = mean(c(lfsat_4, lfsat_5, lfsat_6))   # scale 2: items 4-6
  )

head(lsat_data)
# A tibble: 6 × 8
# Rowwise: 
  lfsat_1 lfsat_2 lfsat_3 lfsat_4 lfsat_5 lfsat_6 lfsat_mean1 lfsat_mean2
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>       <dbl>       <dbl>
1       3       3       4       2       3       2        3.33        2.33
2       4       4       3       3       4       4        3.67        3.67
3       5       5       4       1       2       2        4.67        1.67
4       3       4       4       4       3       4        3.67        3.67
5       1       1       1       2       2       3        1           2.33
6       4       3       3       2       2       2        3.33        2   

How do we evaluate scores?

  • Scores are created by aggregating responses to multiple items

  • Groups of items measuring the same construct are referred to as scales

  • One or more related scales are administered as a measure, test, or battery

    • Administering multiple scales makes a measure multidimensional
  • How do we assess the performance of our scales?

Evaluating psychometric tests: Reliability

What is reliability?

  • Consistency of test results across multiple administrations

  • Important: A test can be highly reliable but not at all valid, depending on the construct

    • A tape measure is a reliable measure of length, but not of leadership!


  • Reliability is thus less ambiguous than validity

    • Validity is, to some degree, “in the eye of the beholder”

Parallel tests

  • Charles Spearman was the first to note that, under certain assumptions (i.e., the tests are truly parallel: each item measures the construct to the same extent), the correlation between two parallel tests provides an estimate of reliability


  • Parallel tests can come from several sources

    • Time tests were administered (test-retest)

    • Multiple raters (inter-rater reliability)

    • Items (alternate forms, split-half, internal consistency)

Test-retest reliability

  • Correlation between tests taken at 2+ points in time (assumed to be equivalent)

  • A cornerstone of test assessment that appears in many test manuals, but there are some tricky conceptual questions:

    • What’s the appropriate time between when measures are taken?

    • How stable should the construct be if we are to consider it a trait?

Inter-rater reliability

  • Ask a set of judges to rate a set of targets, compare similarity

    • Get friends to rate the personality of a family member

    • Get zoo keepers to rate the subjective well-being of an animal

  • We can determine how consistent raters are across:

    • Their individual estimates (i.e., across targets)

    • The reliability of the average estimate based on the judges’ ratings (i.e., across raters), as in the sketch below
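
A minimal sketch using the psych package’s ICC() function with a hypothetical ratings matrix (rows = targets, columns = judges; the numbers are invented). Its output includes intraclass correlations for both single ratings and the average of the judges:

library(psych)
# four targets each rated by three judges
ratings <- matrix(c(4, 5, 4,
                    2, 3, 2,
                    5, 5, 4,
                    1, 2, 1), ncol = 3, byrow = TRUE)
ICC(ratings)  # ICCs for single raters and for the k-rater average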

Alternate forms and split-half reliability

  • Correlation between two variants of a test:

    • Same items in different order (randomise the stimuli)

    • Tests with similar, but not identical, content (e.g., tests with a fixed number of numerical problems)

  • Assumption: If the tests were perfectly reliable, they would correlate perfectly (they won’t; the observed correlation estimates the reliability)


  • Split-half reliability: Split test into equal halves, score them up and correlate the halves
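
A minimal sketch of split-half reliability for a hypothetical six-item scale measuring a single construct (the data frame scale_data, items q1 to q6, and the odd/even split are all assumptions for illustration):

half_a <- rowMeans(scale_data[, c("q1", "q3", "q5")])  # odd-numbered items
half_b <- rowMeans(scale_data[, c("q2", "q4", "q6")])  # even-numbered items
r_half <- cor(half_a, half_b)           # correlation between the two halves
r_full <- (2 * r_half) / (1 + r_half)   # Spearman-Brown step-up

The Spearman-Brown step-up is needed because each half is only half the length, and therefore less reliable, than the full test.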

Internal consistency

  • Extent to which items correlate with each other within a scale

  • Most common assessment of reliability

    • Easy and cheap, all at one time-point with one set of items
  • Calculated through some form of the ratio: average covariance / average (variance + covariance); a worked example follows this list


  • Multiple ways to estimate:

    • Cronbach’s Alpha = Assumes all items are equivalent measures of the construct
    • McDonald’s Omega = Does not; more on this in W4
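
As a minimal sketch, Cronbach’s alpha can be computed by hand from the standard ratio-of-variances formula, here for the first three life-satisfaction items (in practice, functions such as psych::alpha() report it alongside item-level diagnostics):

items <- lsat_data[, c("lfsat_1", "lfsat_2", "lfsat_3")]
k     <- ncol(items)
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
(k / (k - 1)) * (1 - sum(sapply(items, var)) / var(rowSums(items)))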

Evaluating psychometric tests: Validity

What is validity?

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of the test scores for the proposed uses that are evaluated, not the test itself.

Standards for Educational and Psychological Testing

Debates about the definition

Whether a test really measures what it purports to measure (Kelley, 1927)

How well a test does the job it is employed to do. The same test may be used for … different purposes and its validity may be high for one, moderate for another and low for a third (Cureton, 1951)

Validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989)

A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes (Borsboom et al., 2004)

Evidence for validity

  • Validity is more nebulous and debated than reliability

    • Debates about how to define validity lead to questions about what constitutes evidence

Contemporary perspective:

The goal of psychometric development is to generate a psychometric that accurately measures the intended construct, as precisely as possible, and that uses of the psychometric are appropriate for the given purpose, population, and context.

(Hughes, 2018, p. 22)

Content evidence

  • Content / construct validity

    • A test should contain only content relevant to the intended construct

    • It should measure what it was intended to measure

    • “Easy” for questionnaire items, hard for tasks / implicit measures


  • Face validity

    • i.e., for those taking the test, does the test “appear to” measure what it was designed to measure?

Response processes

All measurement in a test occurs between the participant reading the item and selecting a response

  • Everything we have discussed so far is analysed after measurement

  • We also need to assess content / construct validity during the data-generating process (i.e., while participants complete the questionnaire)


  • Can do this using qualitative think-aloud-protocol interviews:

    • Participants complete the questionnaire, select a response option, and verbalise their reasoning / self-construal / opinion

Response processes

  • Example of a think-aloud-protocol output: [figure omitted]

Structural validity

  • Many constructs are multi-dimensional i.e. they have multiple underlying components

    • e.g. Narcissism = Grandiosity + Vulnerability + Antagonism


  • Goal is to assess whether the items ‘fit’ this structure

  • Then assess stability of structure across samples / time / groups

  • Most commonly assessed using exploratory / confirmatory factor analysis

    • Distinction explained in week 4

Relationships with other constructs

  • Convergent: Measure should have high correlations with other measures of the same construct

  • Discriminant: Measure should have low correlations with measures of different constructs

  • Nomological net:

    • Measure should show the expected (positive/negative) correlations with other constructs

    • Also, some measures should vary depending on manipulations (e.g., a measure of “stress” should be higher when someone is about to take an exam)

Relationships with other constructs

  • Consider relations in terms of temporal sequence

  • Concurrent validity: Correlations with contemporaneous measures e.g.:

    • Neuroticism and subjective well-being
    • Extraversion and leadership
  • Predictive validity: Related to expected future outcomes e.g.:

    • IQ and health
    • Agreeableness and future income

Consequences

  • Perhaps most controversial aspect of current validity discussions

  • Evaluate the test based on what it reveals (e.g., differences between groups) and on the decisions made from its results

  • Should potential consequences of test use be considered part of the evidence for test’s validity?

  • Important questions for the use of tests

    • Is my measure systematically biased, or is it fair for all groups of test takers?
    • Does the bias have social ramifications?

Relationship between reliability and validity

  • Reliability: relation of true score to observed score

  • Validity: correlations with other measures play a key role


  • Low reliability: Correlations between observed variables are attenuated and underestimated

  • Reliability is thus the ceiling for validity: tests cannot correlate with each other more than they correlate with themselves
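
This ceiling is captured by Spearman’s classic correction for attenuation: the observed correlation between two measures equals the true-score correlation shrunk by the square roots of their reliabilities,

\[ r_{xy} = r_{x_T y_T} \sqrt{r_{xx} \, r_{yy}} \]

so even a perfect true-score correlation (\(r_{x_T y_T} = 1\)) cannot produce an observed correlation above \(\sqrt{r_{xx} r_{yy}}\).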

Where can you find this information?

  • Test manuals (should) contain all information needed to assess reliability and validity

  • Papers describing new tests and papers investigating existing measures in different groups, languages, contexts, etc.

    • Assessment
    • Psychological Assessment
    • European Journal of Psychological Assessment
    • Organisational Research Methods
    • Personality journals
  • Papers describing new ways to establish reliability, validity, etc found in:

    • Behaviour Research Methods
    • Psychometrika
    • Multivariate Behavioural Research

Other methods of scoring

Assumption of classical test theory

  • Assumption of classical test theory: indicators are equivalent (i.e., all items measure the construct to the same extent)

    • Is this a realistic assumption for psychological tests?
  • Do “I am never dissatisfied with my life” and “I am relatively happy about my life” both measure life satisfaction to the same extent?

Unit-weighted scores

  • A mean score from a set of items assumes all items contribute equally

  • Equivalent to multiplying each observation by 1 before summing

  • Do both items contribute the same to life satisfaction?

Weighted scores

  • Weighted scores are created by multiplying each observation by a unique weight before averaging / summing (see the sketch after this list)

  • Allows each item to contribute in a unique way

  • A more realistic representation of psychometric items?
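
A minimal sketch contrasting the two approaches for the first three life-satisfaction items (the weights are invented for illustration; in practice they come from a dimension-reduction model, as discussed next):

items <- as.matrix(lsat_data[, c("lfsat_1", "lfsat_2", "lfsat_3")])
unit_score     <- items %*% c(1, 1, 1) / 3    # every item weighted 1: the mean
weighted_score <- items %*% c(0.5, 0.3, 0.2)  # each item contributes uniquely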

How do we identify the weights?

  • Dimension reduction models reveal relationships between items and underlying dimensions (i.e., aggregations of multiple items)

  • Two most common in psychology: Principal components analysis (PCA) and Factor analysis (FA)

    • PCA = items → component

    • FA = latent variable → items

  • Important: We also use these techniques to assign items to scales; this is the focus next week (a quick preview below)
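
As a quick preview, base R’s prcomp() can extract such weights; a minimal sketch on the six life-satisfaction items (interpreting these is next week’s focus):

pca <- prcomp(lsat_data[, paste0("lfsat_", 1:6)], scale. = TRUE)
pca$rotation[, 1:2]   # item weights (loadings) on the first two components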

Summary

This week

  • Psychometrics is the study of how to measure psychological constructs

  • We create scale scores by aggregating indicators (i.e., items)

  • Psychometric scores / tests evaluated using:

    • Reliability: How consistent is the measurement?
    • Validity: Am I measuring what I want to measure?
  • Multiple ways of aggregating items / creating scores: sums / means, factor analysis, principal components analysis