Introduction to psychometric testing

Data Analysis for Psychology in R 3

Dr John Martindale

Psychology, PPLS

University of Edinburgh

Course Overview

multilevel modelling
working with group structured data
regression refresher
introducing multilevel models
more complex groupings
centering, assumptions, and diagnostics
recap
factor analysis
working with multi-item measures
what is a psychometric test?
using composite scores to simplify data (PCA)
uncovering underlying constructs (EFA)
more EFA
recap

What is measurement?

“The process of assigning numbers to represent properties” (Campbell, 1920)

“The assignment of numbers to objects or events according to rules” (Stevens, 1947)

  • Measurement is the foundation of science

  • Scientists use measurement tools to produce quantitative data

  • Up to now you may have taken the composition of your data sets for granted. Key question in this block:

    • What do our numbers represent?

What is psychological measurement?

  • Many psychological phenomena cannot be observed directly: thoughts, feelings, behaviours etc.

  • Phenomena are based in natural language that people discuss every day (e.g., aggression, intelligence)

  • Scientific definitions often diverge from non-scientific definitions

    • Definitions are fuzzy and lack consensus
  • Result: Confusion and complexity in measurement

What are constructs?

  • Constructs: useful abstractions about the world that are derived from natural observations


  • Simplify the world and provide a shared language for scientific study

    • Can be used to study the same phenomena across diverse contexts

    • Example: What does Leadership look like in hunter-gatherer societies, in the military and in the music industry?

Example of complexity: Life satisfaction

Are older people more satisfied with life? We have data on 112 people from 12 different dwellings (cities/towns) in Scotland: their ages and some measure of life satisfaction.

library(tidyverse)  # provides read_csv() and the pipe
d3 <- read_csv("https://uoepsy.github.io/data/lmm_lifesatscot.csv")
head(d3)
# A tibble: 6 × 4
    age lifesat dwelling size 
  <dbl>   <dbl> <chr>    <chr>
1    40      31 Aberdeen >100k
2    45      56 Glasgow  >100k
3    40      51 Glasgow  >100k
4    40      55 Dundee   >100k
5    40      41 Dundee   >100k
6    55      69 Perth    <100k


  • Did anyone stop to think - What is lifesat (i.e., life satisfaction)?

  • Discussion: define life satisfaction.

Constructs, measures and observations

Impact of differences in perspectives

  • Different operationalisations make it difficult to consolidate findings:

    • Jingle fallacy - Using same name to denote different things
    • Jangle fallacy - Using different names to denote same thing
  • “Nobody wants to use somebody else’s toothbrush” (Elson et al., 2023)

Psychometrics

  • Scientific discipline concerned with the construction of psychological measurements

  • Connects observable phenomena (e.g., item responses) to theoretical attributes (e.g., life satisfaction)

    • Theoretical constructs are defined by their domains of observable behaviours


  • Psychometricians study the conceptual and statistical foundations of constructs, the measures that operationalise them, and the models used to represent them

  • Applications across many sciences (e.g., psychology, behavioural genetics, neuroscience, political science, medicine)

Types of psychometric tests

  • Tests of typical performance

    • What participants do on a regular basis
    • Examples: Interests, values, personality traits, political beliefs
    • Real-world example: “Which Harry Potter house are you in?”
  • Tests of maximal performance

    • What participants can do when exerting maximum effort
    • Examples: Aptitude tests, exams, IQ tests
    • Real-world example: Duolingo, Wordle, revision apps


  • For the most part, the same statistical models are used to evaluate both

Applications of psychometric tests

  • Education

    • Aptitude / ability tests (i.e., standard school tests)
    • Vocational tests
  • Business

    • Selection (e.g., personality, skills)
    • Development (e.g., interests, leadership)
    • Performance (e.g., well-being, engagement)
  • Health

    • Mental health symptoms e.g., anxiety
    • Clinical diagnoses e.g., personality disorders
  • Key takeaway: People make life-changing decisions using psychometric evidence every day

Criteria for good psychometrics

  • Psychometric tests have many important applications, so they must:

    • Assess what they are supposed to assess
    • Be consistent and reliable
    • Produce interpretable scores
    • Be relevant for specific populations
    • Differentiate between people in a fair way


  • In this course we will cover the first three and how psychologists evaluate them; the last two are context-dependent

Diagrammatic conventions

  • In this section of the course we distinguish between variables that are:

    • Square = Observed / measured variable

    • Circle = Latent / unobserved

    • Two-headed arrow = Covariance

    • Single headed arrow = Regression path

Representational not actual measurement

  • We cannot take our ruler and measure life satisfaction

  • Create tests and hope responses tell us something about the construct we are interested in


  • Important: Our data are only ever item responses, not the construct itself


  • Psychometrics is pseudo-representational: useful representation of target construct rather than ‘ground truth’ of universe

Measurement error

All measurement is befuddled by error

McNemar (1946, p.294)

  • Every measurement we take contains some error; the goal is to minimise it

  • Error can be:

    • Random = Unpredictable. Inconsistent values due to something specific to the measurement occasion

    • Systematic = Predictable. Consistent alteration of the observed score due to something constant about the measurement tool

  • Can you think of any examples?

Unit of analysis: correlations and covariance

  • Unit of analysis is covariance

    • Variance = Deviation around the mean of a single variable
    • Covariance = Representation of how two variables change together
    • Correlation = Standardised version of covariance (a short sketch follows this list)


  • We are trying to explain patterns in the correlation matrix

    • i.e. among a set of items
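
A minimal sketch of these three quantities in base R, using two made-up vectors of item scores (the numbers are purely illustrative):

x <- c(3, 4, 5, 3, 1, 4)         # toy scores on one item
y <- c(3, 4, 5, 4, 1, 3)         # toy scores on another item
var(x)                           # variance: spread around the mean of x
cov(x, y)                        # covariance: how x and y change together
cov(x, y) / (sd(x) * sd(y))      # correlation: covariance standardised by the SDs
cor(x, y)                        # the same value straight from cor()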

Can you see any patterns of inter-relations in the correlation matrix below?


lsat_data <- read_csv("data/lifesat.csv")
round(cor(lsat_data),2)
        lfsat_1 lfsat_2 lfsat_3 lfsat_4 lfsat_5 lfsat_6
lfsat_1    1.00    0.72    0.71    0.00    0.03    0.01
lfsat_2    0.72    1.00    0.73   -0.01    0.01   -0.03
lfsat_3    0.71    0.73    1.00   -0.01    0.01   -0.01
lfsat_4    0.00   -0.01   -0.01    1.00    0.72    0.71
lfsat_5    0.03    0.01    0.01    0.72    1.00    0.75
lfsat_6    0.01   -0.03   -0.01    0.71    0.75    1.00

Scale scores

Classical test theory (CTT)

  • Classical test theory describes scores on any measure as a combination of signal (i.e., true score) and noise (i.e., error):

\[ \text{Observed score} = \text{True score} + \text{Error} \]

  • Our test measures some ability or trait, and in the world there is a “true” score on this test for each individual

  • The observed score is unlikely to perfectly reflect the participant’s true value on the construct

CTT diagram

  • True score = Variance in the score explained by the target construct

  • Error = Variance in the score explained by other things (i.e., random or systematic)

  • Observed score = What we actually record in the dataset

  • Goal of testing is to minimise error in observed scores

Scoring in CTT

  • Items are summed or averaged (i.e., the mean) to create a score for the target construct

  • Example of how to create mean scores in R:

lsat_data <- lsat_data %>%
  rowwise() %>%   # compute means within each participant (row)
  mutate(
    lfsat_mean1 = mean(c(lfsat_1, lfsat_2, lfsat_3)),  # scale 1: items 1-3
    lfsat_mean2 = mean(c(lfsat_4, lfsat_5, lfsat_6))   # scale 2: items 4-6
  )

head(lsat_data)
# A tibble: 6 × 8
# Rowwise: 
  lfsat_1 lfsat_2 lfsat_3 lfsat_4 lfsat_5 lfsat_6 lfsat_mean1 lfsat_mean2
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>       <dbl>       <dbl>
1       3       3       4       2       3       2        3.33        2.33
2       4       4       3       3       4       4        3.67        3.67
3       5       5       4       1       2       2        4.67        1.67
4       3       4       4       4       3       4        3.67        3.67
5       1       1       1       2       2       3        1           2.33
6       4       3       3       2       2       2        3.33        2   

How do we evaluate scores?

  • Scores are created by aggregating responses to multiple items

  • Groups of items measuring the same construct are referred to as scales

  • One or more related scales are administered as a measure, test, or battery

    • Administering multiple scales makes a measure multidimensional
  • How do we assess the performance of our scales?

Evaluating psychometric tests: Reliability

What is reliability?

  • Consistency of test results across multiple administrations

  • Important: A test can be highly reliable but not at all valid, depending on the construct

    • A tape measure is a reliable measure of length, but not of leadership!


  • Reliability is thus less ambiguous than validity

    • Validity is, to some degree, “in the eye of the beholder”

Parallel tests

  • Charles Spearman was the first to note that, under certain assumptions (i.e., the tests are truly parallel: each item measures the construct to the same extent), the correlation between two parallel tests provides an estimate of reliability


  • Parallel tests can come from several sources

    • Time tests were administered (test-retest)

    • Multiple raters (inter-rater reliability)

    • Items (alternate forms, split-half, internal consistency)

Test-retest reliability

  • Correlation between tests taken at 2+ points in time (assumed to be equivalent)

  • A cornerstone of test assessment that appears in many test manuals, but there are some tricky conceptual questions:

    • What’s the appropriate time between when measures are taken?

    • How stable should the construct be if we are to consider it a trait?

Inter-rater reliability

  • Ask a set of judges to rate a set of targets, compare similarity

    • Get friends to rate the personality of a family member

    • Get zoo keepers to rate the subjective well-being of an animal

  • We can determine how consistent raters are across:

    • Their individual estimates (i.e., across targets)

    • The reliability of the average estimate based on the judges’ ratings (i.e., across raters), as in the sketch below
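
A minimal sketch using the psych package’s ICC() function with a hypothetical ratings matrix (rows = targets, columns = judges; the numbers are invented). Its output includes intraclass correlations for both single ratings and the average of the judges:

library(psych)
# four targets each rated by three judges
ratings <- matrix(c(4, 5, 4,
                    2, 3, 2,
                    5, 5, 4,
                    1, 2, 1), ncol = 3, byrow = TRUE)
ICC(ratings)  # ICCs for single raters and for the k-rater average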

Alternate forms and split-half reliability

  • Correlation between two variants of a test:

    • Same items in different order (randomise the stimuli)

    • Tests with similar, but not identical, content (e.g., tests with a fixed number of numerical problems)

  • Assumption: If the tests were perfectly reliable, they would correlate perfectly (they won’t; the observed correlation estimates the reliability)


  • Split-half reliability: Split test into equal halves, score them up and correlate the halves
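
A minimal sketch of split-half reliability for a hypothetical six-item scale measuring a single construct (the data frame scale_data, items q1 to q6, and the odd/even split are all assumptions for illustration):

half_a <- rowMeans(scale_data[, c("q1", "q3", "q5")])  # odd-numbered items
half_b <- rowMeans(scale_data[, c("q2", "q4", "q6")])  # even-numbered items
r_half <- cor(half_a, half_b)           # correlation between the two halves
r_full <- (2 * r_half) / (1 + r_half)   # Spearman-Brown step-up

The Spearman-Brown step-up is needed because each half is only half the length, and therefore less reliable, than the full test.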

Internal consistency

  • Extent to which items correlate with each other within a scale

  • Most common assessment of reliability

    • Easy and cheap, all at one time-point with one set of items
  • Calculated through some form of the ratio: average covariance / average (variance + covariance); a worked example follows this list


  • Multiple ways to estimate:

    • Cronbach’s Alpha = Assumes all items are equivalent measures of the construct
    • McDonald’s Omega = Does not; more on this in W4
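
As a minimal sketch, Cronbach’s alpha can be computed by hand from the standard ratio-of-variances formula, here for the first three life-satisfaction items (in practice, functions such as psych::alpha() report it alongside item-level diagnostics):

items <- lsat_data[, c("lfsat_1", "lfsat_2", "lfsat_3")]
k     <- ncol(items)
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
(k / (k - 1)) * (1 - sum(sapply(items, var)) / var(rowSums(items)))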

Evaluating psychometric tests: Validity

What is validity?

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of the test scores for the proposed uses that are evaluated, not the test itself.

Standards for Educational and Psychological Testing

Debates about the definition

Whether a test really measures what it purports to measure (Kelley, 1927)

How well a test does the job it is employed to do. The same test may be used for … different purposes and its validity may be high for one, moderate for another and low for a third (Cureton, 1951)

Validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989)

A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes (Borsboom et al., 2004)

Evidence for validity

  • Validity is more nebulous and debated than reliability

    • Debates about how to define validity lead to questions about what constitutes evidence

Contemporary perspective:

The goal of psychometric development is to generate a psychometric that accurately measures the intended construct, as precisely as possible, and that uses of the psychometric are appropriate for the given purpose, population, and context.

(Hughes, 2018, p. 22)

Content evidence

  • Content / construct validity

    • A test should contain only content relevant to the intended construct

    • It should measure what it was intended to measure

    • “Easy” for questionnaire items, hard for tasks / implicit measures


  • Face validity

    • i.e., for those taking the test, does the test “appear to” measure what it was designed to measure?

Response processes

All measurement in a test occurs between the participant reading the item and selecting a response

  • Everything we have discussed so far is analysed after measurement

  • We also need to assess content / construct validity during the data-generating process (i.e., while participants complete the questionnaire)


  • Can do this using qualitative think-aloud-protocol interviews:

    • Participants complete the questionnaire, select a response option, and verbalise their reasoning / self-construal / opinion

Response processes

  • Example of a think-aloud-protocol output: [figure omitted]

Structural validity

  • Many constructs are multi-dimensional i.e. they have multiple underlying components

    • e.g. Narcissism = Grandiosity + Vulnerability + Antagonism


  • Goal is to assess whether the items ‘fit’ this structure

  • Then assess stability of structure across samples / time / groups

  • Most commonly assessed using exploratory / confirmatory factor analysis

    • Distinction explained in week 4

Relationships with other constructs

  • Convergent: Measure should have high correlations with other measures of the same construct

  • Discriminant: Measure should have low correlations with measures of different constructs

  • Nomological net:

    • Measure should show the expected (positive/negative) correlations with other constructs

    • Also, some measures should vary depending on manipulations (e.g., a measure of “stress” should be higher when someone is about to take an exam)

Relationships with other constructs

  • Consider relations in terms of temporal sequence

  • Concurrent validity: Correlations with contemporaneous measures e.g.:

    • Neuroticism and subjective well-being
    • Extraversion and leadership
  • Predictive validity: Related to expected future outcomes e.g.:

    • IQ and health
    • Agreeableness and future income

Consequences

  • Perhaps most controversial aspect of current validity discussions

  • Evaluate the test based on what it reveals (e.g., differences between groups) and on the decisions made from its results

  • Should potential consequences of test use be considered part of the evidence for test’s validity?

  • Important questions for the use of tests

    • Is my measure systematically biased, or is it fair for all groups of test takers?
    • Does the bias have social ramifications?

Relationship between reliability and validity

  • Reliability: relation of true score to observed score

  • Validity: correlations with other measures play a key role


  • Low reliability: Correlations between observed variables are attenuated and underestimated

  • Reliability is thus the ceiling for validity: tests cannot correlate with each other more than they correlate with themselves
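
This ceiling is captured by Spearman’s classic correction for attenuation: the observed correlation between two measures equals the true-score correlation shrunk by the square roots of their reliabilities,

\[ r_{xy} = r_{x_T y_T} \sqrt{r_{xx} \, r_{yy}} \]

so even a perfect true-score correlation (\(r_{x_T y_T} = 1\)) cannot produce an observed correlation above \(\sqrt{r_{xx} r_{yy}}\).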

Where can you find this information?

  • Test manuals (should) contain all information needed to assess reliability and validity

  • Papers describing new tests and papers investigating existing measures in different groups, languages, contexts, etc.

    • Assessment
    • Psychological Assessment
    • European Journal of Psychological Assessment
    • Organisational Research Methods
    • Personality journals
  • Papers describing new ways to establish reliability, validity, etc found in:

    • Behaviour Research Methods
    • Psychometrika
    • Multivariate Behavioural Research

Other methods of scoring

Assumption of classical test theory

  • Assumption of classical test theory: indicators are equivalent (i.e., all items measure the construct to the same extent)

    • Is this a realistic assumption for psychological tests?
  • Do “I am never dissatisfied with my life” and “I am relatively happy about my life” both measure life satisfaction to the same extent?

Unit-weighted scores

  • A mean score from a set of items assumes all items contribute equally

  • Equivalent to multiplying each observation by 1 before summing

  • Do both items contribute the same to life satisfaction?

Weighted scores

  • Weighted scores are created by multiplying each observation by a unique weight before averaging / summing (see the sketch after this list)

  • Allows each item to contribute in a unique way

  • A more realistic representation of psychometric items?
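
A minimal sketch contrasting the two approaches for the first three life-satisfaction items (the weights are invented for illustration; in practice they come from a dimension-reduction model, as discussed next):

items <- as.matrix(lsat_data[, c("lfsat_1", "lfsat_2", "lfsat_3")])
unit_score     <- items %*% c(1, 1, 1) / 3    # every item weighted 1: the mean
weighted_score <- items %*% c(0.5, 0.3, 0.2)  # each item contributes uniquely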

How do we identify the weights?

  • Dimension reduction models reveal relationships between items and underlying dimensions (i.e., aggregations of multiple items)

  • Two most common in psychology: Principal components analysis (PCA) and Factor analysis (FA)

    • PCA = items → component

    • FA = latent variable → items

  • Important: We also use these techniques to assign items to scales; this is the focus next week (a quick preview below)
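
As a quick preview, base R’s prcomp() can extract such weights; a minimal sketch on the six life-satisfaction items (interpreting these is next week’s focus):

pca <- prcomp(lsat_data[, paste0("lfsat_", 1:6)], scale. = TRUE)
pca$rotation[, 1:2]   # item weights (loadings) on the first two components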

Summary

This week

  • Psychometrics is the study of how to measure psychological constructs

  • We create scale scores by aggregating indicators (i.e., items)

  • Psychometric scores / tests evaluated using:

    • Reliability: How consistent is the measurement?
    • Validity: Am I measuring what I want to measure?
  • Multiple ways of aggregating items / creating scores: sums / means, factor analysis, principal components analysis