When our research hypothesis involves a population slope, for example, we wish to test whether in a model such as \[ y = \beta_0 + \beta_1 x + \epsilon \] the slope coefficient is significantly different from 0 or not.

However, the population slope is an example of a parameter, a numerical summary of the population. Typically, we do not have data for the entire population, so we can only compute a numerical summary for a sample, called a statistic. We use the statistic \(\widehat \beta_1\) (= estimated regression slope) to estimate the unknown population parameter \(\beta_1\) (= true regression slope).

In the past weeks we have used the results from the t-test and corresponding p-values returned by the summary() function. However, such test results are valid only when the regression assumptions are satisfied.

For example, recall the Riverview data from Semester 1:

library(tidyverse)

riverview <- read_csv(file = "https://uoepsy.github.io/data/riverview.csv")

mdl <- lm(income ~ 1 + education, data = riverview)
summary(mdl)

## 
## Call:
## lm(formula = income ~ 1 + education, data = riverview)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.809  -5.783   2.088   5.127  18.379 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3214     6.1232   1.849   0.0743 .  
## education     2.6513     0.3696   7.173 5.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.978 on 30 degrees of freedom
## Multiple R-squared:  0.6317, Adjusted R-squared:  0.6194 
## F-statistic: 51.45 on 1 and 30 DF,  p-value: 5.562e-08

Another key question we were able to answer by looking at the summary() output, was “How accurate can we expect the estimates to be?” This was given by the standard error (SE in short), reported in the Std. Error column. The SE tells us the size of a typical “estimation error,” i.e. a typical distance between an estimate for a given sample and the true but unknown population value.

Clearly, now you might wonder, what should we do then if there is no way to satisfy the regression assumptions by transforming variables?

Here’s the answer: bootstrap!

This week we examine bootstrapping, which is another technique for drawing inferences to the population from a regression model. This method is assumption-free and does not rely on conditions such as normality of the residuals.

Before continuing, this table summarises the new terms that will be explained throughout the exercises.

Bootstrap Terminology

A parameter is a numerical summary for the population, e.g. the population slope \(\beta_1\)
A statistic is a numerical summary calculated from the sample data, e.g. the estimated slope in the sample \(\widehat \beta_1\)
A bootstrap sample is chosen with replacement from an existing sample, using the same sample size.
A bootstrap statistic is a statistic computed for each bootstrap sample.
A bootstrap distribution collects bootstrap statistics for many bootstrap samples.

Age, Knowledge and Attitudes About Science

This week’s lab explores whether attitudes towards science and faith are related to knowledge about science and to age, using a subset of data from the 2005 Eurobarometer 63.1 survey.

The research question guiding this example is:

Is age and knowledge about science associated with attitudes towards science and faith?

We can also state this in the form of two research hypotheses:

There is a significant linear relationship between people’s age and their attitudes to science and faith after accounting for their scientific knowledge.
There is a significant linear relationship between people’s scientific knowledge and their attitudes to science and faith after accounting for their age.

The Data

Download the data here: https://uoepsy.github.io/data/science-faith-attitude.csv

This example uses three variables from the Eurobarometer 63.1 survey:

Variable	Description
`kstot`	Score on a science “quiz” composed of 13 true/false items.
`toomuchscience`	Attitude to science and faith (question wording: “We rely too much on science and not enough on faith”; responses on a 5-point scale from strongly disagree to strongly agree).
`age`	Age measured in years.

The science knowledge quiz has a range of 0 to 13. Its mean is about 8.7. The attitude to science and faith question has five categories, ranging from 0 to 4, with a mean of about 2.5. Age has a range of 15–93, with a mean of about 45. This is formally an ordinal variable but, in line with common practice in applied research, we regard it as continuous, as we do the other two variables as well.

Question 1

Read the data into R, and call the data ebsurvey. Remember, you can download the data here: https://uoepsy.github.io/data/science-faith-attitude.csv
How many observations and how many variables are there?
Today we will be only using the kstot, age, and toomuchscience columns. Subset the data to only have those 3 columns.
Is there any missing values? If yes, remove the rows with missing values.

Solution

library(tidyverse)

# Read the data
ebsurvey <- read_csv('https://uoepsy.github.io/data/science-faith-attitude.csv')
# Inspect top 6 rows
head(ebsurvey)

## # A tibble: 6 x 7
##      v5    v6 kstot   age  male toomuchscience solveprob
##   <dbl> <dbl> <dbl> <dbl> <dbl>          <dbl>     <dbl>
## 1     1     2    12    35     1              2         0
## 2     2     2    11    34     0              1         1
## 3     3     2     7    40     0             NA        NA
## 4     4     2     9    32     0              3         0
## 5     5     2     6    35     0              3         1
## 6     6     2    11    67     1             NA        NA

# Check data dimensions
dim(ebsurvey)

## [1] 21886     7

There are 21886 observations on 7 variables.

However, today we will be only using the kstot, age, and toomuchscience variables. We can subset the data:

ebsurvey <- ebsurvey %>%
    select(kstot, age, toomuchscience)

Are there any NA values in the data?

anyNA(ebsurvey)

## [1] TRUE

# Yes, so we omit them
ebsurvey <- na.omit(ebsurvey)
# Check new data dimensions
dim(ebsurvey)

## [1] 10503     3

Question 2

Give the variables more meaningful names. Rename kstot to science_knowledge and rename toomuchscience to attitude.

Hint: have a look at the help page of the function rename.

Solution

ebsurvey <- ebsurvey %>%
    rename(science_knowledge = kstot,
           attitude = toomuchscience)
head(ebsurvey)

## # A tibble: 6 x 3
##   science_knowledge   age attitude
##               <dbl> <dbl>    <dbl>
## 1                12    35        2
## 2                11    34        1
## 3                 9    32        3
## 4                 6    35        3
## 5                 9    37        1
## 6                 5    63        2

Before producing multiple regression models, it is a good idea to look at each variable separately.

Question 3

Explore the distribution of age in years, age.

Solution

ggplot(ebsurvey, aes(x = age)) +
    geom_histogram(color = 'white', binwidth = 5) +
    labs(x = 'Age (years)', 
         y = 'Frequency')

We can obtain summary statistics as follows:

library(kableExtra)

ebsurvey %>%
    summarise(Min = min(age),
              Median = quantile(age, 0.5),
              IQR = IQR(age),
              Mean = mean(age),
              SD = sd(age),
              Max = max(age)) %>%
    kable(digits = 2, 
          caption = 'Descriptive statistics of age') %>%
    kable_styling(full_width = FALSE)

Table 1: Descriptive statistics of age
Min	Median	IQR	Mean	SD	Max
15	44	27	44.93	17.32	93

The mean age in the sample is about 45 years with a standard deviation of just over 17 years. The distribution looks approximately normal, with a slight positive skew.

Question 4

Explore the distribution of science knowledge quiz scores, science_knowledge.

Solution

An histogram of science knowledge quiz scores is provided below:

ggplot(ebsurvey, aes(x = science_knowledge)) +
    geom_histogram(binwidth = 1, color = 'white') +
    labs(x = 'Science knowledge quiz scores', 
         y = 'Frequency')

We can obtain summary statistics as follows:

ebsurvey %>%
    summarise(Min = min(science_knowledge),
              Median = quantile(science_knowledge, 0.5),
              IQR = IQR(science_knowledge),
              Mean = mean(science_knowledge),
              SD = sd(science_knowledge),
              Max = max(science_knowledge)) %>%
    kable(digits = 2, 
          caption = 'Descriptive statistics of science knowledge scores') %>%
    kable_styling(full_width = FALSE)

Table 2: Descriptive statistics of science knowledge scores
Min	Median	IQR	Mean	SD	Max
0	9	4	8.68	2.61	13

The histogram shows that the majority of values on the science knowledge quiz score cluster between about 5 and 11. There is a slight negative skew to the distribution. Overall there is little reason for concern as to the appropriateness of the variable for inclusion.

Question 5

Explore the distribution of attitude to science and faith scores, attitude.

Solution

ggplot(ebsurvey, aes(x = attitude)) +
    geom_histogram(color = 'white', binwidth = 1) +
    labs(x = 'We rely too much on science and not enough on faith', 
         y = 'Frequency')

We can obtain summary statistics as follows:

ebsurvey %>%
    summarise(Min = min(attitude),
              Median = quantile(attitude, 0.5),
              IQR = IQR(attitude),
              Mean = mean(attitude),
              SD = sd(attitude),
              Max = max(attitude)) %>%
    kable(digits = 2, 
          caption = 'Descriptive statistics of science attitude') %>%
    kable_styling(full_width = FALSE)

Table 3: Descriptive statistics of science attitude
Min	Median	IQR	Mean	SD	Max
0	2	2	2.2	1.22	4

The mean score on the science and faith attitude variable is just over 2. There are only 5 discrete values possible in the distribution, based on the response options available, but the distribution looks approximately normal, with a slight negative skew.

Question 6

Visualise the pairwise relationships between your variables and explore the possible correlations.

Solution

library(patchwork)

p1 <- ggplot(ebsurvey, aes(age, science_knowledge)) +
    geom_point() +
    labs(x = 'Age (years)',
         y = 'Science knowledge quiz scores')

p2 <- ggplot(ebsurvey, aes(age, attitude)) +
    geom_point() +
    labs(x = 'Age (years)',
         y = 'Science attitude')

p3 <- ggplot(ebsurvey, aes(science_knowledge, attitude)) +
    geom_point() +
    labs(x = 'Science knowledge quiz scores',
         y = 'Science attitude')

p1 | p2 | p3

From the pairwise scatterplots, it does not seem like there is a strong linear dependence of attitude to science and faith on a person’s age and science knowledge.

Correlation matrix:

cor(ebsurvey)

##                   science_knowledge         age    attitude
## science_knowledge         1.0000000 -0.12239434 -0.17634507
## age                      -0.1223943  1.00000000  0.05496583
## attitude                 -0.1763451  0.05496583  1.00000000

In this case, the Pearson correlation coefficient between age and science_knowledge is -0.12. The correlation is relatively small in absolute terms, and we therefore have little concern about multicollinearity influencing this regression analysis.

The correlation between attitude and age is 0.05, while with science_knowledge it is -0.17. So, overall there is a very weak linear relationship.

Now that we have explored each variable by itself, we can estimate the multiple regression model.

\[ \texttt{attitude}_i = \beta_0 + \beta_1 \ \texttt{science_knowledge}_i + \beta_2 + \texttt{age}_i + \epsilon_i \]

Regression results are often presented in a table that reports the coefficient estimates, their estimated standard errors, t-scores, and levels of statistical significance.

Question 7

Fit the model specified above using R
Check for violations of the model assumptions.

Solution

mdl <- lm(attitude ~ science_knowledge + age, data = ebsurvey)

par(mfrow = c(2,2))
plot(mdl)

If you are doubting that the model assumptions are satisfied, don’t throw your PC up in the air, but rather keep reading!

The Bootstrap

The bootstrap is a general approach to assessing whether the sample results are statistically significant or not, which does not rely on specific distributional assumptions such as normality of the errors. It is based on sampling repeatedly with replacement (to avoid always getting the original sample exactly) from the data at hand, and then computing the regression coefficients from each resample. We will equivalently use the word bootstrap sample or resample (for SAMPLE with REplacement).

The basic principle is:

The population is to the original sample

the original sample is to the bootstrap samples.

Because we only have one sample of size \(n\), and we do not have access to the data for the entire population, we consider our original sample as our best approximation to the population. To be more precise, we assume that the population is made up of many, many copies of our original sample. Then, we take multiple samples each of size \(n\) from this assumed population. This is equivalent to sampling with replacement from the original sample.

We will explain, without loss of generality, the bootstrap for regression in the simple case where our sample data consist of measurements on a response \(y\) and predictor \(x\) for a sample of \(n\) individuals (or units):

\[ \begin{matrix} \text{Individual, }i & \text{Response, }y & \text{Predictor, }x \\ 1 & y_1 & x_1 \\ 2 & y_2 & x_2 \\ \vdots & \vdots & \vdots\\ n & y_n & x_n \\ \end{matrix} \]

We can write this in compact form by saying that we have sample data comprising \(n\) pairs of (response, predictor) data. This is our original sample: \[ (y_1, x_1), (y_2, x_2), \dots, (y_n, x_n) \]

For the original sample, we can obtain estimated intercept and slope: \[ \begin{aligned} \widehat \beta_0 &= \text{estimated intercept for original sample } (y_1, x_1), (y_2, x_2), \dots, (y_n, x_n)\\ \widehat \beta_1 &= \text{estimated slope for original sample } (y_1, x_1), (y_2, x_2), \dots, (y_n, x_n) \end{aligned} \]

To obtain one bootstrap sample from the original sample, sample \(n\) pairs with replacement from the original sample. Denote the new \(n\) pairs with an asterisk. Note that we can have repetitions of pairs from the original sample: \[ (y_1, x_1)^*, (y_2, x_2)^*, \dots, (y_n, x_n)^* \]

Then, we can fit the linear model to the data in the bootstrap sample, and compute the regression coefficients in the bootstrap sample, \(\widehat \beta_0^*\) and \(\widehat \beta_1^*\). These two regression coefficients are examples of bootstrap statistics. We call bootstrap statistic any numerical summary of the bootstrap sample, in the same way that a statistic is a numerical summary of a sample.

\[ \begin{aligned} \widehat \beta_0^* &= \text{estimated intercept for resample } (y_1, x_1)^*, (y_2, x_2)^*, \dots, (y_n, x_n)^*\\ \widehat \beta_1^* &= \text{estimated slope for resample } (y_1, x_1)^*, (y_2, x_2)^*, \dots, (y_n, x_n)^* \end{aligned} \]

Now, imagine doing this many times. That is, taking many bootstrap samples (say \(R = 1,000\)), each of size \(n\) individuals, and computing the regression intercept and slope for each bootstrap sample. You will obtain \(R\) bootstrap intercepts and \(R\) bootstrap slopes. Denote by \(\widehat \beta_{0}^{(5)}\) the bootstrap intercept in the 5th bootstrap sample. Similarly, \(\widehat \beta_{1}^{(5)}\) is the bootstrap slope in the 5th bootstrap sample. \[ \widehat \beta_{0}^{(1)}, \ \widehat \beta_{0}^{(2)}, \dots, \ \widehat \beta_{0}^{(R)} \\ \widehat \beta_{1}^{(1)}, \ \widehat \beta_{1}^{(2)}, \dots, \ \widehat \beta_{1}^{(R)} \] You can visualise the distribution of the \(R = 1,000\) bootstrap intercepts and slopes with histograms:

This is way too much math… Make it more concrete please!

Ok, this is more applied. Consider this as the original sample of \(n = 4\) individuals:

\[ \begin{matrix} \text{Individual, }i & \text{Response, }y & \text{Predictor, }x \\ 1 & 10.2 & 7 \\ 2 & 5.7 & 4 \\ 3 & 8.0 & 5 \\ 4 & 9.1 & 3 \\ \end{matrix} \]

or, more compactly, the original sample is:

\[ (10.2, 7), (5.7, 4), (8.0, 5), (9.1, 3) \]

A bootstrap sample is obtained by sampling 4 pairs from the original sample, with replacement. One such bootstrap sample could be: \[ (8.0, 5), (5.7, 4), (8.0, 5), (9.1, 3) \]

Then, fitting the linear model to these data, we can obtain a bootstrap intercept and bootstrap slope

y <- c(8, 5.7, 8, 9.1)
x <- c(5, 4, 5, 3)
coef(lm(y ~ x))

## (Intercept)           x 
##   8.9363636  -0.2909091

\[ \widehat \beta_0^* = 8.94 \qquad \widehat \beta_1^* = -0.29 \]

Now, imagine doing this many times. That is, taking many bootstrap samples (say \(R = 1,000\)), each of size \(n = 4\) individuals, and computing the regression intercept and slope for each bootstrap sample. You will obtain \(R\) bootstrap intercepts and \(R\) bootstrap slopes.

You can visualise the distribution of the bootstrap intercept and slopes with histograms:

That sounds all nice in theory! But how do I actually do this in R? Is it difficult???

No, it’s super easy! Follow these steps:

Step 1. Load the car library

library(car)

Step 2. Use the Boot() function (do not forget the uppercase B!) which takes as arguments:

the fitted model
f, saying which bootstrap statistics to compute on each bootstrap sample. By default f = coef, returning the regression coefficients.
R, saying how many bootstrap samples to compute. By default R = 999.
ncores, saying if to perform the calculations in parallel (and more efficiently). However, this will depend on your PC, and you need to find how many cores you have by running parallel::detectCores() on your PC. By default the function uses ncores = 1.

Step 3. Run the code. However, please remember that the Boot() function does not want a model which was fitted using data with NAs. In our case we are fine because we already removed them with na.omit.

boot_mdl <- Boot(mdl, R = 999)

Step 4. Look at the summary of the bootstrap results:

summary(boot_mdl)

## 
## Number of bootstrap replications R = 999 
##                     original    bootBias     bootSE    bootMed
## (Intercept)        2.7882487  1.8311e-03 0.05545034  2.7909036
## science_knowledge -0.0802763 -5.1728e-05 0.00442885 -0.0802218
## age                0.0023794 -2.5833e-05 0.00069909  0.0023376

The above output shows, for each regression coefficient, the value in the original sample in the column original, and then we will focus on the bootSE column, which estimates the variability of the coefficient from bootstrap sample to bootstrap sample. The bootSE provides us the bootstrap standard error, or bootstrap SE in short. We use this to answer the key question of how accurate our estimate is.

Step 5. Compute confidence intervals. Use your preferred confidence level, by default this is 95%:

Confint(boot_mdl, level = 0.95, type = "perc")

## Bootstrap percent confidence intervals
## 
##                       Estimate        2.5 %       97.5 %
## (Intercept)        2.788248712  2.682016848  2.898655860
## science_knowledge -0.080276256 -0.089240855 -0.071479349
## age                0.002379446  0.000962663  0.003786739

The type = "perc" argument tells R to return the values that comprise 95% of all values in between them, i.e. the value with 2.5% of observations below it and the value with 2.5% of observations above it and 97.5% of observations below it.

If you want to make it into a nice table:

Confint(boot_mdl, type = "perc") %>%
    kable(digits = 3, caption = 'Bootstrap 95% CIs') %>%
    kable_styling(full_width = FALSE)

Table 4: Bootstrap 95% CIs
	Estimate	2.5 %	97.5 %
(Intercept)	2.788	2.682	2.899
science_knowledge	-0.080	-0.089	-0.071
age	0.002	0.001	0.004

We are 95% confident that the population intercept is between 2.68 and 2.9. We are 95% confident that the population slope for science knowledge is between -0.09 and -0.07. We are 95% confident that the population slope for age is between 0.001 and 0.004.

The results in Table 4 report an estimate of the intercept (or constant) as equal to approximately 2.8. The constant of a multiple regression model can be interpreted as the average expected value of the dependent variable when all of the independent variables equal zero. In this case, the independent variable science knowledge has only a handful of respondents that score zero, and no one is aged zero, so the constant by itself does not tell us much. Researchers do not often have predictions based on the intercept, so it often receives little attention. A better choice would be to mean centre age, and refit the model with a mean centred age variable!

The estimated value for the slope coefficient linking knowledge to attitude is estimated to be approximately -0.08. This represents the average marginal effect of knowledge on attitude, and can be interpreted as the expected change in the dependent variable on average for a one-unit increase in the independent variable, controlling for age. In this example, every increase in quiz score by one point is associated with a decrease in attitude score of about –0.08, adjusted for age. Bearing in mind the valence of the question wording, this means that those who are more knowledgeable tend to be more favourable towards science – i.e. disagreeing with the statement.

The slope coefficient linking age to attitude is estimated to be approximately 0.002. This represents the average marginal effect of each additional year on attitude, and can be interpreted as the expected change in the dependent variable on average for a one-unit increase in the independent variable, controlling for science knowledge. For this example, that means that for every year older a person is, their attitude score is expected to increase by 0.002, controlling for science knowledge. This may seem like a very small effect, but remember that this is the effect of only one additional year. Bearing in mind the valence of the question wording, this means that older people tend to be less favourable towards science – i.e. agreeing with the statement.

The bootstrap confidence intervals table also reports that the 95% confidence intervals for both slope estimates do not include 0. This leads us to reject both null hypotheses at the 5% significance level, and conclude that there appear to be relationships for both age and science knowledge with attitude to science and faith.

Remember the summary of them model:

summary(mdl)

## 
## Call:
## lm(formula = attitude ~ science_knowledge + age, data = ebsurvey)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.88643 -0.97484  0.00913  0.88427  2.21489 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.7882487  0.0537759  51.849  < 2e-16 ***
## science_knowledge -0.0802763  0.0045095 -17.802  < 2e-16 ***
## age                0.0023794  0.0006792   3.503 0.000461 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.196 on 10500 degrees of freedom
## Multiple R-squared:  0.03223,    Adjusted R-squared:  0.03204 
## F-statistic: 174.8 on 2 and 10500 DF,  p-value: < 2.2e-16

The R-squared for the model is 0.031, which means that approximately 3% of the variance in attitude is explained by science knowledge and age.

How can we visualise the bootstrap intercepts and slopes? They are stored in the boot_mdl object.

This will show you the \(R = 999\) estimates, using head to show only the top 6.

head(boot_mdl$t)

##      (Intercept) science_knowledge          age
## [1,]    2.860584       -0.07946250 0.0009295118
## [2,]    2.729586       -0.07512774 0.0024392332
## [3,]    2.789480       -0.08029869 0.0025441840
## [4,]    2.749536       -0.07630859 0.0024819148
## [5,]    2.913322       -0.09170225 0.0012816341
## [6,]    2.803966       -0.07888200 0.0020855342

While this shows you the estimated intercept and slope in the original sample

boot_mdl$t0

##       (Intercept) science_knowledge               age 
##       2.788248712      -0.080276256       0.002379446

You can visualise the uncertainty in the estimates by plotting histograms either manually:

plot_data <- as_tibble(boot_mdl$t)
plot_data

## # A tibble: 999 x 3
##    `(Intercept)` science_knowledge      age
##            <dbl>             <dbl>    <dbl>
##  1          2.86           -0.0795 0.000930
##  2          2.73           -0.0751 0.00244 
##  3          2.79           -0.0803 0.00254 
##  4          2.75           -0.0763 0.00248 
##  5          2.91           -0.0917 0.00128 
##  6          2.80           -0.0789 0.00209 
##  7          2.76           -0.0803 0.00311 
##  8          2.75           -0.0787 0.00308 
##  9          2.69           -0.0733 0.00286 
## 10          2.77           -0.0795 0.00259 
## # … with 989 more rows

ggplot(plot_data, aes(science_knowledge)) +
    geom_histogram(color = 'white')

Or using the built-in function from the car package, which simply takes the bootstrap results boot_mdl:

hist(boot_mdl, ci = "perc", legend = "separate")

Presenting results

Question 8

Write a short paragraph summarising the results of this lab.

Include:

A brief description of which data were used to answer which questions.
A plot of your variables and summary statistics.
Describe your model and report results
Interpret results in context.

Solution

We used a subset of data from the 2005 Eurobarometer 63.1 survey to investigate whether:

There is a significant linear relationship between people’s age and their attitudes to science and faith after accounting for their scientific knowledge.
There is a significant linear relationship between people’s scientific knowledge and their attitudes to science and faith after accounting for their age.

The data, excluding any missing values, include 10503 individual respondents who were measured on 3 different attributes: (1) Score on a science “quiz” composed of 13 true/false items. (2) Attitude to science and faith (question wording: “We rely too much on science and not enough on faith”; responses on a 5-point scale from strongly disagree to strongly agree). (3) Age measured in years. Figure 1 displays the scatterplots of the pairwise relationships between the 3 variables. From the plot and the correlation matrix, displyed in Table 5, it appears that there is a weak negative linear association between attitude and science knowledge, and nearly no linear association between attitude and age.

Figure 1: Pairwise scatterplots of the variables.

Table 5: Correlation matrix
	science_knowledge	age	attitude
science_knowledge	1.000	-0.122	-0.176
age	-0.122	1.000	0.055
attitude	-0.176	0.055	1.000

To answer the research hypotheses, we fitted the following regression model: \[ \text{attitude} = \beta_0 + \beta_1 \ \text{science_knowledge} + \beta_2 \ \text{age} + \epsilon \]

Which resulted in the following estimated regression coefficients for the original sample: \[ \widehat{\text{attitude}} = 2.8 -0.08 \ \text{science_knowledge} + 0.0024 \ \text{age} \]

The model does not satisfy the regression assumptions, see Figure 2, and for this reason we will assess statistical significance using the boostrap approach with \(R = 999\) resamples.

Figure 2: Diagnostic plots

Table 6: Bootstrap 95% CIs
	Estimate	2.5 %	97.5 %
(Intercept)	2.788	2.682	2.899
science_knowledge	-0.080	-0.089	-0.071
age	0.002	0.001	0.004

The 95% bootstrap confidence intervals are provided in Table 6. These results show that there is a negative and statistically significant relationship between knowledge and attitudes. Specifically, the results show that for every additional correct quiz answer people give, we would expect a decline in attitude score of about 0.08. For age, the effect is in the opposite direction. For every additional year older a person is, they are expected to score .002 more on the attitude scale. The R-squared for the model is 0.032, which means that approximately 3% of the variance in attitude is explained by science knowledge and age. This leaves the majority of variation in attitudes unexplained by our model. Thus we conclude that respondents with greater knowledge about science also tend to be more positive about it, regardless of their age, while older people are slightly less positive, irrespective of their level of knowledge. Further diagnostic tests should be explored to evaluate the robustness of this finding.

References

Adapted from:

Allum, N. (2015). Learn about multiple regression in SPSS with data from the eurobarometer (63.1, jan–feb 2005). In SAGE Research Methods Datasets Part 1. SAGE Publications, Ltd.

Bootstrap for regression

Age, Knowledge and Attitudes About Science

The Data

The Bootstrap

Presenting results

References