Confidence intervals

Semester 2 - Week 1

1 Instructions

Register for your group on LEARN. Go to the course LEARN page, click Groups, click Labs_1_2_3, find your group and click Join.
Download the template Rmd file. Write your work in this file every week, and remember to save it often.
- Ensure your report title includes the group name: Group NAME.LETTER
- In the author section, ensure the report lists the exam numbers of all group members: B000001, B000002, …
Formative Report C spans the labs of the third block of DAPR1 (weeks 1-5 of semester 2).
- Report due date: 12 noon on Friday the 16th February 2024.
- The report should not include any reference to R code or functions, but be written or a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. The main part of the report should only show text, figures, and tables. Appendix B, which is compulsory, will show all the R code used - keep reading for details.
- The submitted report must be a PDF file of max 6 sides of A4 paper.
- Keep the default settings in terms of Rmd knitting font and page margins.
- At the end of the file, you will place the appendices and these will not count towards the page limit.
  - You can include an optional appendix for additional tables and figures which you can’t fit in the main part of the report;
  - You must include a compulsory appendix listing all of the R code used in the report. The template Rmd file include code that does it for you.
- No extensions. As these are group-based submissions, no extensions will be given.

2 Some Helpful Formatting Resources

The following useful resources will help you with finalising the report formatting and knitting, prior to submission.

2.1 Checklist for successful knitting

If you encounter errors when knitting the Rmd file, go through the following checklist to try finding the source of the errors.

Successful knitting checklist

2.2 APA style

Check the following guide for reporting numbers and statistics in APA style (7th edition).

Numbers and statistics guide, APA style 7th edition

2.3 Hiding code and/or output

To not show the code of an R code chunk, and only show the output, write:

```{r, echo=FALSE}
# code goes here
```

To show the code of an R code chunk, but hide the output, write:

```{r, results='hide'}
# code goes here
```

To hide both text output and figures, use:

```{r, results='hide', fig.show='hide'}
# code goes here
```

To hide both code and output of an R code chunk, write:

```{r, include=FALSE}
# code goes here
```

2.4 Change figure height and/or width

To change a figure height or width in a PDF file, use one or both of the following code chunk options:

fig.height = ? to change the figure height. Try values such as 3, 4, 6, 7, 8 until you find a good one.
fig.width = ? to change the figure width. Depending on whether you have one panel or multiple panels, try values such as 3, 4, 8, 9, 10.

```{r figure-label, fig.heigth = 4, fig.width = 8, fig.cap = "Figure caption goes here"}
plt_histogram | plt_boxplot
```

3 Formative report C

In weeks 1-5 of semester 2 your group should produce a PDF report (using Rmarkdown) for which you will receive formative feedback in Flexible Learning Week (FLW).

3.1 Data and Goal

In this report you will investigate the mean graduation rate for female students at colleges and universities in the United States. To do so, you will be using the dataset ipeds-2012.csv, available at https://uoepsy.github.io/data/ipeds-2012.csv. This is a subset of data derived from the Integrated Postsecondary Education Data System (IPEDS) at the National Center for Education Statistics, 2012. The data were collected for a random sample from all colleges and universities in the United States in that year.

The variables include:

Variable	Description
type	College Type: −1 = Not reported; 1 = Public; 2 = Private for-profit; 3 = Private not-for-profit (no religious affiliation); 4 = Private not-for-profit (religious affiliation)
region	Region: 0 = US Service schools; 1 = New England; 2 = Mid East; 3 = Great Lakes; 4 = Plains; 5 = Southeast; 6 = Southwest; 7 = Rocky Mountains; 8 = Far West; 9 = Outlying areas
gradrate	Graduation Rate – All NA = missing
gradratem	Graduation Rate – Men NA = missing
gradratew	Graduation Rate – Women NA = missing

Over the next five weeks you will estimate for the mean graduation rate of female students in US universities in 2012, and test whether this value differs from a baseline rate of 50%.

3.2 This week’s task

Task C1

Read the data into R, describe the variable of interest both visually and numerically, and provide an estimate and 95% CI for the mean graduation rate of female students at US colleges and universities in 2012.

Sub-steps

Below there are sub-steps you need to consider to complete this week’s task.

Tip

To see the hints, hover your cursor on the superscript numbers.

Read the data into R and inspect it.¹
- How many units are there?
- What’s the type of each variable?
- Are there any missing values in some variables? What about in the variable of interest for your analysis?
Visualise the distribution of the variable of interest (gradratew). What is the shape of the distribution? Are there any outliers?²
Compute and interpret a table of descriptive statistics for the variable of interest. At a minimum, ensure that it includes both a measure of centre and spread.³
Compute a 95% confidence interval for the mean graduation rate of female college students in 2012.⁴

For the report introduction, write a brief introduction to the data and question being investigated. How many cases are there? Are there any impossible or missing values? What is the type of the variables and which one is used for the investigation?
Provide a write up of your results so far, using proper rounding and making sure to report your results in context of the investigation.

4 Worked Example

The Procrastination Assessment Scale for Students (PASS) was designed to assess how individuals approach decision situations, specifically the tendency of individuals to postpone decisions (Solomon & Rothblum, 1984).

The PASS assesses the prevalence of procrastination in six areas: writing a paper; studying for an exam; keeping up with reading; administrative tasks; attending meetings; and performing general tasks. For a measure of total endorsement of procrastination, responses to 18 questions (each measured on a 1-5 scale) are summed together, providing a single score for each participant (range 0 to 90). The mean score from Solomon & Rothblum, 1984 was 33.

Investigation:

What is the average procrastination score of Edinburgh University students?

To answer this question, we will use data collected for a random sample of students from the University of Edinburgh: https://uoepsy.github.io/data/pass_scores.csv

Variable Name	Description
sid	Subject identifier
school	School each subject belonged to
PASS	Total endorsement of procrastination score

Necessary packages:

tidyverse for using read_csv(), using summarise() and ggplot().
patchwork for arranging plots side by side or underneath
kableExtra for creating user-friendly tables

library(tidyverse)
library(patchwork)
library(kableExtra)

Read the data into R:

pass_scores <- read_csv("https://uoepsy.github.io/data/pass_scores.csv")
dim(pass_scores)

[1] 20  3

To inspect the data:

head(pass_scores)

# A tibble: 6 × 3
  sid   school       PASS
  <chr> <chr>       <dbl>
1 s_1   GeoSciences    31
2 s_2   ECA            24
3 s_3   LAW            32
4 s_4   ECA            40
5 s_5   LAW            28
6 s_6   SSPS           31

glimpse(pass_scores)

Rows: 20
Columns: 3
$ sid    <chr> "s_1", "s_2", "s_3", "s_4", "s_5", "s_6", "s_7", "s_8", "s_9", …
$ school <chr> "GeoSciences", "ECA", "LAW", "ECA", "LAW", "SSPS", "PPLS", "SLL…
$ PASS   <dbl> 31, 24, 32, 40, 28, 31, 30, 28, 32, 29, 28, 33, 35, 33, 30, 31,…

summary(pass_scores)

     sid               school               PASS      
 Length:20          Length:20          Min.   :24.00  
 Class :character   Class :character   1st Qu.:28.75  
 Mode  :character   Mode  :character   Median :31.00  
                                       Mean   :30.70  
                                       3rd Qu.:32.00  
                                       Max.   :40.00

Visualise the distribution of PASS scores:

Note

The boxplot highlights an outlier (40). However, this value is well within the plausible range of the scale (0 – 90), hence it is of no concern and the point can be kept for the analysis.

plt_hist <- ggplot(pass_scores, aes(x = PASS)) + 
    geom_histogram(color = 'white')

plt_box <- ggplot(pass_scores, aes(x = PASS)) + 
    geom_boxplot()

plt_hist / plt_box

Descriptive statistics:

stats <- pass_scores %>%
    summarise(n = n(),
              Min = min(PASS),
              Max = max(PASS),
              M = mean(PASS),
              SD = sd(PASS))

stats %>%
    kbl(booktabs = TRUE, digits = 2, 
        caption = "Descriptive statistics for PASS scores")

Descriptive statistics for PASS scores
n	Min	Max	M	SD
20	24	40	30.7	3.31

When estimating a parameter, in this case the mean score on the Procrastination Assessment Scale for Students (PASS) for all Edinburgh University students, we do not just report the estimate (sample average score), but also something that reflects our uncertainty in the estimate. This can either be the standard error or a confidence interval. If asked to compute a 95% confidence interval for the mean score on the Procrastination Assessment Scale for Students (PASS) for all Edinburgh University students, we could do:

# Sample mean
xbar <- stats$M

# Standard error
s <- stats$SD
n <- stats$n
se <- s / sqrt(n)
se

[1] 0.7401991

# Quantiles
tstar <- qt(c(0.025, 0.975), df = n - 1)
tstar

[1] -2.093024  2.093024

# CI
xbar + tstar * se

[1] 29.15075 32.24925

WARNING!

This code won’t work if stats stores a kable, i.e. the result of kbl(). Make sure this only stores the tibble, rather than the pretty version from kbl()!

Reporting

These three code chunks should not be visible in the report. You can simply report the CI in a paragraph using the style [LowerCI, UpperCI].

Example introduction

A random sample of 20 students from the University of Edinburgh completed a questionnaire measuring their total endorsement of procrastination. The data, available from https://uoepsy.github.io/data/pass_scores.csv, were used to estimate the average procrastination score of all Edinburgh University students. The recorded variables included a subject identifier (sid), the school of each subject (school), and the total score on the Procrastination Assessment Scale for Students (PASS). The data do not include any impossible values for the PASS scores, as they were all within the possible range of 0 – 90. To answer the question of interest, in the following we will only focus on the total PASS score variable.

Example CI interpretation

From the sample data we obtain an average procrastination score of $M = 30.7$, 95% CI [29.15, 32.25]. Hence, we are 95% confident that a Edinburgh University student will have a procrastination score between 29.15 and 32.25, which is between 0.75 and 3.85 lower than the average score of 33 reported by Solomon & Rothblum.

5 Student Glossary

To conclude the lab, add the new functions to the glossary of R functions.

Function	(package) and use
`geom_histogram`	(tidyverse) creates a histogram
`geom_boxplot`	(tidyverse) creates a boxplot
`summarise`	(tidyverse) compute a numerical summary of the data
`n()`	(tidyverse) count the rows. To be used inside `summary()`
`mean`	Compute the mean of a column
`sd`	Compute the standard deviation of a column
`qt`	Computes the quantile of a t distribution. For example, `qt(0.1, df = 21)` returns the value in a t(21) distribution that cuts a probability of 0.1 to its left

Footnotes

Hint: Some of the following functions may be useful: read_csv() from tidyverse, head(), glimpse(), summary(), nrow(DATA), dim(DATA), length(DATA$VARIABLE), DATA %>% summarise(n = n())↩︎
Hint: geom_histogram(), geom_density(), geom_boxplot() may be useful functions.
To get rid of NAs in a variable of interest, you can use DATA %>% drop_na(VARIABLE) or add na.rm as an argument to mean(), sd(), etc.
We don’t recommend using na.omit() on the entire dataset, as it would remove any row with NAs, even in variables not used for the current analysis.↩︎
Hint: summarise() from the tidyverse package or describe() from the psych package↩︎
Hints:

Step 1: Compute the average gradution rate of female college students

Step 2: Compute the standard error of the mean

Step 3: Compute the quantiles of a t distribution with $n-1$ degrees of freedom, where $n$ = sample size, cutting a probability of 0.95 in between them.

Step 4: Obtain the confidence interval using the formula:

\[95\% \text{ CI: } \left[ \bar x - t^* \ SE_{\bar{x}}, \ \bar x + t^* \ SE_{\bar{x}} \right]\] \[\text{where} \qquad SE_{\bar{x}} = \frac{s}{\sqrt n}\]↩︎