In this document we are going to work through the “hand” calculations for a simple linear model.

Packages

For this example we will only need the tidyverse package.

library(tidyverse)

Data

Below is code to create the data we used in lecture. As this is a small example data set, we can make the tibble manually, rather than import it. The structure is pretty simple. On the left of the = we name the variable, on the right we enter the data.

test <- tibble(
  student = paste0("id", 1:10),
  hours = seq(0.5, 5, 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

The tibble function is pretty smart: it sets each variable’s type based on the values you enter. glimpse lets us look at the structure of our data set.

glimpse(test)
## Rows: 10
## Columns: 3
## $ student <chr> "id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9",…
## $ hours   <dbl> 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0
## $ score   <dbl> 1, 3, 1, 2, 2, 6, 3, 3, 4, 8

So we can see student is being recognised as a character vector (chr), and our hours and score variables as dbl, that is, numeric.

Calculating slope

From lecture, we can calculate the slope using the following:

\[ \hat \beta_1 = \frac{SP_{xy}}{SS_x} \]

Where \(SP_{xy}\) = sum of cross-products:

\[ SP_{xy} = \sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \]

and \(SS_x\) = sum of squared deviations of \(x\):

\[ SS_x = \sum_{i = 1}^{n}(x_i - \bar{x})^2 \]

Sum of cross-products

\[ SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \]

Let’s break down what this is doing. The \(\sum_{i=1}^{n}\) tells us to perform the calculation that follows for each individual (\(i\)), from the first row of data to the last (\(n\)), and then add up the results.

The two terms in parentheses require us to subtract the mean of \(x\) (\(\bar{x}\)) and the mean of \(y\) (\(\bar{y}\)) from the individual values of \(x\) and \(y\) for each row of data or participant. We know this because these are subscripted with \(i\). In other words, for each person, we are calculating the deviation of their individual scores from the means.

We then multiply the deviations for each person. Once we have these for everyone, we sum them up.
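Before building this up column by column, note that the whole calculation collapses into a single line of base R. A quick sketch, with the hours and score values defined again here so the snippet stands alone:

```r
x <- seq(0.5, 5, 0.5)                 # hours
y <- c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)  # score

# Deviations from each mean, multiplied pairwise, then summed
sum((x - mean(x)) * (y - mean(y)))
## [1] 21.75
```

We will arrive at this same number the long way below, which makes each intermediate quantity visible.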

What we are going to do now is add columns to our data set that represent the steps in this calculation. In our example, \(x\) = hours and \(y\) = score. First, let’s add both means.

calc <- test %>%
  mutate(
    m_hours = mean(hours),
    m_score = mean(score)
  ) %>%
  select(1,2,4,3,5)

calc
## # A tibble: 10 × 5
##    student hours m_hours score m_score
##    <chr>   <dbl>   <dbl> <dbl>   <dbl>
##  1 id1       0.5    2.75     1     3.3
##  2 id2       1      2.75     3     3.3
##  3 id3       1.5    2.75     1     3.3
##  4 id4       2      2.75     2     3.3
##  5 id5       2.5    2.75     2     3.3
##  6 id6       3      2.75     6     3.3
##  7 id7       3.5    2.75     3     3.3
##  8 id8       4      2.75     3     3.3
##  9 id9       4.5    2.75     4     3.3
## 10 id10      5      2.75     8     3.3

We use mutate to create new variables and add them to our data set. Notice for both m_hours and m_score the values in each row are the same. That is because there is only one mean. By default, mutate includes this for every row. I have used select to reorder the columns so the variable containing the values and the mean are next to one another. The numbers refer to the column numbers.
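A side note on select: positional indices work, but they silently break if columns are later added or reordered. Selecting by name is a more robust equivalent. A sketch, assuming the test tibble and tidyverse loaded as above:

```r
# Same reordering as select(1, 2, 4, 3, 5), but by column name
calc <- test %>%
  mutate(
    m_hours = mean(hours),
    m_score = mean(score)
  ) %>%
  select(student, hours, m_hours, score, m_score)
```

Either form produces the same tibble; names simply make the intent easier to read back later.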

Next, we can calculate the deviations:

calc <- calc %>%
  mutate(
    dev_hours = hours - m_hours,
    dev_score = score - m_score
  ) %>%
  select(1:3, 6, 4,5,7)
calc
## # A tibble: 10 × 7
##    student hours m_hours dev_hours score m_score dev_score
##    <chr>   <dbl>   <dbl>     <dbl> <dbl>   <dbl>     <dbl>
##  1 id1       0.5    2.75     -2.25     1     3.3    -2.3  
##  2 id2       1      2.75     -1.75     3     3.3    -0.300
##  3 id3       1.5    2.75     -1.25     1     3.3    -2.3  
##  4 id4       2      2.75     -0.75     2     3.3    -1.3  
##  5 id5       2.5    2.75     -0.25     2     3.3    -1.3  
##  6 id6       3      2.75      0.25     6     3.3     2.7  
##  7 id7       3.5    2.75      0.75     3     3.3    -0.300
##  8 id8       4      2.75      1.25     3     3.3    -0.300
##  9 id9       4.5    2.75      1.75     4     3.3     0.7  
## 10 id10      5      2.75      2.25     8     3.3     4.7

Again, we calculated two new variables and reordered the data set.

The last step is to multiply the two deviations together for each person.

calc <- calc %>%
  mutate(
    SPXYi = dev_hours * dev_score
  )
calc
## # A tibble: 10 × 8
##    student hours m_hours dev_hours score m_score dev_score  SPXYi
##    <chr>   <dbl>   <dbl>     <dbl> <dbl>   <dbl>     <dbl>  <dbl>
##  1 id1       0.5    2.75     -2.25     1     3.3    -2.3    5.18 
##  2 id2       1      2.75     -1.75     3     3.3    -0.300  0.525
##  3 id3       1.5    2.75     -1.25     1     3.3    -2.3    2.88 
##  4 id4       2      2.75     -0.75     2     3.3    -1.3    0.975
##  5 id5       2.5    2.75     -0.25     2     3.3    -1.3    0.325
##  6 id6       3      2.75      0.25     6     3.3     2.7    0.675
##  7 id7       3.5    2.75      0.75     3     3.3    -0.300 -0.225
##  8 id8       4      2.75      1.25     3     3.3    -0.300 -0.375
##  9 id9       4.5    2.75      1.75     4     3.3     0.7    1.23 
## 10 id10      5      2.75      2.25     8     3.3     4.7   10.6

And there we go. Our SPXYi variable contains the product of the deviations of each observation from the means of hours studied and test score. Our final step is just to add this up.

SPXY <- sum(calc$SPXYi)
SPXY
## [1] 21.75

Sum of squared deviations

So now to calculate \(SS_x\). Let’s look at the equation again:

\[ SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2 \]

What do you notice?

Well, we have already calculated the individual deviations \((x_1 - \bar{x}), \dots, (x_n - \bar{x})\). They are the column called dev_hours in the calc table. So all we need to do is square these for each observation and sum them.

calc <- calc %>%
  mutate(
    SSxi = dev_hours^2
  )
calc
## # A tibble: 10 × 9
##    student hours m_hours dev_hours score m_score dev_score  SPXYi   SSxi
##    <chr>   <dbl>   <dbl>     <dbl> <dbl>   <dbl>     <dbl>  <dbl>  <dbl>
##  1 id1       0.5    2.75     -2.25     1     3.3    -2.3    5.18  5.06  
##  2 id2       1      2.75     -1.75     3     3.3    -0.300  0.525 3.06  
##  3 id3       1.5    2.75     -1.25     1     3.3    -2.3    2.88  1.56  
##  4 id4       2      2.75     -0.75     2     3.3    -1.3    0.975 0.562 
##  5 id5       2.5    2.75     -0.25     2     3.3    -1.3    0.325 0.0625
##  6 id6       3      2.75      0.25     6     3.3     2.7    0.675 0.0625
##  7 id7       3.5    2.75      0.75     3     3.3    -0.300 -0.225 0.562 
##  8 id8       4      2.75      1.25     3     3.3    -0.300 -0.375 1.56  
##  9 id9       4.5    2.75      1.75     4     3.3     0.7    1.23  3.06  
## 10 id10      5      2.75      2.25     8     3.3     4.7   10.6   5.06

Then we sum these squared deviations:

SSx = sum(calc$SSxi)
SSx
## [1] 20.625
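As a sanity check, \(SS_x\) is closely related to the sample variance: var divides this same sum by \(n - 1\), so \(SS_x = (n - 1)\,\mathrm{var}(x)\). A quick base-R sketch:

```r
x <- seq(0.5, 5, 0.5)  # hours

SSx_direct   <- sum((x - mean(x))^2)      # sum of squared deviations
SSx_from_var <- (length(x) - 1) * var(x)  # same quantity via the sample variance

c(SSx_direct, SSx_from_var)
## [1] 20.625 20.625
```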

Return to the slope

We now have the two values we need to calculate the slope:

\[ \hat \beta_1 = \frac{SP_{xy}}{SS_x} = \frac{21.75}{20.625} = 1.054545 \]

beta1_hat <- SPXY/SSx
beta1_hat
## [1] 1.054545
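The same \(n - 1\) scaling links \(SP_{xy}\) to the sample covariance: \(\mathrm{cov}(x, y) = SP_{xy}/(n - 1)\). Because the \(n - 1\) terms cancel in the ratio, the slope can equivalently be written as \(\mathrm{cov}(x, y)/\mathrm{var}(x)\). A quick check:

```r
x <- seq(0.5, 5, 0.5)                 # hours
y <- c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)  # score

cov(x, y) / var(x)  # same ratio as SPXY / SSx
## [1] 1.054545
```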

Now to the intercept.

Calculating intercept

From lecture, the intercept is:

\[ \hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x} \]

So this one is much easier, as all we need are the slope we have just calculated (beta1_hat), the mean of score (\(\bar{y}\)), and the mean of hours (\(\bar{x}\)).

So…

beta0_hat <- mean(calc$score) - (beta1_hat * mean(calc$hours))
beta0_hat
## [1] 0.4
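A useful consequence of this formula: rearranging gives \(\bar{y} = \hat\beta_0 + \hat\beta_1\bar{x}\), so the fitted line always passes through the point of means \((\bar{x}, \bar{y})\). A quick check with our values:

```r
beta1_hat <- 21.75 / 20.625          # slope from above
beta0_hat <- 3.3 - beta1_hat * 2.75  # mean(score) - slope * mean(hours)

# Predicted score at the mean of hours recovers the mean of score
beta0_hat + beta1_hat * 2.75
## [1] 3.3
```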

Keep a note of these values. Below we plot the data with the fitted regression line, and then fit the same model with lm so we can compare its estimates to our hand calculations.

test %>%
  ggplot(aes(x = hours, y = score)) +
  geom_point() +
  geom_smooth(method = lm, 
              se = FALSE, 
              colour = "pink")
## `geom_smooth()` using formula = 'y ~ x'

m1 <- lm(score ~ hours, data = test)
summary(m1)
## 
## Call:
## lm(formula = score ~ hours, data = test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6182 -1.0773 -0.7454  1.1773  2.4364 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.4000     1.1111   0.360   0.7282  
## hours         1.0545     0.3581   2.945   0.0186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 8 degrees of freedom
## Multiple R-squared:  0.5201, Adjusted R-squared:  0.4601 
## F-statistic:  8.67 on 1 and 8 DF,  p-value: 0.01858
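Finally, we can confirm programmatically that our hand calculations match R’s estimates; coef extracts the fitted coefficients from an lm object. This sketch rebuilds everything from scratch (using a base data.frame) so it stands alone:

```r
test <- data.frame(
  hours = seq(0.5, 5, 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# Hand calculation, as in the steps above
SPXY <- sum((test$hours - mean(test$hours)) * (test$score - mean(test$score)))
SSx  <- sum((test$hours - mean(test$hours))^2)
beta1_hat <- SPXY / SSx
beta0_hat <- mean(test$score) - beta1_hat * mean(test$hours)

# lm()'s estimates, in (intercept, slope) order
m1 <- lm(score ~ hours, data = test)

all.equal(unname(coef(m1)), c(beta0_hat, beta1_hat))
## [1] TRUE
```

Any difference is pure floating-point noise; lm uses a QR decomposition internally but estimates exactly the same least-squares coefficients.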