class: center, middle, inverse, title-slide .title[ #
Week 1: Introductions to R and Statistics
] .subtitle[ ## Univariate Statistics and Methodology using R ] .author[ ### USMR Team ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- class: inverse, center, middle # Part 1 # Why R? --- # What is R? .flex.items-center[.w-20.pa2[ ![](lecture_1_files/img/rlogo.png)] .w-80.pa2[ - **R** is a 'statistical programming language' - created mid-90s as a free version of **S** - widespread adoption since v2 (2004) ]] .flex.items-center[.w-80.pa2[ - **RStudio** is an 'integrated development environment' (IDE) - created 2011 'to improve **R** experience' - widespread adoption since 2012 ] .w-20.pa2[ ![](lecture_1_files/img/rstudiologo.png)]] --- # R vs RStudio ### This is R ```r model <- lm(RT ~ (age+freq+handedness)^2, data=words) summary(model) ``` -- .flex[.w-50[ ### This is RStudio ] .w-50[![](lecture_1_files/img/rstudio1.png)]] --- # RMarkdown .flex.items-center[.w-20.pa2[ ![](lecture_1_files/img/rmarkdown.png)] .w-80.pa2[ - **RMarkdown** is a 'text markup language' - created 2012 as a markup language for **R** - widespread adoption since 2015 ]] --- # RMarkdown .flex.w-100.bg-light-gray[ ``` ### About RMarkdown _This_ is some **RMarkdown**, which uses 'simple' codes to mark up text. - it can include R code like `r sqrt(2)` - it's simple to format things like bulleted lists + or even sublists ``` ] .pt4[ ### About RMarkdown _This_ is some **RMarkdown**, which uses 'simple' codes to mark up text. - it can include R code like 1.4142 - it's simple to format things like bulleted lists + or even sublists ] --- class: inverse, center, middle # What is R Good For? --- # Managing Datasets .center[ ![:scale 70%](lecture_1_files/img/rtable.png) ] --- # Doing Statistics ``` Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod] Family: binomial ( logit ) Formula: DV ~ sc(FvO) * sc(EvC) + (1 | Code) + (0 + (sc(FvO) * sc(EvC)) | Code) + (1 | Item) Data: feminine Control: glmerControl(optimizer = "bobyqa") AIC BIC logLik deviance df.resid 879.3 943.6 -427.7 855.3 1558 ... Fixed effects: Estimate Std. 
Error z value Pr(>|z|) (Intercept) -1.0566 1.1485 -0.92 0.35758 sc(FvO) 1.2453 0.3505 3.55 0.00038 *** sc(EvC) -0.0915 0.3080 -0.30 0.76638 sc(FvO):sc(EvC) 0.0221 0.6321 0.04 0.97207 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ... ``` --- # Publication-Quality Graphics .center[ ![](lecture_1_files/img/ultra-profiles.svg) ] --- # Data Visualisation .center[ ![:scale 80%](lecture_1_files/img/fbook_sm.jpg) ] .tr[.f6[ https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919/ ]] --- # RMarkdown: Books For example: https://bookdown.org/csgillespie/efficientR/ <img src="lecture_1_files/img/dapr-intro-book.png" width="75%" style="display: block; margin: auto;" /> --- # RMarkdown: Websites For example: https://rmarkdown.rstudio.com/ <img src="lecture_1_files/img/dapr-intro-website.png" width="75%" style="display: block; margin: auto;" /> --- ![:scale 30%](lecture_1_files/img/3logos.png) - USMR course materials (the readings, these lecture slides, etc) are all created in **RStudio**, using **RMarkdown** and **R** ![](lecture_1_files/img/lectures.png) --- # Online Interactive Visualisation For example: https://shiny.rstudio.com/gallery/movie-explorer.html <img src="lecture_1_files/img/dapr-intro-plot.png" width="75%" style="display: block; margin: auto;" /> --- count: false # Online Interactive Visualisation For example: https://gallery.shinyapps.io/086-bus-dashboard/ <img src="lecture_1_files/img/dapr-intro-dash.png" width="75%" style="display: block; margin: auto;" /> --- # R for Anything to do with Data .pull-left.pt4[ ### Pride and Prejudice ```r require(tm) require(wordcloud) # load "Pride and Prejudice" pp <- Corpus(DirSource('R/PP/')) pp <- tm_map(pp,stripWhitespace) pp <- tm_map(pp,tolower) pp <- tm_map(pp,removeWords, stopwords('english')) pp <- tm_map(pp,stemDocument) pp <- tm_map(pp,removePunctuation) pp <- tm_map(pp, PlainTextDocument) wordcloud(pp, scale=c(5,0.5), max.words=150, random.order=FALSE, 
rot.per=0.35, colors=brewer.pal(12,'Dark2')) ``` ] .pull-right[ ![](lecture_1_files/figure-html/pp-1.svg) ] ??? R is a multipurpose programming language with an emphasis on statistics. It can do all of the things that any statistics package can do and much more: Here, we're using it to visualise the frequencies with which words are used in Jane Austen's _Pride and Prejudice_. --- # The R Community .left-column[ ![](lecture_1_files/figure-html/pp-1.svg) ] .right-column[ - _someone else_ has done all the hard work to create wordclouds - released as libraries or **packages** (like `lme4` and `tidyverse`) - all I supplied was a text version of _Pride and Prejudice_ .pt3[ - **R** allows you to do _anything_ with data - if it's useful, chances are someone has already done it - useful things include statistics! ]] --- # The R Community - if it serves no purpose, chances are that someone's already done it too ```r library(cowsay) say("hello USMR") ``` ``` ## ## -------------- ## hello USMR ## -------------- ## \ ## \ ## \ ## |\___/| ## ==) ^Y^ (== ## \ ^ / ## )=*=( ## / \ ## | | ## /| | | |\ ## \| | |_|/\ ## jgs //_// ___/ ## \_) ## ``` --- # Why Use R? .pt4[ - because it's a _language_, I can easily show you what I did and you can copy it - because it's a _language_, statisticians can use it to implement leading-edge stats - because it's _free_, anyone can use it---and anyone can access your research - because it's _open source_, anyone can fix or improve `R` ] --- <!-- HERE HERE HERE HERE HERE --> # Devilish stuff .pull-left[ ## doing stats ![:scale 47%](lecture_1_files/img/SPSS_logo.svg) ] .pull-right[ ## coding ![:scale 47%](lecture_1_files/img/js_logo.png) ![:scale 47%](lecture_1_files/img/python-logo.svg) .tc.pt3[ **NB** all indices in `R` start at `1` ]] --- # Why use R?? 
.pull-left[ ![:scale 85%](lecture_1_files/img/articles.png) ] .pull-right[ ![:scale 90%](lecture_1_files/img/jobs_indeed.png) .tr.f7[ https://r4stats.com/articles/popularity ]] --- class: inverse, center, middle, animated, heartBeat # End of Part 1 --- class: inverse, center, middle # Part 2 ## Getting to Grips with R --- # Data in R - you can type **data** directly into R ```r # a number 1.2 ``` ``` ## [1] 1.2 ``` ```r # characters (a string) "fáilte" ``` ``` ## [1] "fáilte" ``` - and you can do **operations** on data ```r 1.2 + 7 * 2 ``` ``` ## [1] 15.2 ``` ??? by "data" we really mean anything that is measured *outside* R and provided to R directly --- # Variables .left-column[ ![](lecture_1_files/img/playmog.jpg) ] .right-column[ - you can assign data to **variables** ```r bodyTemp <- 37.8 ``` - and use those variables ```r bodyTemp * (9/5) + 32 # to Fahrenheit ``` ``` ## [1] 100 ``` - **NB** spelling/capitalization matter ```r BodyTemp - 37 ``` ``` ## Error in eval(expr, envir, enclos): object 'BodyTemp' not found ``` ] --- # Statistics is about **groups** of things .flex.items-center[ .w-70.pa2[ ```r allTemps <- c(37.8, 0, 37.4) # vector maths allTemps * (9/5) + 32 ``` ``` ## [1] 100.04 32.00 99.32 ``` .pt2[ - note the **vectorization** of the calculation - R is designed from the bottom up to deal with groups ]] .w-30.pa2[ ![](lecture_1_files/img/playmo_tms.jpg) ]] --- # Not everything is a number .flex.items-center[ .w-70.pa2[ ```r allHair <- c("brown","white","black") allHair ``` ``` ## [1] "brown" "white" "black" ``` - these are called **character strings** + can be anything - **categories** (nominal data) are from a limited set + called **factors** in R ```r as.factor(allHair) ``` ``` ## [1] brown white black ## Levels: black brown white ``` ] .w-30.pa2[ ![](lecture_1_files/img/playmo_tms.jpg) ]] --- # Basic types of data (stats) .flex.items-center[ .w-70.pa2[ - **Nominal** ('names of things': e.g., hair colour) - **Ordinal** (order, no number: e.g.,
small-medium-large) - **Interval** (number without a true zero: e.g., body temp in ℃) - **Ratio** (number with a true zero: e.g., height) ] .w-30.pa2[ ![](lecture_1_files/img/playmo_tms.jpg) ]] --- # NOIR in R .flex.items-center[.w-40.pa1[
| Type     | R Variable Type  |
|----------|------------------|
| Nominal  | character/factor |
| Ordinal  | number           |
| Interval | number           |
| Ratio    | number           |
] .w-60.pa1[ - nominal ```r allHair <- as.factor(c("brown", "white", "black")) allHair ``` ``` ## [1] brown white black ## Levels: black brown white ``` - interval ```r allTemps <- c(37.8, 0, 37.4) allTemps ``` ``` ## [1] 37.8 0.0 37.4 ``` ]] -- .flex.items-center[ .w-5.pa1[ ![:scale 70%](lecture_1_files/img/danger.svg) ] .w-95.pa1[ - ordinal data can also be represented as **ordered factors** (`as.ordered()`) ]] ??? this is the first time I've used this symbol, which means "dangerous bend in the road". I'm going to use it when there's something additional that you really don't need to know but I can't help myself telling you --- # Break it down ```r allHair <- c("brown","white","black") ``` .flex.items-center[ .w-30.f3[ `allHair` ] .w-70[ - **variable** (can be anything that isn't _reserved_) ]] .flex.items-center[ .w-30.f3[ `<-` ] .w-70[ - **assignment** ("goes into") ]] .flex.items-center[ .w-30.f3[ `c()` ] .w-70[ - **function** (`c()` _combines_ its **arguments**) ]] .flex.items-center[ .w-30.f3[ `"brown"` ] .w-70[ - **character** (arbitrary sequence of symbols) ]] --- # Dataframes .flex.items-center[ .w-70.pa2[ - data can be grouped into a **dataframe** - each _row_ represents one set of observations - each _column_ represents one type of information + (a bit like a spreadsheet) ```r people <- data.frame(name=c('Johanna','Casper','Steve'), temp=allTemps, hair=as.factor(allHair), height=c(132,205,181)) people ``` ``` ## name temp hair height ## 1 Johanna 37.8 brown 132 ## 2 Casper 0.0 white 205 ## 3 Steve 37.4 black 181 ``` ] .w-30.pa2[ ![](lecture_1_files/img/playmo_tms.jpg) ]] ??? - note that I've made _hair_ a factor but left _name_ as character strings. - that's because, in my world, there is a finite set of hair colours, but a name can be anything: it's an arbitrary label. - you'll see the effect of this in the next slide. --- # Can you run a **function** on a **dataframe**? - youbetcha!
```r summary(people) ``` ``` ## name temp hair height ## Length:3 Min. : 0.0 black:1 Min. :132 ## Class :character 1st Qu.:18.7 brown:1 1st Qu.:156 ## Mode :character Median :37.4 white:1 Median :181 ## Mean :25.1 Mean :173 ## 3rd Qu.:37.6 3rd Qu.:193 ## Max. :37.8 Max. :205 ``` - or on a vector ```r mean(people$temp) # just the temp column from people ``` ``` ## [1] 25.07 ``` ??? note the difference between the name column ("there are some strings") and the hair column ("I know there are categories: this is how many of each you have") --- # We know a little about R - we've seen some R code - we know about basic data types - we know what variables are - we've seen vectors and dataframes - we've seen a couple of examples of functions ??? time for a break! --- class: inverse, center, middle, animated, heartBeat # End of Part 2 --- class: inverse, center, middle # Part 3 .pt3[ ![:scale 50%](lecture_1_files/img/Two_red_dice_01.svg) ] ??? in part 3 we're going to look at probability, starting with a couple of dice --- ## How likely are you to throw 12 with two dice? .left-column[ ![:scale 80%](lecture_1_files/img/two_sixes.svg) ] .right-column[ - pretty easy to work out - one-in-six chance of throwing a six - one-in-six chance of throwing a second six + NB: these observations are _independent_ + (wouldn't matter if you threw one die twice or two dice together) - `\(\frac{1}{6} \times \frac{1}{6} = \frac{1}{36}\)` chance of throwing two sixes ] --- # Are my dice fair? - one way to find out: throw two dice many times and count the outcomes .center[ ![](lecture_1_files/figure-html/justg-1.svg)<!-- --> ] --- # What would fair dice look like?
.pull-left[ ![](lecture_1_files/img/playmo_lazy.jpg) ] .pull-right[ - we need a lot of throws - first rule of coding: be lazy - let the computer do the work ] --- # Using RStudio .center[ ![:scale 80%](lecture_1_files/img/rstudio.png) ] --- count: false # Using RStudio .center[ ![:scale 80%](lecture_1_files/img/rstudio_zones.png) ] --- class: center, middle ## create some dice ??? - show creating a project - markdown; call project **dice** - first show `1:6` - then `sample(1:6, 1)` - then help for `sample()` - then `sample(1:6, 2, replace=T)` - then `sum(sample(1:6,2,replace=T))` - then make a `dice(num=1)` function which returns the sum, for default 1 --- # Now we can throw dice a _lot_ of times ```r dice <- function(num=1) { sum(sample(1:6, num, replace=TRUE)) } dice() ``` ``` ## [1] 1 ``` -- ```r dice(2) ``` ``` ## [1] 7 ``` --- # Throw two dice many times ```r replicate(250,dice(2)) ``` ``` ## [1] 11 7 7 4 10 9 7 6 12 3 4 9 6 11 10 6 7 8 9 9 7 10 8 6 12 ## [26] 10 7 6 7 7 8 11 3 5 6 11 3 4 7 9 7 11 3 6 8 8 10 8 8 6 ## [51] 3 4 7 7 10 9 9 7 5 11 7 8 5 3 9 5 4 3 8 4 11 10 5 10 6 ## [76] 7 6 11 5 7 7 10 5 5 2 6 7 5 4 7 5 9 8 6 6 9 7 9 8 6 ## [101] 6 8 6 4 5 5 6 4 10 9 3 4 9 5 6 5 9 3 8 5 4 7 6 5 7 ## [126] 8 11 9 10 7 8 6 6 12 8 3 7 7 9 6 4 4 6 5 7 7 2 5 9 12 ## [151] 6 11 6 7 4 5 5 8 6 8 11 7 5 7 6 4 7 11 3 5 2 7 8 8 7 ## [176] 10 6 7 7 3 3 5 4 3 4 11 3 9 7 8 7 10 9 9 6 10 7 7 6 10 ## [201] 9 5 9 9 7 5 7 9 3 7 5 5 3 6 7 5 4 3 8 6 11 11 6 9 7 ## [226] 10 9 6 4 8 10 7 8 4 5 8 8 7 9 9 11 11 6 5 9 10 11 9 7 9 ``` -- - ...and record the result ```r d <- replicate(250,dice(2)) ``` ??? 
- actually, `d` won't contain the same numbers as you see on the slide --- # Make a graph ```r table(d) ``` ``` ## d ## 2 3 4 5 6 7 8 9 10 11 12 ## 7 13 21 22 33 43 29 32 30 16 4 ``` --- count: false # Make a graph ```r barplot(table(d)) ``` ![](lecture_1_files/figure-html/dice5a-1.svg)<!-- --> --- # Many more throws ```r d <- replicate(10000,dice(2)) barplot(table(d)) ``` ![](lecture_1_files/figure-html/dice6-1.svg)<!-- --> --- # 10,000 dice throws .flex.items-top[ .w-20.pa2[ ![](lecture_1_files/figure-html/dice6-1.svg) ] .w-80.pa2[ - we can work out the proportion of throws that summed to 12 ```r sum(d == 12) / 10000 ``` ``` ## [1] 0.0281 ``` - and we know what that proportion should be if the dice are fair ```r 1/36 ``` ``` ## [1] 0.02778 ``` ]] --- # Some more (fake) dice throws .center[ ![](lecture_1_files/figure-html/fdice-1.svg)<!-- --> ] .br3.center.pa2.bg-green.white.f4[ are the patterns from the dice _different enough_ from what we would expect from fair dice for us to conclude that they're unfair? ] --- # Statistical questions - so the million-dollar question is a _negative_ question .br3.center.pa2.pt4.bg-green.white.f4[ are we dissatisfied with the suggestion that the pattern of results we have observed should be attributed to chance? ] .pt2[ - if we are, then maybe we can persuade you of a different explanation - but note that the different explanation is not _proven_, it's _suggested_ ] --- class: inverse, center, middle, animated, heartBeat # End --- # Acknowledgements - icons by Diego Lavecchia from the [Noun Project](https://thenounproject.com/)
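---
# Appendix: simulation vs. theory

A minimal sketch, not part of the original slides, that compares the simulated two-dice totals with the exact probabilities for fair dice. `dice()` is the function defined earlier; the seed value and the names `all_sums`, `expected`, `observed` and `comparison` are assumptions made for illustration.

```r
# Fix the random seed so the simulated numbers are the same on every run
# (the value 1 is arbitrary and only an assumption for this sketch)
set.seed(1)

# the dice() function from the slides: sum of `num` fair six-sided dice
dice <- function(num = 1) {
  sum(sample(1:6, num, replace = TRUE))
}

# every one of the 36 equally likely outcomes of two dice, as totals
all_sums <- outer(1:6, 1:6, "+")

# exact probability of each total from 2 to 12
expected <- table(all_sums) / 36

# simulated proportions from 10,000 throws of two dice
d <- replicate(10000, dice(2))
observed <- table(factor(d, levels = 2:12)) / length(d)

# side-by-side comparison, one row per total
comparison <- data.frame(
  total    = 2:12,
  expected = as.vector(expected),
  observed = as.vector(observed)
)
round(comparison, 3)
```

With a fixed seed the observed column is reproducible; without `set.seed()` it changes slightly on every run, which is why `d` on your machine won't match the numbers shown on the slides.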