For Loops & Bootstrapping1 / 38

The Problem

Copying and pasting code is not an efficient use of your time
Often you want to do the same thing but with different inputs:
- Get the mean of a variable for 20 different groups
- Plot different variables as your x-axis against the exact same y-axis
- Do the exact same analysis and make the same plots for the levels of an independent variable (e.g., patients and controls)

2 / 38

The Problem

Copying and pasting code is not an efficient use of your time
Often you want to do the same thing but with different inputs:
- Get the mean of a variable for 20 different groups
- Plot different variables as your x-axis against the exact same y-axis
- Do the exact same analysis and make the same plots for the levels of an independent variable (e.g., patients and controls)
In sum, you're trying to iterate (perform repeatedly)

2 / 38

How do we address this?

In this class, we are going to talk specifically about for loops. Why?

They are general purpose. You will find them in nearly every single programming language. If you decide to not use R and instead go to Python or Matlab or whatever, you'll still come across them.
This is a fundamental component of programming. I would guess that you learn this within the first 2 weeks of CS 131.

3 / 38

How do we address this?

In this class, we are going to talk specifically about for loops. Why?

They are general purpose. You will find them in nearly every single programming language. If you decide to not use R and instead go to Python or Matlab or whatever, you'll still come across them.
This is a fundamental component of programming. I would guess that you learn this within the first 2 weeks of CS 131.

There are other ways to do this that are R-specific. We are NOT going to cover these (counterintuitive, I know).

the apply family of functions, including lapply
using the purr package from the tidyverse

3 / 38

Lists

Lists are basically vectors but where every element can be a totally different data class. Ex:

myList <- list(6, head(iris), "hello world", c(1, 3, 5, 7, 9))
myList

## [[1]]
## [1] 6
## 
## [[2]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 
## [[3]]
## [1] "hello world"
## 
## [[4]]
## [1] 1 3 5 7 9

4 / 38

Lists

You can also name the elements in your list:

myList <- list(Numberz = 6, DFs = head(iris), Chars = "hello world", VectorKing = c(1, 3, 5, 7, 9))
myList

## $Numberz
## [1] 6
## 
## $DFs
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 
## $Chars
## [1] "hello world"
## 
## $VectorKing
## [1] 1 3 5 7 9

5 / 38

Lists

The weirdest thing about lists is accessing the elements within a list. Think of it like a book:

The element itself is like the chapter of a book
But there's a page that just says "Chapter 4" but doesn't contain any text
You need to tell R to go to the actual chapter

6 / 38

Lists

The weirdest thing about lists is accessing the elements within a list. Think of it like a book:

The element itself is like the chapter of a book
But there's a page that just says "Chapter 4" but doesn't contain any text
You need to tell R to go to the actual chapter

We can do this with double brackets 👇

myList[1]

## $Numberz
## [1] 6

myList[[1]]

## [1] 6

6 / 38

Lists

Why does this matter? Say we wanted to take our number 6 from our list and add it to the number 12

myList[1] + 12

## Error in myList[1] + 12: non-numeric argument to binary operator

myList[[1]] + 12

## [1] 18

7 / 38

Lists are weird

From Hadley Wickham, creator of tidyverse

8 / 38

Now on to `for loops`

for (i in 1:some number) {
  do something
}

"For each element in 1 through _, perform some function"

"Perform the function contained in this for loop on every single element in some list of elements"

9 / 38

`for loops`

for (i in 1:some number) {
  do something
}

The important parts:

You can iterate over anything: rows of a data.frame, columns of a data.frame, lists, or simple 1-dimensional vectors. The 1:some number portion are the elements you’re iterating over. If you wanted to do something 5 times, you could say 1:5. Most of the time, we don’t know that second number, though. So we can use a function. 1:nrow(data.frame) or 1:length(list) or 1:length(vector).

10 / 38

`for loops`

for (i in 1:some number) {
  do something
}

The important parts:

You can iterate over anything: rows of a data.frame, columns of a data.frame, lists, or simple 1-dimensional vectors. The 1:some number portion are the elements you’re iterating over. If you wanted to do something 5 times, you could say 1:5. Most of the time, we don’t know that second number, though. So we can use a function. 1:nrow(data.frame) or 1:length(list) or 1:length(vector).
The i stands for “each” and is the same type of i seen in equations. The top line then reads: “For each item/element in 1 through some number”.

10 / 38

`for loops`

for (i in 1:some number) {
  do something
}

The important parts:

You can iterate over anything: rows of a data.frame, columns of a data.frame, lists, or simple 1-dimensional vectors. The 1:some number portion are the elements you’re iterating over. If you wanted to do something 5 times, you could say 1:5. Most of the time, we don’t know that second number, though. So we can use a function. 1:nrow(data.frame) or 1:length(list) or 1:length(vector).
The i stands for “each” and is the same type of i seen in equations. The top line then reads: “For each item/element in 1 through some number”.
The part between the curly brackets { } is what you want to do (it’s the body of the for loop).

10 / 38

A Simple Example

For each number in 1 through 10, print the following: "(number) squared is (that number squared)"

So for the number 2, the output should be "2 squared is 4"

The functions we are going to use are print() and paste0().

for (i in 1:10) {
  print(paste0(i, " squared is ", i^2))
}

## [1] "1 squared is 1"
## [1] "2 squared is 4"
## [1] "3 squared is 9"
## [1] "4 squared is 16"
## [1] "5 squared is 25"
## [1] "6 squared is 36"
## [1] "7 squared is 49"
## [1] "8 squared is 64"
## [1] "9 squared is 81"
## [1] "10 squared is 100"

11 / 38

Storing the output

Is there any way for me to access any of the squared numbers from above? No!

If you want to store the output of a loop (which in almost all cases we do), you need to initialize an empty object. This means make a blank object before running your loop that will contain your stored results.

12 / 38

Storing the output

Is there any way for me to access any of the squared numbers from above? No!

Let's run the same loop as last time, but this time let's store the squared numbers, rather than just printing out some lines.

squaredNumbers <- NULL
for (i in 1:10) {
  squaredNumbers[i] <- i^2
}
squaredNumbers

##  [1]   1   4   9  16  25  36  49  64  81 100

12 / 38

Recap

The basic steps of constructing a for loop:

Figure out what it is that you want to iterate through (numbers, columns of a data.frame, data.frames, a list, etc.)
Think about what you want the output to look like, and initialize an empty object that can store the output
Type out all of the steps within the body of the loop

13 / 38

Applied Examples

What if you want to run the following models where the Y dependent variable stays the same, but the X independent variable changes? For this first example, let's say we only care about the $p$ -value.

lm(Sepal.Length ~ Sepal.Width, data = iris)
lm(Sepal.Length ~ Petal.Length, data = iris)
lm(Sepal.Length ~ Petal.Width, data = iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

14 / 38

Applied Examples

Thinking through what we need to happen:

We're iterating through specific columns of the iris data.frame -- so we need to get those column names into a form we can iterate through.
The output is a vector that contains $p$ -values only
The body of the loop needs to contain the lm() functions

15 / 38

Applied Example 1

varsToIterate <- colnames(iris)[2:4]
sigValues <- NULL
for (i in 1:length(varsToIterate)) {
  model <- lm(Sepal.Length ~ iris[,i], data = iris)
  model <- tidy(model)
  sig <- model[[2,5]]
  sigValues[i] <- sig
}
sigValues

## [1] 0.000000e+00 1.518983e-01 1.038667e-47

16 / 38

Applied Example 1

You can make comments within for loops

# get the column names in a format that we can iterate through
varsToIterate <- colnames(iris)[2:4]
# initialize an empty output vector
sigValues <- NULL
for (i in 1:length(varsToIterate)) {
  # run the model where i is the column that is varying
  model <- lm(Sepal.Length ~ iris[,i], data = iris)
  # the tidy function comes from the `broom` package
  model <- tidy(model)
  # find the p-value from the tidied model
  sig <- model[[2,5]]
  # store the p-value in the output
  sigValues[i] <- sig
}
# print the output
sigValues

## [1] 0.000000e+00 1.518983e-01 1.038667e-47

17 / 38

Applied Example 2

Now, what if you want to store the entire output of the model, not just the $p$ -value? The output of tidy(model) is a data.frame. So the result of our loop will now be a list of data.frames, rather than a vector of $p$ -values. Note the double brackets in modelList[[i]]!

varsToIterate <- colnames(iris)[2:4]
modelList <- list()
for (i in 1:length(varsToIterate)) {
  model <- lm(Sepal.Length ~ iris[,i], data = iris)
  model <- tidy(model)
  modelList[[i]] <- model
}

18 / 38

Applied Example 2

print(modelList)

## [[1]]
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        0  3.79e-17   0             1
## 2 iris[, i]          1  6.43e-18   1.56e17       0
## 
## [[2]]
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    6.53      0.479     13.6  6.47e-28
## 2 iris[, i]     -0.223     0.155     -1.44 1.52e- 1
## 
## [[3]]
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    4.31     0.0784      54.9 2.43e-100
## 2 iris[, i]      0.409    0.0189      21.6 1.04e- 47

19 / 38

Applied Example 2.1

Let's modify this slightly so that each output data.frame has a name associated with it (rather than 1, 2, 3)

varsToIterate <- colnames(iris)[2:4]
modelList <- list()
for (i in 1:length(varsToIterate)) {
  name <- paste0(varsToIterate[i])
  model <- lm(Sepal.Length ~ iris[,i], data = iris)
  model <- tidy(model)
  modelList[[name]] <- model
}

20 / 38

Applied Example 2.1

print(modelList)

## $Sepal.Width
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        0  3.79e-17   0             1
## 2 iris[, i]          1  6.43e-18   1.56e17       0
## 
## $Petal.Length
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    6.53      0.479     13.6  6.47e-28
## 2 iris[, i]     -0.223     0.155     -1.44 1.52e- 1
## 
## $Petal.Width
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    4.31     0.0784      54.9 2.43e-100
## 2 iris[, i]      0.409    0.0189      21.6 1.04e- 47

21 / 38

Applied Example 3

Let's do the same thing but with some modifications:

Let's also plot X & Y so we have a figure that corresponds with each model
Instead of the output being a list of the different models, what if we wanted the information from all of the models to be contained within a data.frame?

22 / 38

Applied Example 3

Let's do the same thing but with some modifications:

Let's also plot X & Y so we have a figure that corresponds with each model
Instead of the output being a list of the different models, what if we wanted the information from all of the models to be contained within a data.frame?

When it comes to looping through plots, there are a few odd things:

the sym() function will take the quotes off of a string so that it can be evaluated properly
!! says "actually evaluate what a variable stands for". This will make sense when you see it in the code.

22 / 38

Applied Example 3

We want to store the model outputs AND a list of plots. So we need to initialize 2 things

modelDF <- data.frame() # for models
plotList <- list() # for plots
for (i in 1:length(varsToIterate)) {
  nameX <- paste0(varsToIterate[i]) # a character string for labels
  nameY <- "Sepal.Length" # doesn't change in the loop!
  # make the models
  model <- lm(Sepal.Length ~ iris[,i], data = iris)
  model <- tidy(model)
  # add a column that repeats whatever nameX is
  # for us, this will make it easier to keep track of what "i" is
  model$predictor <- rep(nameX, times = nrow(model))
  # now bind the current model underneath the previous model
  # so that it's all contained within the same data.frame
  modelDF <- rbind(modelDF, model)
  # now make the plots
  nameXplot <- sym(varsToIterate[i])
  plotList[[i]] <- ggplot(data = iris, 
                          aes(x = !! nameXplot,
                              y = Sepal.Length)) +
    geom_point(color = "cornflowerblue") +
    labs(title = paste0(nameX, " by ", nameY),
         x = nameX,
         y = nameY)
}

23 / 38

Applied Example 3

modelDF

## # A tibble: 6 x 6
##   term        estimate std.error statistic   p.value predictor   
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>       
## 1 (Intercept)    0      3.79e-17   0       1   e+  0 Sepal.Width 
## 2 iris[, i]      1      6.43e-18   1.56e17 0         Sepal.Width 
## 3 (Intercept)    6.53   4.79e- 1   1.36e 1 6.47e- 28 Petal.Length
## 4 iris[, i]     -0.223  1.55e- 1  -1.44e 0 1.52e-  1 Petal.Length
## 5 (Intercept)    4.31   7.84e- 2   5.49e 1 2.43e-100 Petal.Width 
## 6 iris[, i]      0.409  1.89e- 2   2.16e 1 1.04e- 47 Petal.Width

24 / 38

Applied Example 3

## [[1]]

## 
## [[2]]

## [[1]]

25 / 38

Bootstrapping

to get oneself out of a situation using existing resources

26 / 38

Bootstrapping

to get oneself out of a situation using existing resources

In statistics...any test or metric that uses random sampling with replacement

26 / 38

Bootstrapping

to get oneself out of a situation using existing resources

In statistics...any test or metric that uses random sampling with replacement

a bootstrapped mean
bootstrapped confidence intervals
bootstrap anything your heart desires!

26 / 38

Bootstrapping

to get oneself out of a situation using existing resources

In statistics...any test or metric that uses random sampling with replacement

a bootstrapped mean
bootstrapped confidence intervals
bootstrap anything your heart desires!

SAMPLING WITH REPLACEMENT

26 / 38

BootstrappingAlternative to traditional NHST (considered a "resampling method")
Previously, we had to use equations, formulas, and a lot of assumptions to get our sampling distribution
But what if we don't know the theoretical sampling distribution or we can't verify it? 
27 / 38

Bootstrapping

Alternative to traditional NHST (considered a "resampling method")
Previously, we had to use equations, formulas, and a lot of assumptions to get our sampling distribution
But what if we don't know the theoretical sampling distribution or we can't verify it?
We build the sampling distribution empirically by random sampling with replacement from the sample
Easier to interpret and more robust because we're not relying on a bunch of assumptions/equations to estimate our sampling distribution -- we actually created the sampling distribution instead!
- robust in statistics can mean different things. Here we're saying that it doesn't matter if a small thing changes in the dataset, because the bootstrapped mean will be resilient to those small changes.

27 / 38

Illustration

Imagine you had a sample of 6 people that happen to live in NYC in the mid 1990's. Let's just say their names are Rachel, Monica, Phoebe, Joey, Chandler, and Ross. Maybe we have the height of each of these 6 individuals (in inches). But now let's say we want to get a sense of the average height in a person in their mid 20's in the 1990's in NYC. We can bootstrap the mean height. That is, we are going to draw from this group many samples of 6 people with replacement, each time calculating the average height of the sample.

## [1] "Monica"   "Chandler" "Phoebe"   "Monica"   "Chandler" "Phoebe"

## [1] 68.33333

## [1] "Monica" "Rachel" "Ross"   "Phoebe" "Phoebe" "Ross"

## [1] 68.66667

## [1] "Monica"   "Ross"     "Chandler" "Joey"     "Chandler" "Ross"

## [1] 70.83333

## [1] "Phoebe" "Phoebe" "Joey"   "Phoebe" "Phoebe" "Ross"

## [1] 69.16667

## [1] "Phoebe"   "Chandler" "Phoebe"   "Phoebe"   "Ross"     "Joey"

## [1] 69.83333

## [1] "Monica"   "Chandler" "Monica"   "Monica"   "Chandler" "Phoebe"

## [1] 67.83333

## [1] "Monica"   "Ross"     "Phoebe"   "Monica"   "Chandler" "Chandler"

## [1] 69.16667

## [1] "Ross"   "Joey"   "Ross"   "Ross"   "Phoebe" "Phoebe"

## [1] 70.83333

28 / 38

Illustration

boot = 10000
friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
heights = c(65, 65, 68, 70, 72, 73)
sample_means = NULL
for(i in 1:boot){
  this_sample = sample(heights, size = length(heights), replace = TRUE)
  sample_means[i] = mean(this_sample)
}
mean(sample_means)

## [1] 68.85162

29 / 38

Illustration

30 / 38

Bootstrapped Mean

Let's say we have a sample of 100 participants that complete an IQ test. IQ tests typically have a mean of 100 and a standard deviation of 15. We want to get the mean IQ of our sample of 100 participants, but we want it to be a robust mean -- that is, we want to be pretty darn confident in our mean. What should we do?

31 / 38

Bootstrapped Mean

We can bootstrap our mean! And we can do that using a for loop:

for each iteration of ## of iterations...
- Randomly sample FROM your current sample (so choose 100 from your 100)
- But when choosing your new 100 data points, you are going to REPLACE each participant
- This means Participant #1 could be chosen many times
- On the next iteration (e.g., i + 1) do another round of choose 100 from your original 100 with replacement
- etc.
After we run all of our iterations, then we will get the mean of means

31 / 38

Bootstrapping

This whole process sounds like something you are already familiar with -- sampling distributions

You can think of bootstrapping as building up your sampling distribution for whatever statistic you want. But instead of repeating your experiment 1000x, you're using a random sample of your current experiment.

32 / 38

Bootstrapped Means Example

iqs

##  [1]  96 106 160 115 117 164 127  75  92 100 150 124 125 116  96 167 128  54 134
## [20]  99  81 106  82  91  94  62 138 118  79 151 126 104 140 139 138 134 130 111
## [39] 104 102  92 107  75 178 149  79 101  99 136 110

33 / 38

Bootstrapped Means Example

bsMeans <- NULL
for (i in 1:100) {
  subsample <- sample(x = iqs,
                      size = 50,
                      replace = T)
  bsMeans[i] <- mean(subsample)
}

34 / 38

Bootstrapped Means Example

Let's look at our vector of means

bsMeans

##   [1] 108.24 116.12 103.58 110.66 119.66 111.28 124.18 114.98 115.40 114.98
##  [11] 110.76 116.42 113.02 111.06 112.32 108.14 111.30 115.70 115.96 121.50
##  [21] 112.78 110.96 108.40 108.22 122.12 108.84 119.16 118.40 114.32 112.78
##  [31] 114.36 113.78 117.92 121.16 112.54 116.14 108.82 120.72 115.00 116.30
##  [41] 112.56 111.24 112.28 111.98 107.18 114.52 114.58 113.20 112.62 114.42
##  [51] 122.10 117.58 107.96 111.52 117.52 107.58 107.36 115.54 109.72 115.74
##  [61] 116.36 114.90 112.56 110.34 112.04 111.72 110.42 114.38 119.72 115.62
##  [71] 116.48 118.86 119.82 114.64 113.00 116.86 109.86 114.36 109.26 115.66
##  [81] 108.84 116.18 113.70 118.86 111.24 112.08 108.12 119.14 119.12 111.92
##  [91] 113.32 113.90 120.70 121.54 108.76 107.96 107.82 108.14 115.66 111.90

35 / 38

Bootstrapped Means Example

Now, get the mean of our vector of means...this is our bootstrapped mean

mean(bsMeans)

## [1] 113.8094

36 / 38

Bootstrapped Means Example

Now, get the mean of our vector of means...this is our bootstrapped mean

mean(bsMeans)

## [1] 113.8094

When I made these fake IQ scores, I set the "true" population mean to be 113.

Our mean of the original sample was mean(iqs) = 114.02

Our bootstrapped mean is mean(bsMeans) = 113.8094

36 / 38

Bootstrapped Means Example

Now, get the mean of our vector of means...this is our bootstrapped mean

mean(bsMeans)

## [1] 113.8094

When I made these fake IQ scores, I set the "true" population mean to be 113.

Our mean of the original sample was mean(iqs) = 114.02

Our bootstrapped mean is mean(bsMeans) = 113.8094

The bootstrapped mean is closer to the true population mean -- it's more robust, and therefore we trust it more.

36 / 38

What can you bootstrap?

Any statistic!

central tendency (means, medians, modes)
dispersion (variances, standard deviations)
other estimates (confidence intervals, reliability coefficients, correlations, etc.)

Models!

on each iteration where you draw with replacement, run your model
then take the mean of all the coefficients (like regression coefficients, t-values, etc.)

what can't you bootstrap?

37 / 38

Final ThoughtsCode is supposed to make your life EASIER. Use for loops to your advantage! That means if you find yourself copying/pasting the same thing a billion times with only minor changes, there’s likely a much simpler way of doing everything you need all in one go.check out nested if/else statements in the practice set! 
for R specific functions, check out the lapply() function and the purr package
when you are first writing your for loop and testing things, use really small iterations; once it's working properly, then you can run a ton of iterations

Bootstrapping is just a for loop. You can bootstrap anything your heart desires. 
The biggest piece of advice I have is to think carefully about what you want the end result to look like and then work backwards. Don’t just start doing stuff to your dataset because you think that’s what your supposed to do. Think “in order to make this plot, I need my data in this particular format, what do I need to do to get there?”
38 / 38

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

For Loops & Bootstrapping

The Problem

The Problem

How do we address this?

How do we address this?

Lists

Lists

Lists

Lists

Lists

Lists are weird

Now on to for loops

for loops

for loops

for loops

A Simple Example

Storing the output

Storing the output

Recap

Applied Examples

Applied Examples

Applied Example 1

Applied Example 1

Applied Example 2

Applied Example 2

Applied Example 2.1

Applied Example 2.1

Applied Example 3

Applied Example 3

Applied Example 3

Applied Example 3

Applied Example 3

Bootstrapping

Bootstrapping

Bootstrapping

Bootstrapping

Bootstrapping

Bootstrapping

Illustration

Illustration

Illustration

Bootstrapped Mean

Bootstrapped Mean

Bootstrapping

Bootstrapped Means Example

Bootstrapped Means Example

Bootstrapped Means Example

Bootstrapped Means Example

Bootstrapped Means Example

Bootstrapped Means Example

What can you bootstrap?

Final Thoughts

The Problem

Help

Now on to `for loops`

`for loops`

`for loops`

`for loops`