Copying and pasting code is not an efficient use of your time
Often you want to do the same thing but with different inputs:
Copying and pasting code is not an efficient use of your time
Often you want to do the same thing but with different inputs:
In sum, you're trying to iterate (perform repeatedly)
In this class, we are going to talk specifically about for loops
. Why?
R
and instead go to Python
or Matlab
or whatever, you'll still come across them.In this class, we are going to talk specifically about for loops
. Why?
R
and instead go to Python
or Matlab
or whatever, you'll still come across them.There are other ways to do this that are R
-specific. We are NOT going to cover these (counterintuitive, I know).
apply
family of functions, including lapply
purr
package from the tidyverse
Lists are basically vectors but where every element can be a totally different data class. Ex:
myList <- list(6, head(iris), "hello world", c(1, 3, 5, 7, 9))myList
## [[1]]## [1] 6## ## [[2]]## Sepal.Length Sepal.Width Petal.Length Petal.Width Species## 1 5.1 3.5 1.4 0.2 setosa## 2 4.9 3.0 1.4 0.2 setosa## 3 4.7 3.2 1.3 0.2 setosa## 4 4.6 3.1 1.5 0.2 setosa## 5 5.0 3.6 1.4 0.2 setosa## 6 5.4 3.9 1.7 0.4 setosa## ## [[3]]## [1] "hello world"## ## [[4]]## [1] 1 3 5 7 9
You can also name the elements in your list:
myList <- list(Numberz = 6, DFs = head(iris), Chars = "hello world", VectorKing = c(1, 3, 5, 7, 9))myList
## $Numberz## [1] 6## ## $DFs## Sepal.Length Sepal.Width Petal.Length Petal.Width Species## 1 5.1 3.5 1.4 0.2 setosa## 2 4.9 3.0 1.4 0.2 setosa## 3 4.7 3.2 1.3 0.2 setosa## 4 4.6 3.1 1.5 0.2 setosa## 5 5.0 3.6 1.4 0.2 setosa## 6 5.4 3.9 1.7 0.4 setosa## ## $Chars## [1] "hello world"## ## $VectorKing## [1] 1 3 5 7 9
The weirdest thing about lists is accessing the elements within a list. Think of it like a book:
R
to go to the actual chapterThe weirdest thing about lists is accessing the elements within a list. Think of it like a book:
R
to go to the actual chapterWe can do this with double brackets đ
myList[1]
## $Numberz## [1] 6
myList[[1]]
## [1] 6
Why does this matter? Say we wanted to take our number 6
from our list and add it to the number 12
myList[1] + 12
## Error in myList[1] + 12: non-numeric argument to binary operator
myList[[1]] + 12
## [1] 18
From Hadley Wickham, creator of tidyverse
for loops
for (i in 1:some number) { do something}
"For each element in 1 through _, perform some function"
"Perform the function contained in this for loop on every single element in some list of elements"
for loops
for (i in 1:some number) { do something}
The important parts:
1:some number
portion are the elements youâre iterating over. If you wanted to do something 5 times, you could say 1:5
. Most of the time, we donât know that second number, though. So we can use a function. 1:nrow(data.frame)
or 1:length(list)
or 1:length(vector)
.for loops
for (i in 1:some number) { do something}
The important parts:
You can iterate over anything: rows of a data.frame, columns of a data.frame, lists, or simple 1-dimensional vectors. The 1:some number
portion are the elements youâre iterating over. If you wanted to do something 5 times, you could say 1:5
. Most of the time, we donât know that second number, though. So we can use a function. 1:nrow(data.frame)
or 1:length(list)
or 1:length(vector)
.
The i
stands for âeachâ and is the same type of i
seen in equations. The top line then reads: âFor each item/element in 1 through some numberâ.
for loops
for (i in 1:some number) { do something}
The important parts:
You can iterate over anything: rows of a data.frame, columns of a data.frame, lists, or simple 1-dimensional vectors. The 1:some number
portion are the elements youâre iterating over. If you wanted to do something 5 times, you could say 1:5
. Most of the time, we donât know that second number, though. So we can use a function. 1:nrow(data.frame)
or 1:length(list)
or 1:length(vector)
.
The i
stands for âeachâ and is the same type of i
seen in equations. The top line then reads: âFor each item/element in 1 through some numberâ.
The part between the curly brackets { }
is what you want to do (itâs the body of the for loop).
For each number in 1 through 10, print the following: "(number) squared is (that number squared)"
So for the number 2, the output should be "2 squared is 4"
The functions we are going to use are print()
and paste0()
.
for (i in 1:10) { print(paste0(i, " squared is ", i^2))}
## [1] "1 squared is 1"## [1] "2 squared is 4"## [1] "3 squared is 9"## [1] "4 squared is 16"## [1] "5 squared is 25"## [1] "6 squared is 36"## [1] "7 squared is 49"## [1] "8 squared is 64"## [1] "9 squared is 81"## [1] "10 squared is 100"
Is there any way for me to access any of the squared numbers from above? No!
If you want to store the output of a loop (which in almost all cases we do), you need to initialize an empty object. This means make a blank object before running your loop that will contain your stored results.
Is there any way for me to access any of the squared numbers from above? No!
If you want to store the output of a loop (which in almost all cases we do), you need to initialize an empty object. This means make a blank object before running your loop that will contain your stored results.
Let's run the same loop as last time, but this time let's store the squared numbers, rather than just printing out some lines.
squaredNumbers <- NULLfor (i in 1:10) { squaredNumbers[i] <- i^2}squaredNumbers
## [1] 1 4 9 16 25 36 49 64 81 100
The basic steps of constructing a for loop
:
Figure out what it is that you want to iterate through (numbers, columns of a data.frame, data.frames, a list, etc.)
Think about what you want the output to look like, and initialize an empty object that can store the output
Type out all of the steps within the body of the loop
What if you want to run the following models where the Y
dependent variable stays the same, but the X
independent variable changes? For this first example, let's say we only care about the p-value.
lm(Sepal.Length ~ Sepal.Width, data = iris)
lm(Sepal.Length ~ Petal.Length, data = iris)
lm(Sepal.Length ~ Petal.Width, data = iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species## 1 5.1 3.5 1.4 0.2 setosa## 2 4.9 3.0 1.4 0.2 setosa## 3 4.7 3.2 1.3 0.2 setosa## 4 4.6 3.1 1.5 0.2 setosa## 5 5.0 3.6 1.4 0.2 setosa## 6 5.4 3.9 1.7 0.4 setosa
Thinking through what we need to happen:
We're iterating through specific columns of the iris
data.frame -- so we need to get those column names into a form we can iterate through.
The output is a vector that contains p-values only
The body of the loop needs to contain the lm()
functions
varsToIterate <- colnames(iris)[2:4]sigValues <- NULLfor (i in 1:length(varsToIterate)) { model <- lm(Sepal.Length ~ iris[,i], data = iris) model <- tidy(model) sig <- model[[2,5]] sigValues[i] <- sig}sigValues
## [1] 0.000000e+00 1.518983e-01 1.038667e-47
You can make comments within for loops
# get the column names in a format that we can iterate throughvarsToIterate <- colnames(iris)[2:4]# initialize an empty output vectorsigValues <- NULLfor (i in 1:length(varsToIterate)) { # run the model where i is the column that is varying model <- lm(Sepal.Length ~ iris[,i], data = iris) # the tidy function comes from the `broom` package model <- tidy(model) # find the p-value from the tidied model sig <- model[[2,5]] # store the p-value in the output sigValues[i] <- sig}# print the outputsigValues
## [1] 0.000000e+00 1.518983e-01 1.038667e-47
Now, what if you want to store the entire output of the model, not just the p-value? The output of tidy(model)
is a data.frame. So the result of our loop will now be a list of data.frames, rather than a vector of p-values. Note the double brackets in modelList[[i]]
!
varsToIterate <- colnames(iris)[2:4]modelList <- list()for (i in 1:length(varsToIterate)) { model <- lm(Sepal.Length ~ iris[,i], data = iris) model <- tidy(model) modelList[[i]] <- model}
print(modelList)
## [[1]]## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 0 3.79e-17 0 1## 2 iris[, i] 1 6.43e-18 1.56e17 0## ## [[2]]## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 6.53 0.479 13.6 6.47e-28## 2 iris[, i] -0.223 0.155 -1.44 1.52e- 1## ## [[3]]## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 4.31 0.0784 54.9 2.43e-100## 2 iris[, i] 0.409 0.0189 21.6 1.04e- 47
Let's modify this slightly so that each output data.frame has a name associated with it (rather than 1, 2, 3)
varsToIterate <- colnames(iris)[2:4]modelList <- list()for (i in 1:length(varsToIterate)) { name <- paste0(varsToIterate[i]) model <- lm(Sepal.Length ~ iris[,i], data = iris) model <- tidy(model) modelList[[name]] <- model}
print(modelList)
## $Sepal.Width## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 0 3.79e-17 0 1## 2 iris[, i] 1 6.43e-18 1.56e17 0## ## $Petal.Length## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 6.53 0.479 13.6 6.47e-28## 2 iris[, i] -0.223 0.155 -1.44 1.52e- 1## ## $Petal.Width## # A tibble: 2 x 5## term estimate std.error statistic p.value## <chr> <dbl> <dbl> <dbl> <dbl>## 1 (Intercept) 4.31 0.0784 54.9 2.43e-100## 2 iris[, i] 0.409 0.0189 21.6 1.04e- 47
Let's do the same thing but with some modifications:
Let's also plot X & Y so we have a figure that corresponds with each model
Instead of the output being a list of the different models, what if we wanted the information from all of the models to be contained within a data.frame?
Let's do the same thing but with some modifications:
Let's also plot X & Y so we have a figure that corresponds with each model
Instead of the output being a list of the different models, what if we wanted the information from all of the models to be contained within a data.frame?
When it comes to looping through plots, there are a few odd things:
sym()
function will take the quotes off of a string so that it can be evaluated properly!!
says "actually evaluate what a variable stands for". This will make sense when you see it in the code.We want to store the model outputs AND a list of plots. So we need to initialize 2 things
modelDF <- data.frame() # for modelsplotList <- list() # for plotsfor (i in 1:length(varsToIterate)) { nameX <- paste0(varsToIterate[i]) # a character string for labels nameY <- "Sepal.Length" # doesn't change in the loop! # make the models model <- lm(Sepal.Length ~ iris[,i], data = iris) model <- tidy(model) # add a column that repeats whatever nameX is # for us, this will make it easier to keep track of what "i" is model$predictor <- rep(nameX, times = nrow(model)) # now bind the current model underneath the previous model # so that it's all contained within the same data.frame modelDF <- rbind(modelDF, model) # now make the plots nameXplot <- sym(varsToIterate[i]) plotList[[i]] <- ggplot(data = iris, aes(x = !! nameXplot, y = Sepal.Length)) + geom_point(color = "cornflowerblue") + labs(title = paste0(nameX, " by ", nameY), x = nameX, y = nameY)}
modelDF
## # A tibble: 6 x 6## term estimate std.error statistic p.value predictor ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 (Intercept) 0 3.79e-17 0 1 e+ 0 Sepal.Width ## 2 iris[, i] 1 6.43e-18 1.56e17 0 Sepal.Width ## 3 (Intercept) 6.53 4.79e- 1 1.36e 1 6.47e- 28 Petal.Length## 4 iris[, i] -0.223 1.55e- 1 -1.44e 0 1.52e- 1 Petal.Length## 5 (Intercept) 4.31 7.84e- 2 5.49e 1 2.43e-100 Petal.Width ## 6 iris[, i] 0.409 1.89e- 2 2.16e 1 1.04e- 47 Petal.Width
## [[1]]
## ## [[2]]
## [[1]]
to get oneself out of a situation using existing resources
to get oneself out of a situation using existing resources
In statistics...any test or metric that uses random sampling with replacement
to get oneself out of a situation using existing resources
In statistics...any test or metric that uses random sampling with replacement
to get oneself out of a situation using existing resources
In statistics...any test or metric that uses random sampling with replacement
SAMPLING WITH REPLACEMENT
But what if we don't know the theoretical sampling distribution or we can't verify it?
We build the sampling distribution empirically by random sampling with replacement from the sample
Imagine you had a sample of 6 people that happen to live in NYC in the mid 1990's. Let's just say their names are Rachel, Monica, Phoebe, Joey, Chandler, and Ross. Maybe we have the height of each of these 6 individuals (in inches). But now let's say we want to get a sense of the average height in a person in their mid 20's in the 1990's in NYC. We can bootstrap the mean height. That is, we are going to draw from this group many samples of 6 people with replacement, each time calculating the average height of the sample.
## [1] "Monica" "Chandler" "Phoebe" "Monica" "Chandler" "Phoebe"
## [1] 68.33333
## [1] "Monica" "Rachel" "Ross" "Phoebe" "Phoebe" "Ross"
## [1] 68.66667
## [1] "Monica" "Ross" "Chandler" "Joey" "Chandler" "Ross"
## [1] 70.83333
## [1] "Phoebe" "Phoebe" "Joey" "Phoebe" "Phoebe" "Ross"
## [1] 69.16667
## [1] "Phoebe" "Chandler" "Phoebe" "Phoebe" "Ross" "Joey"
## [1] 69.83333
## [1] "Monica" "Chandler" "Monica" "Monica" "Chandler" "Phoebe"
## [1] 67.83333
## [1] "Monica" "Ross" "Phoebe" "Monica" "Chandler" "Chandler"
## [1] 69.16667
## [1] "Ross" "Joey" "Ross" "Ross" "Phoebe" "Phoebe"
## [1] 70.83333
boot = 10000friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")heights = c(65, 65, 68, 70, 72, 73)sample_means = NULLfor(i in 1:boot){ this_sample = sample(heights, size = length(heights), replace = TRUE) sample_means[i] = mean(this_sample)}mean(sample_means)
## [1] 68.85162
Let's say we have a sample of 100 participants that complete an IQ test. IQ tests typically have a mean of 100 and a standard deviation of 15. We want to get the mean IQ of our sample of 100 participants, but we want it to be a robust mean -- that is, we want to be pretty darn confident in our mean. What should we do?
Let's say we have a sample of 100 participants that complete an IQ test. IQ tests typically have a mean of 100 and a standard deviation of 15. We want to get the mean IQ of our sample of 100 participants, but we want it to be a robust mean -- that is, we want to be pretty darn confident in our mean. What should we do?
We can bootstrap our mean! And we can do that using a for loop
:
i + 1
) do another round of choose 100 from your original 100 with replacementThis whole process sounds like something you are already familiar with -- sampling distributions
You can think of bootstrapping as building up your sampling distribution for whatever statistic you want. But instead of repeating your experiment 1000x, you're using a random sample of your current experiment.
iqs
## [1] 96 106 160 115 117 164 127 75 92 100 150 124 125 116 96 167 128 54 134## [20] 99 81 106 82 91 94 62 138 118 79 151 126 104 140 139 138 134 130 111## [39] 104 102 92 107 75 178 149 79 101 99 136 110
bsMeans <- NULLfor (i in 1:100) { subsample <- sample(x = iqs, size = 50, replace = T) bsMeans[i] <- mean(subsample)}
Let's look at our vector of means
bsMeans
## [1] 108.24 116.12 103.58 110.66 119.66 111.28 124.18 114.98 115.40 114.98## [11] 110.76 116.42 113.02 111.06 112.32 108.14 111.30 115.70 115.96 121.50## [21] 112.78 110.96 108.40 108.22 122.12 108.84 119.16 118.40 114.32 112.78## [31] 114.36 113.78 117.92 121.16 112.54 116.14 108.82 120.72 115.00 116.30## [41] 112.56 111.24 112.28 111.98 107.18 114.52 114.58 113.20 112.62 114.42## [51] 122.10 117.58 107.96 111.52 117.52 107.58 107.36 115.54 109.72 115.74## [61] 116.36 114.90 112.56 110.34 112.04 111.72 110.42 114.38 119.72 115.62## [71] 116.48 118.86 119.82 114.64 113.00 116.86 109.86 114.36 109.26 115.66## [81] 108.84 116.18 113.70 118.86 111.24 112.08 108.12 119.14 119.12 111.92## [91] 113.32 113.90 120.70 121.54 108.76 107.96 107.82 108.14 115.66 111.90
Now, get the mean of our vector of means...this is our bootstrapped mean
mean(bsMeans)
## [1] 113.8094
Now, get the mean of our vector of means...this is our bootstrapped mean
mean(bsMeans)
## [1] 113.8094
When I made these fake IQ scores, I set the "true" population mean to be 113.
Our mean of the original sample was mean(iqs)
= 114.02
Our bootstrapped mean is mean(bsMeans)
= 113.8094
Now, get the mean of our vector of means...this is our bootstrapped mean
mean(bsMeans)
## [1] 113.8094
When I made these fake IQ scores, I set the "true" population mean to be 113.
Our mean of the original sample was mean(iqs)
= 114.02
Our bootstrapped mean is mean(bsMeans)
= 113.8094
The bootstrapped mean is closer to the true population mean -- it's more robust, and therefore we trust it more.
Any statistic!
Models!
what can't you bootstrap?
for loops
to your advantage! That means if you find yourself copying/pasting the same thing a billion times with only minor changes, thereâs likely a much simpler way of doing everything you need all in one go.if/else
statements in the practice set! R
specific functions, check out the lapply()
function and the purr
packagefor loop
and testing things, use really small iterations; once it's working properly, then you can run a ton of iterationsfor loop
. You can bootstrap anything your heart desires. Copying and pasting code is not an efficient use of your time
Often you want to do the same thing but with different inputs:
Keyboard shortcuts
â, â, Pg Up, k | Go to previous slide |
â, â, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |