Practice Set: For Loops

For Loops

For loops are important for completing iterative actions in an efficient manner. The syntax for for loops looks like the following:

for (i in 1:some_number) {
  # do something (one or more commands go here)
}

The 1:some_number portion specifies the elements you’re iterating over. If you wanted to do something 5 times, you could write 1:5.

FYI: You can iterate over lots of things! This includes things like:

  • vectors
  • rows of a data.frame
  • columns of a data.frame
  • lists

The i is the loop variable: on each pass through the loop, it takes the value of the current item. So the first line in the example code above reads as: For each item i in 1 through some_number…

Lastly, the section between the curly brackets {} is the body of the for loop. This area includes all of the commands that will be executed with every iteration of the loop.
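
To make the pieces above concrete, here is a minimal sketch of looping over a few of the object types listed earlier. The object names (total, things, toy_df) and their values are made up for illustration.

```r
# Iterate over a numeric vector and keep a running total
total <- 0
for (x in c(2, 4, 6)) {
  total <- total + x
}
print(total)

# Iterate over the elements of a list (elements can differ in type)
things <- list("a", 1:3, TRUE)
for (item in things) {
  print(class(item))
}

# Iterate over the columns of a data.frame by name
# (toy_df and its columns are hypothetical)
toy_df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))
for (col_name in names(toy_df)) {
  cat(col_name, "has mean", mean(toy_df[[col_name]]), "\n")
}

# Iterate over the rows of a data.frame by row index
for (row in seq_len(nrow(toy_df))) {
  cat("Row", row, "has height", toy_df$height[row], "\n")
}
```

Note that the loop variable does not have to be called i. Any valid name works, and a descriptive one often makes the loop easier to read.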

Most of the time, we want to store the outcome of whatever commands occur during the body of the for loop. In order to do this, we have to initialize an object.

Quick Quiz

Drew made the following for loop with the goal of squaring all of the numbers between 1 and 5, and then printing them as a single vector (1 4 9 16 25) to the console. She made a mistake though.

my_object <- NULL

for (i in 1:5) {
  my_object <- i^2
}

print(my_object)

You try!

Introducing if/else statements

In programming languages, if/else statements handle conditional situations. If a given boolean statement evaluates to TRUE, one block of code is executed. If it instead evaluates to FALSE, a different block of code (the “else” block) is executed.

Below is the general format for if/else statements:

if(boolean statement) {
  # code to be executed if boolean is true
} else {
  # code to be executed if boolean is false
}
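
As a quick illustration of the template above, here is a small sketch that labels a single number as even or odd. The variable names number and parity are made up for this example.

```r
# number is a made-up value for illustration
number <- 7

if (number %% 2 == 0) {
  # Runs only when the remainder after dividing by 2 is 0
  parity <- "even"
} else {
  # Runs in every other case
  parity <- "odd"
}

print(parity)
## [1] "odd"
```

Keep in mind that the condition inside if() should be a single TRUE or FALSE value, which is why if/else is usually paired with a loop when you need to check every element of a vector.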

What situations will you need to use if/else statements for?

If/else is useful for recoding and sorting data based on conditional rules. If/else statements also combine well with for loops, which lets you apply a rule to every element of an object. Take the following example: Ten students were asked how much they like or dislike provel cheese on a scale of 1 to 10 (10 = love it). We want to recode this numeric scale so that values of 5 or below are labeled “dislike” and values of 6 or above are labeled “like”.

# Student ratings (numeric)
provel_cheese_rating <- c(1, 1, 9, 3, 5, 4, 7, 1, 3, 10)

# Initialize like vs. dislike vector
like_dislike <- c()

for(i in provel_cheese_rating) {
  if(i <= 5) {
    like_dislike <- append(like_dislike, "dislike")
  } else {
    like_dislike <- append(like_dislike, "like")
  }
} 

# View outcome 
print(like_dislike)
##  [1] "dislike" "dislike" "like"    "dislike" "dislike" "dislike" "like"   
##  [8] "dislike" "dislike" "like"

Using a combination of a for loop and an if/else statement, we were able to recode the vector. Now we know where the students stand on the provel issue…

FYI: There is an additional feature of if/else statements called else if. This feature can be used for more complex conditional situations.

What if a simple binary conditional statement won’t cut it? For example, maybe we want to entirely change a series of character labels. Here, we can use else if to specify all possibilities.

(Note that, realistically, we would probably never write out every animal name like this. Maybe instead we would reference a database and search for items, and then label their class, etc.)

# Animals
animals <- c("dog", "cat", "iguana", "frog", "cardinal")

# Initialize classes of animals vector
animals_class <- c()

for(i in animals) {
  if(i == "dog") {
    animals_class <- append(animals_class, "mammal")
  } else if(i == "cat") {
    animals_class <- append(animals_class, "mammal")
  } else if(i == "iguana") {
    animals_class <- append(animals_class, "reptile")
  } else if(i == "frog") {
    animals_class <- append(animals_class, "amphibian")
  } else {
    animals_class <- append(animals_class, "bird")
  }
} 

# View outcome 
print(animals_class)
## [1] "mammal"    "mammal"    "reptile"   "amphibian" "bird"

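FYI: R has a built-in function called ifelse(). It works like a simple if/else statement, but it is vectorized: it checks the condition for every element of a vector at once and returns one of two values for each element. It only handles two outcomes, though. That is, it has no equivalent of else if for multiple conditional statements.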

I have unfortunately seen people nest ifelse() statements. This is incredibly bad form! If you have to do this, it is worth your time to write out a full if/else if/else statement!
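
For the two-outcome case, here is a minimal sketch of ifelse() applied to the provel ratings from earlier. The name like_dislike_vec is made up for this example; the three arguments are the condition, the value to use where it is TRUE, and the value to use where it is FALSE.

```r
# Student ratings (numeric), same values as in the earlier example
provel_cheese_rating <- c(1, 1, 9, 3, 5, 4, 7, 1, 3, 10)

# ifelse(test, yes, no) is applied to every element, so no loop is needed
like_dislike_vec <- ifelse(provel_cheese_rating <= 5, "dislike", "like")

print(like_dislike_vec)
##  [1] "dislike" "dislike" "like"    "dislike" "dislike" "dislike" "like"   
##  [8] "dislike" "dislike" "like"
```

The result matches the for loop version of the binary recode.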

You try!

Let’s expand our provel example. Write an if/else if/else statement that says:

  • Scores of 1-3 correspond to “dislike”
  • Scores of 4-6 correspond to “neutral”
  • Scores of 7-10 correspond to “like”

Using for loops for bootstrapping

Now we will use for loops in order to bootstrap our data.

As a reminder:

  • In statistics, bootstrapping is any test or metric that uses random sampling with replacement
  • Examples:
    • bootstrapped means (from random samples of a population)
    • bootstrapped confidence intervals
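
To see what “with replacement” means in practice, here is a small sketch using sample(). The seed value and the object names are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed so the draws are reproducible

values <- 1:5

# Without replacement: every value appears exactly once (a shuffle)
no_replace <- sample(values, size = 5, replace = FALSE)

# With replacement: a value can be drawn more than once,
# and other values may not be drawn at all
with_replace <- sample(values, size = 5, replace = TRUE)

print(no_replace)
print(with_replace)
```

Bootstrapping relies on the second kind of draw: each resample is the same size as the original data, but its makeup varies from draw to draw.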

We are going to use bootstrapping to create our own sampling distribution of the mean, and compare the outcomes after a small number of iterations versus a large number of iterations.

Reminder: In statistics, the sampling distribution of the mean is a probability distribution of means drawn from numerous samples of the population.

For this example, I have generated a population of data. The data.frame is called pop_data. It has a single column called PopulationMeans. The true population mean of this dataset is 100.1528973.
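
The code that generated pop_data is not shown here. If you want to follow along, one possible stand-in is sketched below. The distribution (normal), the population size (10,000), the standard deviation (15), and the seed are all assumptions, so the mean you get will not match the value reported above exactly.

```r
set.seed(2021)  # arbitrary seed for reproducibility

# Hypothetical population: 10,000 values drawn from a normal distribution
# with mean 100 and standard deviation 15 (assumed, not the original data)
pop_data <- data.frame(
  PopulationMeans = rnorm(n = 10000, mean = 100, sd = 15)
)

# Mean of the simulated population (should land close to 100)
mean(pop_data$PopulationMeans)
```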

Here is a histogram of the population data:

library(ggplot2)

ggplot(pop_data, aes(x = PopulationMeans)) +
  geom_histogram(color = "black", fill = "plum4") +
  theme_classic() +
  ylab("Count") +
  xlab("Value") +
  ggtitle("Population distribution")

It’s great that we have a population distribution here. But in reality, we almost never have the population data. We only have sample data. Let’s randomly choose 30 people from the same population as above (not with replacement…30 unique individuals) and call it our sample. This is equivalent to saying “Our study had an n = 30 where participants were college students at Wash U enrolled in Psych 101”. It’s not the entire population. It’s a sample of the population.

sample_data <- data.frame(SampleMeans = sample(pop_data$PopulationMeans, size = 30, replace = FALSE))

ggplot(sample_data, aes(x = SampleMeans)) +
  geom_histogram(color = "black", fill = "plum4", binwidth = 1) +
  theme_classic() +
  ylab("Count") +
  xlab("Value") +
  ggtitle("Sample distribution")

Let’s recap. We have a population distribution, and we randomly chose 30 people in our study. But the vast majority of the time, we don’t have the population distribution to work with. We only have the sample. In traditional NHST procedures, we use formulas and equations to estimate the true underlying sampling distribution of a statistic (like the mean). But when we are bootstrapping, there is no need to estimate – we actually go ahead and build the sampling distribution.

When building the sampling distribution, you have to choose how many times/iterations you want to do this. Larger is better (like running more studies as opposed to fewer studies). Let’s illustrate this.

The mean of the true population (the one we almost never have access to) is approximately 100. The mean of our sample of 30 participants (the one we always have access to; equivalent to running a study) is 97.1142558.

We want to bootstrap the mean of our 30 participants to get as close as possible to the true population mean. We’re building up a sampling distribution. We can do this with a for loop where we pull from our sample many times, with replacement. But how many times? For this first attempt, let’s do a small number of samples (10) to create sample_means_small.

# Initialize vector for means
sample_means_small <- NULL

# Use for loop to gather means of each sample
for (i in 1:10) {
  indices <- sample(x = 1:30, size = 30, replace = TRUE)
  subsample <- sample_data[indices, 1]
  sample_means_small[i] <- mean(subsample)
}

Next, we will draw a larger number of samples (100) to create sample_means_medium.

# Initialize vector for means
sample_means_medium <- NULL

# Use for loop to gather means of each sample
for (i in 1:100) {
  indices <- sample(x = 1:30, size = 30, replace = TRUE)
  subsample <- sample_data[indices, 1]
  sample_means_medium[i] <- mean(subsample)
}

To compare, we will do the same process with a larger set of random samples (1,000) to create sample_means_large.

# Initialize vector for means
sample_means_large <- NULL

# Use for loop to gather means of each sample
for (i in 1:1000) {
  indices <- sample(x = 1:30, size = 30, replace = TRUE)
  subsample <- sample_data[indices, 1]
  sample_means_large[i] <- mean(subsample)
}
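
The three loops above differ only in the number of iterations. One way to avoid repeating the code is to wrap the loop in a function, sketched below. The helper name bootstrap_means() and the stand-in data toy_sample are made up for this example; with the real data you would pass sample_data$SampleMeans instead.

```r
# bootstrap_means() is a hypothetical helper, not a built-in R function
bootstrap_means <- function(x, n_iter) {
  means <- numeric(n_iter)  # pre-allocate one slot per iteration
  for (i in seq_len(n_iter)) {
    # Resample the data with replacement, keeping the original sample size
    resample <- sample(x, size = length(x), replace = TRUE)
    means[i] <- mean(resample)
  }
  means
}

set.seed(42)  # arbitrary seed for reproducibility

# Stand-in for sample_data$SampleMeans (assumed values, n = 30)
toy_sample <- rnorm(30, mean = 100, sd = 15)

means_small  <- bootstrap_means(toy_sample, 10)
means_medium <- bootstrap_means(toy_sample, 100)
means_large  <- bootstrap_means(toy_sample, 1000)

length(means_large)
```

Pre-allocating with numeric(n_iter) is a design choice: it behaves the same as starting from NULL, but avoids growing the vector on every iteration.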

Visualizing the three sampling distributions gives us perspective on how the number of iterations makes the sampling distribution approach normality.

Additionally, if we take the mean of the sampled means (in order to estimate the population mean), it will become more accurate as the size of the sampling distribution increases. In the plots below, the solid green line marks the true population mean, and the dashed red line marks the mean of each sampling distribution.

library(ggplot2)
library(ggpubr)  # assumed source of ggarrange()

# Means to mark on the plots
pop_m    <- mean(pop_data$PopulationMeans)
small_m  <- mean(sample_means_small)
medium_m <- mean(sample_means_medium)
large_m  <- mean(sample_means_large)

# geom_histogram() needs a data.frame, so wrap each vector of means in one
plot1 <- ggplot() +
  geom_histogram(data = data.frame(values = sample_means_small), aes(x = values), fill = "plum4", color = "black", binwidth = 1) +
  theme_classic() +
  xlab("Sample means") +
  ylab("Count") +
  geom_vline(xintercept = small_m, linetype = "longdash", color = "red") +
  geom_vline(xintercept = pop_m, color = "green") +
  coord_cartesian(xlim = c(85, 115)) +
  ggtitle("Distribution of 10 samples")

plot2 <- ggplot() +
  geom_histogram(data = data.frame(values = sample_means_medium), aes(x = values), fill = "plum4", color = "black", binwidth = 1) +
  theme_classic() +
  xlab("Sample means") +
  ylab("Count") +
  geom_vline(xintercept = medium_m, linetype = "longdash", color = "red") +
  geom_vline(xintercept = pop_m, color = "green") +
  coord_cartesian(xlim = c(85, 115)) +
  ggtitle("Distribution of 100 samples")

plot3 <- ggplot() +
  geom_histogram(data = data.frame(values = sample_means_large), aes(x = values), fill = "plum4", color = "black") +
  theme_classic() +
  xlab("Sample means") +
  ylab("Count") +
  geom_vline(xintercept = large_m, linetype = "longdash", color = "red") +
  geom_vline(xintercept = pop_m, color = "green") +
  coord_cartesian(xlim = c(85, 115)) +
  ggtitle("Distribution of 1,000 samples")

ggarrange(plot1, plot2, plot3)

Using bootstrapping for confidence intervals

Quick Quiz

OK, based on the quiz above, we need to carry out our sampling procedure a large number of times. That’s exactly what we are doing with bootstrapping! So now all we need to do is find the 2.5% and 97.5% quantiles of the bootstrapped values, and we have our 95% confidence interval. The key point is that you first bootstrap your statistic, and then find the quantiles that correspond to your confidence level.
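
As a rough sketch of that last step, the code below bootstraps the mean of a stand-in sample and reads off the quantiles with quantile(). The data in toy_sample, the seed, and the choice of 1,000 iterations are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Stand-in sample of 30 values (assumed, not the real sample_data)
toy_sample <- rnorm(30, mean = 100, sd = 15)

# Bootstrap the mean 1,000 times
boot_means <- numeric(1000)
for (i in 1:1000) {
  resample <- sample(toy_sample, size = 30, replace = TRUE)
  boot_means[i] <- mean(resample)
}

# The 2.5% and 97.5% quantiles bound the middle 95% of the bootstrapped means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)
```

The same pattern works for other statistics: swap mean() for the statistic you care about, and change probs if you want a different confidence level.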

You try!

Now we’re going to use bootstrapping to calculate the 95% confidence interval for a one-sample t-test. We want to know if our mean is significantly different from 95.

Tips and Instructions:

  • Use sample_data as your dataset
  • Run multiple iterations to get the difference between the sample mean and 95
  • Store and plot those differences
  • Add quantiles at 2.5% and 97.5% to mark the lower and upper bounds of the confidence interval around the difference in means

That’s it for this practice – go forth and sample iteratively!

Massive shout out to the Spring 2021 AI Drew McLaughlin for creating this excellent Practice Set!