Practice Set: Tidyverse

Tidyverse

Without data, there would be no science. Without clean data, there would be no analysis! Just because researchers have data does not mean they can simply throw it into a regression and be done with it.

In reality, researchers spend much more time cleaning and preparing their data than actually analyzing it: by one estimate, up to 80% of the time spent on data analysis tasks goes to simply cleaning the data (Dasu & Johnson, 2003)!

Evidently, data cleaning is vital to research, but it is not as difficult as it may seem! Fortunately, the Tidyverse collection of R packages makes data cleaning far less arduous and painstaking; learning how to work with the Tidyverse will make your life all the more simple.

dplyr

The first (and arguably most important) of the Tidyverse's core packages is dplyr. From this package, we will focus on the five functions that are arguably most vital for data cleaning (a toy sketch chaining all five together appears after the list):

  • filter()
  • select()
  • mutate()
  • summarise()
  • group_by()
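
Before diving in, here is a minimal sketch of how these five verbs chain together with the pipe. The tibble toy is made up purely for illustration; it is not part of the course data:

library(dplyr)

#a made-up data frame, just to show the five verbs chained together
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

toy |>
  filter(x > 1) |>                 #keep rows meeting a condition
  select(g, x) |>                  #keep (or drop) columns
  mutate(x2 = x * 2) |>            #add new columns
  group_by(g) |>                   #set grouping for what follows
  summarise(x2_avg = mean(x2))     #collapse each group to one row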

We will begin by using dplyr on the steam_hw dataset. The data comes from researcher Hao Lin at Tianjin University of Technology (Lin, 2024). Broadly, their research revolves around relationships between Big Five personality traits and online gaming behavior, primarily Steam platform activity.

Loading in the data

When you load in the dataset, the first column (usually labeled …1 if using read_csv()) contains values indexing the row number, which is redundant with the row names of the data frame that already show this.

I like to remove this column to reduce clutter, and we can do that by using the select() function.

Hint

Remember from the lecture that select() can not only be used to choose which columns you want to keep, but also ones you want to get rid of:
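
A minimal sketch, assuming the file is saved as steam_hw.csv in your working directory (adjust the name and path to match your own download):

library(tidyverse)

steam_hw <- read_csv("steam_hw.csv") |>
  select(-...1) #drop the redundant row-index column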

You try!

What is the code for viewing the age column?

Filtering the data

Since this data was collected from a Chinese institution, I may be interested in looking at the gaming behavior of only Chinese participants.

To make sure I am only making inferences about Chinese participants, we need to filter() the data so that only participants from China remain.

We can quickly view the contents of only the country column by using the select() function.

steam_hw |>
  select(country) |>
  print()
## # A tibble: 171 × 1
##    country
##    <chr>  
##  1 china  
##  2 china  
##  3 china  
##  4 china  
##  5 china  
##  6 china  
##  7 china  
##  8 china  
##  9 china  
## 10 china  
## # ℹ 161 more rows

We can see here that there are 3 participants who are not from China. I want to remove them from the data set, and I can do that using filter().

steam_china_1 <- steam_hw |>
  filter(country == "china")

You try!

What if I wanted to have a data frame with only people from Guatemala? Store this as an object called steam_guatemala.

If we look at this new steam_china_1 data frame, our number of participants (i.e., rows) dropped from 171 to 149. That is far more than the 3 participants we meant to remove.

What happened? Is there something wrong with the filter() function?

It turns out the country column is not recorded consistently: some participants from China typed their response with a different spelling or capitalization (e.g., "China" rather than "china"), so filter(country == "china") keeps only the exact lowercase matches. There are a couple of ways we can tackle this, and they involve using logical operators.

We could, instead, filter out the other countries that are not China:

steam_china_2 <- steam_hw |>
  filter(!(country == "Afghanistan" | country == "Australia" | country == "Guatemala"))

There are a couple of things to notice with the above code chunk.

  1. I used ! with filter(): the ! means "not", so I am keeping the observations that do not match the subsequent filter specifications.

  2. Because there are 3 different countries in the country column that I am not interested in, I need to list all 3 in the filter() call. For this to work, I used |, which means "or". So I am keeping only the participants who did not respond with Afghanistan or Australia or Guatemala.

This won't be the most efficient approach, though, if there are many more countries to spell out in the filter() call.
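
A more compact option when there are several values to drop is the %in% operator; a sketch (the object name steam_china_2b is just for illustration):

steam_china_2b <- steam_hw |>
  filter(!(country %in% c("Afghanistan", "Australia", "Guatemala")))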

You try!

What is another way to filter for participants from China that takes this spelling difference into consideration? Call the output steam_china.

Hint: Remember the logical operators.
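
One possible answer, if you want to check your work afterward (this assumes the only two spellings present are "china" and "China"):

steam_china <- steam_hw |>
  filter(country == "china" | country == "China")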

Regardless, we now have a data frame with 168 observations: 3 fewer than the 171 we started with, showing that we have now filtered out the correct number of participants.

You try!

  • If I wanted to filter for participants who are both from China and female, how could I do this? Call the result steam_china_female.

  • What about males who are 18-25 years old? Call the result steam_young_male.

Hint: Remember the logical operators.

Personality variables

One of the standard ways of assessing the Big Five personality traits is the BFI-2 inventory (Soto & John, 2017), which contains 12 items per personality trait (12 items × 5 traits = 60 total items), with each item measured on a scale from 1 to 5 (1 = strongly disagree and 5 = strongly agree).

In case you are not familiar, the Big Five contains these traits:

  • Extraversion
  • Agreeableness
  • Conscientiousness
  • Neuroticism
  • Openness

Looking back at our data frame, we see that participants' scores on each item correspond to the columns labeled with a letter (the trait), a number (the item), and another lowercase letter. Just looking at individual items won't tell us, for example, how extraverted a participant is; instead, we need a variable that combines the relevant items into a single personality trait score.

A simple metric is to take the sum of the numeric responses across a trait's items, giving each participant a combined score for, say, how extraverted they are (e.g., 52 out of a possible 60).

So, let us make 5 new variables containing the sum of each participant's scores on each of the 5 personality traits.

To do so, we need to sum across the items that correspond to each trait, separately (e.g., all Extraversion items summed together, all Agreeableness items, etc.). We can use across(), which applies a function over multiple columns, combined with the selection helper starts_with(), which specifies which columns to sum for each new variable we create.

These functions are all nested within the mutate() command, which will add these new columns to the data frame:

china_sum_1 <- steam_china |>
  mutate(
    Extra_sum = sum(across(starts_with("E_")), na.rm = TRUE),
    Agre_sum = sum(across(starts_with("A_")), na.rm = TRUE),
    Cons_sum = sum(across(starts_with("C_")), na.rm = TRUE),
    Neuro_sum = sum(across(starts_with("N_")), na.rm = TRUE),
    Open_sum = sum(across(starts_with("O_")), na.rm = TRUE)
  )

kable(china_sum_1[c(1:10), c(1, 73:77)])
id   Extra_sum   Agre_sum   Cons_sum   Neuro_sum   Open_sum
 1        5789       7200       6614        5709       7127
 2        5789       7200       6614        5709       7127
 3        5789       7200       6614        5709       7127
 4        5789       7200       6614        5709       7127
 5        5789       7200       6614        5709       7127
 6        5789       7200       6614        5709       7127
 7        5789       7200       6614        5709       7127
 8        5789       7200       6614        5709       7127
 9        5789       7200       6614        5709       7127
10        5789       7200       6614        5709       7127

In this new data frame, we successfully summed across the proper items, but the sums are MUCH larger than we would expect, and they are the same for every participant. That is because sum() collapsed each trait's items across all rows at once: each value is the grand total over all 168 participants. We want the sum for each participant, not the total across all participants.

To do this, we need to make use of the group_by() function:

china_sum_2 <- steam_china |>
  group_by(id) |>
  mutate(
    Extra_sum = sum(across(starts_with("E_")), na.rm = TRUE),
    Agre_sum = sum(across(starts_with("A_")), na.rm = TRUE),
    Cons_sum = sum(across(starts_with("C_")), na.rm = TRUE),
    Neuro_sum = sum(across(starts_with("N_")), na.rm = TRUE),
    Open_sum = sum(across(starts_with("O_")), na.rm = TRUE)
  ) |>
  ungroup()

kable(china_sum_2[c(1:10), c(1, 73:77)])
id   Extra_sum   Agre_sum   Cons_sum   Neuro_sum   Open_sum
 1          52         60         60          15         60
 2          36         38         39          33         37
 3          38         44         46          33         29
 4          36         46         41          34         38
 5          28         46         45          32         40
 6          26         32         39          36         52
 7          38         41         57          26         41
 8          42         40         38          35         38
 9          31         45         19          47         57
10          38         41         41          42         44

By using group_by(), we are specifying that the mutate() command will run for each participant separately. The result is 5 new columns of sums that correspond to each participant's own item data.
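
As an aside, a common alternative that avoids the grouping step entirely is rowSums(), which computes per-row sums directly inside mutate(); a minimal sketch for one trait (the name china_sum_alt is just for illustration):

china_sum_alt <- steam_china |>
  mutate(
    Extra_sum = rowSums(across(starts_with("E_")), na.rm = TRUE)
    #...and likewise for the other four traits
  )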

Group comparison

Now that we have some metric of participants’ personality scores, we can do some analysis!

Since we have a gender variable, I might be interested in looking at personality differences that emerge between gender groups.

We can take a quick glance at such differences using group_by() and summarise():

china_sum_2 |>
  group_by(gender) |>
  summarise(
    Extra_avg = mean(Extra_sum, na.rm = TRUE),
    Agre_avg = mean(Agre_sum, na.rm = TRUE),
    Cons_avg = mean(Cons_sum, na.rm = TRUE),
    Neuro_avg = mean(Neuro_sum, na.rm = TRUE),
    Open_avg = mean(Open_sum, na.rm = TRUE)
  )
## # A tibble: 2 × 6
##   gender Extra_avg Agre_avg Cons_avg Neuro_avg Open_avg
##   <chr>      <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
## 1 female      33.6     42.7     38.1      35.4     45.3
## 2 male        34.9     43.0     40.1      33.1     40.8

Using summarise() across each of the personality trait sum columns we just made, after grouping by our gender variable, we get a 2x6 data frame of the average sum for each personality trait between male and female participants.
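
If typing five nearly identical mean() calls feels repetitive, across() works inside summarise() too; a sketch (note that the .names argument here produces slightly different column names, like Extra_sum_avg):

china_sum_2 |>
  group_by(gender) |>
  summarise(across(ends_with("_sum"), ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_avg"))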

You try!

How could I get this same summary but looking across the different age groups? Call the new columns:

  • Extra_avg
  • Agre_avg
  • Cons_avg
  • Neuro_avg
  • Open_avg

Hint: How do you deal with missing data?

Notice, as well, that the shape of the data frame changed dramatically using summarise().

Quick Quiz

There seem to be virtually no differences between the gender groups for most of the personality sums, but one difference looks relatively large.

We can formally test this using the statistical functions from the previous chapter, Stats & Plot!

fit_Open <- lm(Open_sum ~ factor(gender), data = china_sum_2)

(summ_Open <- summary(fit_Open))
## 
## Call:
## lm(formula = Open_sum ~ factor(gender), data = china_sum_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.7944  -4.7944  -0.7944   5.7213  19.2056 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         45.2787     0.9949  45.509  < 2e-16 ***
## factor(gender)male  -4.4843     1.2467  -3.597 0.000424 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.771 on 166 degrees of freedom
## Multiple R-squared:  0.07231,	Adjusted R-squared:  0.06672 
## F-statistic: 12.94 on 1 and 166 DF,  p-value: 0.0004244

From the summary, we do, indeed, find that there is a significant difference in average Openness between males and females.

And, for better visualization, we can plot this difference! As we learn more about data visualization, you will be able to make plots that look like this:

china_plot <- as.data.frame(predict(fit_Open, interval = "confidence"))

china_plot$gender <- china_sum_2$gender

china_plot |>
  ggplot() +
  geom_bar(aes(x = gender, y = fit, fill = gender), 
           #stat and fun are used to plot the means of the predicted values
           stat = "summary", fun = "mean", width = 0.75, alpha = 0.5) +
  geom_errorbar(aes(x = gender, ymin = lwr, ymax = upr), width = 0.1, linewidth = 0.75) +
  #geom_point() is for adding in the individual data points
  geom_point(aes(x = factor(gender), y = fit, color = gender), 
             position = "jitter", alpha = 0.5) +
  scale_x_discrete(labels = c("female" = "Female", "male" = "Male")) +
  ylim(0, 50) + #changing the limits of the y-axis
  labs(
    title = "Gender Differences in Openness",
    subtitle = "(95% Confidence Intervals)",
    x = "Gender",
    y = "Predicted Openness (Sum)"
  ) +
  theme_classic() +
  #the below code is all for editing the size and positioning of the titles and labels
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12, margin = margin(r = 10)),
        legend.position = "none")

tidyr

Now, using functions from the tidyr package in the Tidyverse, we can make more complex transformations to the data frame to answer more questions about our data.

For this, let’s focus on Neuroticism, the trait revolving around the tendency to experience and regulate negative emotions. Neuroticism has been found to be positively associated with Internet addiction (Kayiş et al., 2016), and people high in Neuroticism tend to use social media to find support (Tang et al., 2016).

So, we can hypothesize that Neuroticism will be positively associated with having more online (Steam) friends. I ran the analysis, and it is!
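
A sketch of what that check could look like (friends is a placeholder name for the Steam friend-count column; run names(china_sum_2) to find the actual one):

#friends is a hypothetical column name; substitute the real friend-count variable
cor.test(china_sum_2$Neuro_sum, china_sum_2$friends)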

Creating composite scores

However, while using the sum of personality items as a trait score is a simple and intuitive metric, that is not what personality researchers typically use. Instead, we compute composite scores, which are the average across the item responses.

In addition, the 12 items for each of the 5 personality traits also belong to different facets: sub-domains within each trait. The BFI-2 measures 3 facets per trait (indicated by the lowercase letter at the end of each item name), with 4 items tapping into each facet. For Neuroticism, the facets are Anxiety, Depression, and Emotional Volatility.

To look at all of these different constructs, we need to create composite scores for each, and we can do this using pivot_longer().

First, let’s look at how our wide data frame looks:

print(china_sum_2[1:10, c(1, 8, 13, 18, 23)])
## # A tibble: 10 × 5
##       id N_1_a N_2_d N_3_e N_4_a
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1     1     1     1     2
##  2     2     2     2     3     4
##  3     3     3     2     3     3
##  4     4     3     2     3     3
##  5     5     4     3     2     5
##  6     6     4     3     3     2
##  7     7     2     2     2     4
##  8     8     3     2     1     4
##  9     9     4     4     3     3
## 10    10     3     2     4     4

Now, let’s see what pivot_longer() does, making a data frame with only our Neuroticism items using select().

#remember to also include our grouping variable "id"
neuro_long <- china_sum_2 |>
  select(id, starts_with("N_")) |>
  pivot_longer(
    cols = -id,
    names_to = "item",
    values_to = "score"
  )

print(neuro_long[1:20, ])
## # A tibble: 20 × 3
##       id item   score
##    <dbl> <chr>  <dbl>
##  1     1 N_1_a      1
##  2     1 N_2_d      1
##  3     1 N_3_e      1
##  4     1 N_4_a      2
##  5     1 N_5_d      1
##  6     1 N_6_e      1
##  7     1 N_7_a      2
##  8     1 N_8_d      1
##  9     1 N_9_e      1
## 10     1 N_10_a     2
## 11     1 N_11_d     1
## 12     1 N_12_e     1
## 13     2 N_1_a      2
## 14     2 N_2_d      2
## 15     2 N_3_e      3
## 16     2 N_4_a      4
## 17     2 N_5_d      3
## 18     2 N_6_e      3
## 19     2 N_7_a      2
## 20     2 N_8_d      3

Now we have a data frame with 2,016 observations (168 participants × 12 items each), and the data frame looks A LOT different!

To get facet scores, R needs to know which items belong to which facet of Neuroticism. We therefore need to separate the ending letter of each item name into its own column using another tidyr function, separate():

facet_long_1 <- neuro_long |>
  separate(col = item,
           into = c("item", "facet"))
## Warning: Expected 2 pieces. Additional pieces discarded in 2016 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
print(facet_long_1[1:20, ])
## # A tibble: 20 × 4
##       id item  facet score
##    <dbl> <chr> <chr> <dbl>
##  1     1 N     1         1
##  2     1 N     2         1
##  3     1 N     3         1
##  4     1 N     4         2
##  5     1 N     5         1
##  6     1 N     6         1
##  7     1 N     7         2
##  8     1 N     8         1
##  9     1 N     9         1
## 10     1 N     10        2
## 11     1 N     11        1
## 12     1 N     12        1
## 13     2 N     1         2
## 14     2 N     2         2
## 15     2 N     3         3
## 16     2 N     4         4
## 17     2 N     5         3
## 18     2 N     6         3
## 19     2 N     7         2
## 20     2 N     8         3

When we do separate() here, however, we lose some information, and R gives us a warning telling us that this happened. Why is this happening?

To prevent this, recall that each item name, such as N_1_a, contains three pieces separated by underscores. We will need separate() to split the names into three columns first, and then we can use unite() to join the trait and item number values back together:

facet_long_2 <- neuro_long |>
  separate(col = item,
           into = c("trait", "item_number", "facet")) |>
  unite(col = "item",
        trait, item_number,
        sep = "_")

print(facet_long_2[1:20, ])
## # A tibble: 20 × 4
##       id item  facet score
##    <dbl> <chr> <chr> <dbl>
##  1     1 N_1   a         1
##  2     1 N_2   d         1
##  3     1 N_3   e         1
##  4     1 N_4   a         2
##  5     1 N_5   d         1
##  6     1 N_6   e         1
##  7     1 N_7   a         2
##  8     1 N_8   d         1
##  9     1 N_9   e         1
## 10     1 N_10  a         2
## 11     1 N_11  d         1
## 12     1 N_12  e         1
## 13     2 N_1   a         2
## 14     2 N_2   d         2
## 15     2 N_3   e         3
## 16     2 N_4   a         4
## 17     2 N_5   d         3
## 18     2 N_6   e         3
## 19     2 N_7   a         2
## 20     2 N_8   d         3

Now, we can explicitly tell R which items belong to which facet of Neuroticism, and we will use this variable to calculate facet-level composite scores.

You try!

We want each participant to have 3 composite (average) scores, one for each facet.

  • What will we need to include in group_by() to do this?
  • What operation within mutate() will we use to make the composites?
  • Show the first 10 rows and all of the columns

Hint: Remember when we used group_by(id) to get the correct column of personality score sums, per person.

By using group_by(id, facet), we group by both of these variables to get a composite score for each facet within each participant's own data.
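
For reference, here is one way to build the facet_long_3 object used below; it is essentially the answer to the exercise above, so try it yourself first:

facet_long_3 <- facet_long_2 |>
  group_by(id, facet) |>
  mutate(facet_comp = mean(score, na.rm = TRUE)) |>
  ungroup()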

We are not done, however, as there is some unnecessary repetition going on in the facet_comp variable. What repetition is going on?

We can take the mean of facet_comp, again grouping by id and facet, which will give us a single score for each facet within each participant:

facet_long_4 <- facet_long_3 |>
  group_by(id, facet) |>
  summarise(
    facet_comp = mean(facet_comp, na.rm = TRUE)
  )

print(facet_long_4[1:10, ])
## # A tibble: 10 × 3
## # Groups:   id [4]
##       id facet facet_comp
##    <dbl> <chr>      <dbl>
##  1     1 a           1.75
##  2     1 d           1   
##  3     1 e           1   
##  4     2 a           2.5 
##  5     2 d           3   
##  6     2 e           2.75
##  7     3 a           3   
##  8     3 d           2   
##  9     3 e           3.25
## 10     4 a           3.25

Why does the mean() function work in this way?

Great! Now, we need to move the facets into columns using pivot_wider(), which will widen the data frame from the long format we transformed it into.

Then, we can use mutate() to get each participant's overall Neuroticism composite score by taking the mean of their 3 facet means:

composite_final <- facet_long_4 |>
  pivot_wider(
    id_cols = id, #specifying the id column that identifies each observation
    names_from = facet,
    values_from = facet_comp
  ) |> #can use pipe here to go into the next operation
  group_by(id) |>
  mutate(
    Neuro_comp = rowMeans(across(c(a:e)), na.rm = TRUE)
  ) |>
   #we can change the names to be more informative
   rename(
     Anxiety = a,
     Depression = d,
     Volatility = e
   )

print(composite_final[1:10, ])
## # A tibble: 10 × 5
## # Groups:   id [10]
##       id Anxiety Depression Volatility Neuro_comp
##    <dbl>   <dbl>      <dbl>      <dbl>      <dbl>
##  1     1    1.75       1          1          1.25
##  2     2    2.5        3          2.75       2.75
##  3     3    3          2          3.25       2.75
##  4     4    3.25       2.5        2.75       2.83
##  5     5    3.75       2.25       2          2.67
##  6     6    3          3.25       2.75       3   
##  7     7    2.75       2          1.75       2.17
##  8     8    3.75       3          2          2.92
##  9     9    4          4.25       3.5        3.92
## 10    10    3.5        3.25       3.75       3.5

You try!

How would I pivot_longer() this final data frame, gathering the facet columns and the overall trait column together? Call this long_final.

Massive shout out to the Fall 2025 AI Derek Simon for creating this excellent Practice Set!