class: center, middle, inverse, title-slide # Tidyr --- # Recap - `tidyverse` is an *opinionated* collection of pacakges - All packages within it's ecosystem use the same syntax: - `%>%` pipe operators at the end of the line read as *"and then"* > "I took my original data.frame %>% > I kept only 5 out of the original 20 columns %>% > I added a new column that was based on the 2nd column %>% > I grouped the data based on a categorical column %>% > I got descriptive statistics per level of the categorical var" --- # Recap - `tidyverse` is an *opinionated* collection of pacakges - All packages within it's ecosystem use the same syntax: - `%>%` pipe operators at the end of the line read as *"and then"* > `originalData %>%` > `select(1:5) %>% ` > `mutate(newVar = sqrt(var1)) %>% ` > `group_by(factorVar) %>%` > `summarize(meanVar = mean(var))` --- # This time Functions from the `tidyr` package (but DON'T memorize which functions come from which packages!) - Go from long to wide format - Split columns and combine them - Missing data --- name: lw # Long vs. Wide data **Long data** - Each column is a variable and each row is an observation. Each row does NOT need to be a unique participant. **Wide data** - Each row is a particular participant, and columns can contain multiple observations for the same data. .pull-left[ <img src="12-slides_files/figure-html/widelong.png", width = "40%"> .tiny[https://swcarpentry.github.io/r-novice-gapminder/14-tidyr/ ] ] -- .pull-right[ ``` ## Subject Time1 Time2 Time3 ## 1 1 0.2 0.4 0.3 ## 2 2 0.8 0.9 0.7 ## 3 3 1.3 1.0 1.1 ``` ``` ## Subject TimePoint Score ## 1 1 1 0.2 ## 2 2 1 0.8 ## 3 3 1 1.3 ## 4 1 2 0.4 ## 5 2 2 0.9 ## 6 3 2 1.0 ## 7 1 3 0.3 ## 8 2 3 0.7 ## 9 3 3 1.1 ``` ] --- # Long vs. Wide data For the most part, you want your data to be in the **long** format - Especially for plotting in `ggplot2`! - (some analyses, like reliability, require the wide format, but most stick with long) However, we often receive data in the wide format. It is useful to be able to go between the two. `tidyr` makes this easy with: - `pivot_wider()` to go from long to wide - `pivot_longer()` to go from wide to long --- name: pw # `pivot_wider()` function This function takes in long data and makes it wide. Important arguments: - `names_from = ` which columns to get the *name* of the output column. - `values_from = ` which columns to get the *value* of the output column. --- # `pivot_wider()` function Let's take the example data.frame I showed earlier. Since it's completely arbitrary, I'm going to call it `generic` ```r generic ``` ``` ## Subject TimePoint Score ## 1 1 1 0.2 ## 2 2 1 0.8 ## 3 3 1 1.3 ## 4 1 2 0.4 ## 5 2 2 0.9 ## 6 3 2 1.0 ## 7 1 3 0.3 ## 8 2 3 0.7 ## 9 3 3 1.1 ``` --- # `pivot_wider()` function This `generic` data.frame is in the **long** format. To make it into the **wide** format, let's use `pivot_wider()` ```r wideGeneric <- generic %>% pivot_wider(names_from = TimePoint, values_from = Score) wideGeneric ``` ``` ## # A tibble: 3 x 4 ## Subject `1` `2` `3` ## <dbl> <dbl> <dbl> <dbl> ## 1 1 0.2 0.4 0.3 ## 2 2 0.8 0.9 0.7 ## 3 3 1.3 1 1.1 ``` --- # `pivot_wider()` function Sometimes, it's a bit more complicated. Let's add some more variables to `generic` to test this out. - `hairColor` factor with 2 levels (brown & blonde) - `happiness` scale of 1 to 10 measured at each time point ```r generic <- generic %>% mutate(hairColor = rep(c("brown", "blonde", "blonde"), times = 3), happiness = c(10, 2, 6, 9, 2, 5, 10, 3, 4)) generic ``` ``` ## Subject TimePoint Score hairColor happiness ## 1 1 1 0.2 brown 10 ## 2 2 1 0.8 blonde 2 ## 3 3 1 1.3 blonde 6 ## 4 1 2 0.4 brown 9 ## 5 2 2 0.9 blonde 2 ## 6 3 2 1.0 blonde 5 ## 7 1 3 0.3 brown 10 ## 8 2 3 0.7 blonde 3 ## 9 3 3 1.1 blonde 4 ``` --- # `pivot_wider()` function Now, let's say we want each time point's `Score` and `happiness` variables in the wide format... ```r wideGenericMore <- generic %>% pivot_wider(names_from = TimePoint, values_from = c(Score, happiness)) wideGenericMore ``` ``` ## # A tibble: 3 x 8 ## Subject hairColor Score_1 Score_2 Score_3 happiness_1 happiness_2 happiness_3 ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 brown 0.2 0.4 0.3 10 9 10 ## 2 2 blonde 0.8 0.9 0.7 2 2 3 ## 3 3 blonde 1.3 1 1.1 6 5 4 ``` --- name: pl # `pivot_longer()` function The exact opposite of `pivot_wider()` is `pivot_longer`. This takes a wide data.frame and makes it into a **long** data.frame. Arguments are now `names_to =` and `values_to =`. You also need to include a `cols =` argument to say which columns you want into the longer format. Before doing this with code, here's a schematic that might be helpful: <img src="12-slides_files/figure-html/tx.png", width = "100%"> .tiny[https://swcarpentry.github.io/r-novice-gapminder/14-tidyr/] --- # `pivot_longer()` function Let's keep going with our current example, starting from `wideGenericMore` .medium[] ```r longGeneric <- wideGenericMore %>% pivot_longer(cols = 3:8, names_to = "valueType", values_to = "allScores") longGeneric ``` ``` ## # A tibble: 18 x 4 ## Subject hairColor valueType allScores ## <dbl> <chr> <chr> <dbl> ## 1 1 brown Score_1 0.2 ## 2 1 brown Score_2 0.4 ## 3 1 brown Score_3 0.3 ## 4 1 brown happiness_1 10 ## 5 1 brown happiness_2 9 ## 6 1 brown happiness_3 10 ## 7 2 blonde Score_1 0.8 ## 8 2 blonde Score_2 0.9 ## 9 2 blonde Score_3 0.7 ## 10 2 blonde happiness_1 2 ## 11 2 blonde happiness_2 2 ## 12 2 blonde happiness_3 3 ## 13 3 blonde Score_1 1.3 ## 14 3 blonde Score_2 1 ## 15 3 blonde Score_3 1.1 ## 16 3 blonde happiness_1 6 ## 17 3 blonde happiness_2 5 ## 18 3 blonde happiness_3 4 ``` ] --- # `pivot_longer()` function For both of these `pivot` functions, you can use the `-` (minus) sign to say "everything except this column". For example: ```r longGeneric <- wideGenericMore %>% pivot_longer(cols = c(-hairColor, -Subject), names_to = "valueType", values_to = "allScores") longGeneric ``` ``` ## # A tibble: 18 x 4 ## Subject hairColor valueType allScores ## <dbl> <chr> <chr> <dbl> ## 1 1 brown Score_1 0.2 ## 2 1 brown Score_2 0.4 ## 3 1 brown Score_3 0.3 ## 4 1 brown happiness_1 10 ## 5 1 brown happiness_2 9 ## 6 1 brown happiness_3 10 ## 7 2 blonde Score_1 0.8 ## 8 2 blonde Score_2 0.9 ## 9 2 blonde Score_3 0.7 ## 10 2 blonde happiness_1 2 ## 11 2 blonde happiness_2 2 ## 12 2 blonde happiness_3 3 ## 13 3 blonde Score_1 1.3 ## 14 3 blonde Score_2 1 ## 15 3 blonde Score_3 1.1 ## 16 3 blonde happiness_1 6 ## 17 3 blonde happiness_2 5 ## 18 3 blonde happiness_3 4 ``` --- # The `pivot` functions Some things to notice: - In `pivot_longer`, the arguments take in strings (aka, need quotations!). That's because you need to tell R what to name something. - In `pivot_wider`, the arguments take in variable names that already exist. So you do not need to wrap those in quotation marks. - These are the types of functions that I mess up ALL. THE. TIME. Use your History tab! --- name: sep # `separate()` function In our latest iteration, `longGeneric`, we have a column called `valueType` where it is a name, then an underscore (`_`), and a number, ex: `Score_1`. We can use `separate()` to make split `valueType` into 2 separate columns...1 for the `Score` and another for the `1`. ```r longGeneric %>% separate(col = valueType, into = c("variableName", "timePoint")) ``` ``` ## # A tibble: 18 x 5 ## Subject hairColor variableName timePoint allScores ## <dbl> <chr> <chr> <chr> <dbl> ## 1 1 brown Score 1 0.2 ## 2 1 brown Score 2 0.4 ## 3 1 brown Score 3 0.3 ## 4 1 brown happiness 1 10 ## 5 1 brown happiness 2 9 ## 6 1 brown happiness 3 10 ## 7 2 blonde Score 1 0.8 ## 8 2 blonde Score 2 0.9 ## 9 2 blonde Score 3 0.7 ## 10 2 blonde happiness 1 2 ## 11 2 blonde happiness 2 2 ## 12 2 blonde happiness 3 3 ## 13 3 blonde Score 1 1.3 ## 14 3 blonde Score 2 1 ## 15 3 blonde Score 3 1.1 ## 16 3 blonde happiness 1 6 ## 17 3 blonde happiness 2 5 ## 18 3 blonde happiness 3 4 ``` --- # `separate()` function Note that I did not specify that I wanted to separate based on the underscore. - When it is simple like this, R can automatically detect it. - But if it's a bit trickier, you can specify how to separate in the `sep =` argument. - For example, `sep = ": "` if you want to separate on a colon + space. --- name: un # `unite()` function The opposite of separate is `unite()`. For instance, let's say we want to create a variable called `bogus` that looks something like `brown: Score` or `blonde: happiness`. The separator is a colon + space. ```r longGeneric %>% unite(col = "bogus", hairColor, valueType, sep = ": ") ``` ``` ## # A tibble: 18 x 3 ## Subject bogus allScores ## <dbl> <chr> <dbl> ## 1 1 brown: Score_1 0.2 ## 2 1 brown: Score_2 0.4 ## 3 1 brown: Score_3 0.3 ## 4 1 brown: happiness_1 10 ## 5 1 brown: happiness_2 9 ## 6 1 brown: happiness_3 10 ## 7 2 blonde: Score_1 0.8 ## 8 2 blonde: Score_2 0.9 ## 9 2 blonde: Score_3 0.7 ## 10 2 blonde: happiness_1 2 ## 11 2 blonde: happiness_2 2 ## 12 2 blonde: happiness_3 3 ## 13 3 blonde: Score_1 1.3 ## 14 3 blonde: Score_2 1 ## 15 3 blonde: Score_3 1.1 ## 16 3 blonde: happiness_1 6 ## 17 3 blonde: happiness_2 5 ## 18 3 blonde: happiness_3 4 ``` --- name: missing # Missing values in `tidyverse` - Like `base R` and others, many `tidyverse` functions have an argument for `na.rm =`. - You can add a `drop_na()` function to your `tidyverse` chunk. This function is part of `tidyr` and it will get rid of any rows that contain missing values. It's the equivalent of `na.omit()` - Do everything in your power to make sure missing values are treated as `NA` and *not* something else. Ex: - `999` -- Many measurements can have a value of 999... - `" "` -- Spaces are treated as a character string, not truly missing! Remember, the class of your object is based on the least specific object. So if you have a vector of integers, but one missing value that is `" "`, the class of your vector will be a character! Same thing goes for `.` (periods). - If you have something like `999` and you want to replace that with an `NA`, either of the following will work: - `data[data == 999] <- NA` (for the entire dataset) - `data$column[data$column == 999] <- NA` (for a single column) - `data <- gsub(pattern = 999, replacement = NA, x = data)` (but this will find anything with 999, so be careful!)