Dplyr

class: center, middle, inverse, title-slide

# Dplyr

---

# What is the `tidyverse`?

.pull-left[
<img src="11-slides_files/figure-html/tidyverse.png" width="100%">
]

.pull-right[
> "The `tidyverse` is an **opinionated** collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."

]
---
# Plan for today

- Learn basic syntax for nearly all `tidyverse` packages
- Introduce functions that come from the `dplyr` package
  - `filter()`
  - `select()`
  - `mutate()`
  - `summarize()`
  - `group_by()`
---

# About the MIDUS dataset

Variables available in this data file:
  - **Demographic variables**: age, sex
  - **Physical health variables**: self-rated physical health, heart problems, father had heart attack, BMI
  - **Mental health variables**: self-rated mental health, self-esteem, life satisfaction *(life overall, work, health, relationship with spouse/partner, relationship with children)*, hostility *(stress reactivity & aggression)*

Please load in `midus`, then:
- Make sure the variables `sex`, `heart_self`, and `heart_father` are `factor()` variables (rather than characters)
- Use the same `na.omit()` function as before to remove all `NA` values
---
name: syntax

# Syntax & Piping

- All of the `tidyverse` packages use **piping** as a way to make code easier to read.

- Think of it kind of like making a cohesive paragraph of code, rather than scribbling down a bunch of random lines.

- The format looks like this:

```r
originalData %>% 
  function1(someVariable) %>% 
  function2(someVariable) %>% 
  function3(someVariable)
```

---
# Syntax & Piping

```r
*originalData %>%
  function1(someVariable) %>% 
  function2(someVariable) %>% 
  function3(someVariable)
```

The first thing that enters is your original data.frame. The line ends with the `%>%` symbol, which is called a **pipe**.
---
# Syntax & Piping

```r
originalData %>% 
* function1(someVariable) %>%
  function2(someVariable) %>% 
  function3(someVariable)
```

Next up is some function that is performed on a variable. This variable COMES FROM the `originalData` data.frame. Another way to think about it is that the function *inherits* the data.frame from above. That means you don't need to keep re-typing `originalData`.

Again, the end of the line is followed by the `%>%` pipe operator. 
---
# Syntax & Piping

```r
originalData %>% 
  function1(someVariable) %>% 
* function2(someVariable) %>%
  function3(someVariable)
```

Same thing for the next function. However, instead of inheriting from `originalData`, function 2 will inherit *the output* of function 1!

Again, the end of the line is followed by the `%>%` pipe operator.

---
# Syntax & Piping

```r
originalData %>% 
  function1(someVariable) %>% 
  function2(someVariable) %>% 
* function3(someVariable)
```

Finally, we get to function 3. It will inherit *the output* of function 2.

Notice that there is no `%>%` pipe operator at the end of this line. That's because this "paragraph" of code is now over.
---
# Syntax & Piping
- These `%>%` pipes are used to perform **SEQUENTIAL** tasks!
- You can read the `%>%` as *and then...*

- Don't use `<-` *inside* the piped function. Only at the very beginning if you want to store the output.
- Keep `%>%` at the *end* of each line! Not at the beginning.
- Shortcut for inserting pipe:
  - <kbd>command</kbd> + <kbd>shift</kbd> + <kbd>m</kbd> for Mac users 
  - <kbd>control</kbd> + <kbd>shift</kbd> + <kbd>m</kbd> for Windows users
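
To make the *and then...* reading concrete, here is a minimal sketch that runs the same three steps twice: once nested, once piped. The `scores` vector is made up for illustration and is not part of `midus`.

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

# Hypothetical numbers, standing in for a column of real data
scores <- c(4, 9, 16, 25)

# Nested version: you have to read it from the inside out
nested <- round(mean(sqrt(scores)), 1)

# Piped version: take scores, and then sqrt, and then mean, and then round
piped <- scores %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same answer
identical(nested, piped)
```

Notice that the `<-` appears only once, at the very beginning of the piped version.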

---
name: filter

# `filter()` Function
To illustrate how this works, let's start with the `filter()` function. `filter()` is another way to subset your data.frame based on some condition. It is the `tidyverse` equivalent of `subset()`.

Let's say we want to make a new data.frame that includes only female participants...

```r
femaleMidus <- midus %>% 
  filter(sex == "Female")
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> BMI </th>
   <th style="text-align:right;"> physical_health_self </th>
   <th style="text-align:right;"> mental_health_self </th>
   <th style="text-align:right;"> self_esteem </th>
   <th style="text-align:right;"> life_satisfaction </th>
   <th style="text-align:right;"> hostility </th>
   <th style="text-align:left;"> heart_self </th>
   <th style="text-align:left;"> heart_father </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 10011 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 52 </td>
   <td style="text-align:right;"> 25.991 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 41 </td>
   <td style="text-align:right;"> 7.000 </td>
   <td style="text-align:right;"> 5.5 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10015 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 53 </td>
   <td style="text-align:right;"> 32.121 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 7.375 </td>
   <td style="text-align:right;"> 6.0 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10023 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 24.752 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 34 </td>
   <td style="text-align:right;"> 6.500 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10028 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 63 </td>
   <td style="text-align:right;"> 24.049 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42 </td>
   <td style="text-align:right;"> 8.875 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10030 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 56 </td>
   <td style="text-align:right;"> 27.342 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 37 </td>
   <td style="text-align:right;"> 8.750 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10038 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 57 </td>
   <td style="text-align:right;"> 39.598 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 26 </td>
   <td style="text-align:right;"> 7.125 </td>
   <td style="text-align:right;"> 9.5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
</tbody>
</table>
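
Since `filter()` is described as the `tidyverse` equivalent of `subset()`, it may help to see both side by side. This is only a sketch: `toy` is a small, made-up stand-in for `midus`, so the values are not real data.

```r
library(dplyr)

# Hypothetical stand-in for midus (values are invented)
toy <- data.frame(
  ID  = 1:5,
  sex = factor(c("Female", "Male", "Female", "Male", "Female")),
  age = c(52, 61, 78, 49, 63)
)

# tidyverse way
viaFilter <- toy %>%
  filter(sex == "Female")

# base R way
viaSubset <- subset(toy, sex == "Female")

# Same participants either way
viaFilter$ID
viaSubset$ID
```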
---
# Spelling/capitalization etc. always count
Let's say we want to make a new data.frame that includes only female participants...

```r
femaleMidus <- midus %>% 
  filter(sex == "female")
```

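This runs without an error, but the result has zero rows: no participant has `sex` recorded as `"female"` with a lowercase "f".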
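
A small sketch of the capitalization problem, assuming a made-up `toy` data.frame rather than the real `midus` file. The point is that R does not warn you; you simply get nothing back.

```r
library(dplyr)

# Hypothetical data (not the real midus file)
toy <- data.frame(
  ID  = 1:4,
  sex = c("Female", "Male", "Female", "Male")
)

# Lowercase "female" matches nothing: no error, just zero rows
wrongCase <- toy %>%
  filter(sex == "female")
nrow(wrongCase)

# Look at the values that actually occur before you filter
unique(toy$sex)

# Matching the spelling exactly gives the rows we wanted
rightCase <- toy %>%
  filter(sex == "Female")
nrow(rightCase)
```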
---

# Now with multiple logical operators
Let's say we want to make a new data.frame that includes male participants who have reported having some form of heart problem and are over the age of 50.

```r
oldMenHeart <- midus %>% 
  filter(sex == "Male" & heart_self == "Yes" & age > 50)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> BMI </th>
   <th style="text-align:right;"> physical_health_self </th>
   <th style="text-align:right;"> mental_health_self </th>
   <th style="text-align:right;"> self_esteem </th>
   <th style="text-align:right;"> life_satisfaction </th>
   <th style="text-align:right;"> hostility </th>
   <th style="text-align:left;"> heart_self </th>
   <th style="text-align:left;"> heart_father </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 10039 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 53 </td>
   <td style="text-align:right;"> 31.872 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 35 </td>
   <td style="text-align:right;"> 7.000 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10067 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 62 </td>
   <td style="text-align:right;"> 29.254 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 36 </td>
   <td style="text-align:right;"> 7.625 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10088 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 79 </td>
   <td style="text-align:right;"> 29.289 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 34 </td>
   <td style="text-align:right;"> 8.250 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10131 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 71 </td>
   <td style="text-align:right;"> 24.826 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 43 </td>
   <td style="text-align:right;"> 8.000 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10143 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 57 </td>
   <td style="text-align:right;"> 25.105 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 35 </td>
   <td style="text-align:right;"> 8.667 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10173 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 58 </td>
   <td style="text-align:right;"> 28.481 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 49 </td>
   <td style="text-align:right;"> 9.500 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
</tbody>
</table>
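
One thing worth knowing: inside `filter()`, separating conditions with commas behaves like joining them with `&`. The sketch below checks that on a made-up `toy` data.frame (the column names mirror `midus`, the values are invented) and also shows `|` for "either one".

```r
library(dplyr)

# Hypothetical data with midus-like column names
toy <- data.frame(
  ID         = 1:6,
  sex        = c("Male", "Male", "Female", "Male", "Female", "Male"),
  heart_self = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  age        = c(53, 62, 71, 45, 58, 79)
)

# All three conditions must be TRUE
withAmpersand <- toy %>%
  filter(sex == "Male" & heart_self == "Yes" & age > 50)

# Commas between conditions mean the same thing as &
withCommas <- toy %>%
  filter(sex == "Male", heart_self == "Yes", age > 50)

identical(withAmpersand$ID, withCommas$ID)

# | keeps a row if EITHER condition is TRUE
eitherOne <- toy %>%
  filter(heart_self == "Yes" | age > 60)
eitherOne$ID
```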

---
# Is `tidyverse` totally different from `base R`?

**No!** You still have:
  - objects
  - assignment of objects
  - functions
  - functions that take in arguments
  - logical operators like `==` and `>`
  - multiple logical operators like `&` and `|`

The only thing that's different is the inclusion of `%>%` and the way you build your "code paragraphs". All of the principles that we've learned thus far still apply to everything in the `tidyverse`.
---
name: select

# `select()` function
This is another way to select variables. It can replace indexing, which is helpful when you are in these `tidyverse` code chunks (or paragraphs).

This function can take in column indexes, variable names, or both!

```r
# first 3 columns only!
firstThree <- midus %>% 
  select(1:3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 10001 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 61 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 10002 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 69 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 10011 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 52 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:right;"> 10015 </td>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 53 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:right;"> 10018 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 49 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:right;"> 10019 </td>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 51 </td>
  </tr>
</tbody>
</table>
---
# `select()` function

```r
# BMI, both heart_self and heart_father
otherThree <- midus %>% 
  select(BMI, 10:11)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> BMI </th>
   <th style="text-align:left;"> heart_self </th>
   <th style="text-align:left;"> heart_father </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 26.263 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 24.077 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 25.991 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:right;"> 32.121 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:right;"> 22.499 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:right;"> 29.987 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
</tbody>
</table>
---

# `select()` function
To remove a variable, put a `-` (minus) sign in front of the variable you want to get rid of

```r
# Keep all variables EXCEPT sex & physical_health_self
removal <- midus %>% 
  select(-sex, -5)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> BMI </th>
   <th style="text-align:right;"> mental_health_self </th>
   <th style="text-align:right;"> self_esteem </th>
   <th style="text-align:right;"> life_satisfaction </th>
   <th style="text-align:right;"> hostility </th>
   <th style="text-align:left;"> heart_self </th>
   <th style="text-align:left;"> heart_father </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 10001 </td>
   <td style="text-align:right;"> 61 </td>
   <td style="text-align:right;"> 26.263 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 42 </td>
   <td style="text-align:right;"> 7.750 </td>
   <td style="text-align:right;"> 5.5 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 10002 </td>
   <td style="text-align:right;"> 69 </td>
   <td style="text-align:right;"> 24.077 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 34 </td>
   <td style="text-align:right;"> 8.250 </td>
   <td style="text-align:right;"> 6.0 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 10011 </td>
   <td style="text-align:right;"> 52 </td>
   <td style="text-align:right;"> 25.991 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 41 </td>
   <td style="text-align:right;"> 7.000 </td>
   <td style="text-align:right;"> 5.5 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:right;"> 10015 </td>
   <td style="text-align:right;"> 53 </td>
   <td style="text-align:right;"> 32.121 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 31 </td>
   <td style="text-align:right;"> 7.375 </td>
   <td style="text-align:right;"> 6.0 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:right;"> 10018 </td>
   <td style="text-align:right;"> 49 </td>
   <td style="text-align:right;"> 22.499 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 41 </td>
   <td style="text-align:right;"> 8.500 </td>
   <td style="text-align:right;"> 6.0 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:right;"> 10019 </td>
   <td style="text-align:right;"> 51 </td>
   <td style="text-align:right;"> 29.987 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 38 </td>
   <td style="text-align:right;"> 7.625 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
  </tr>
</tbody>
</table>
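
Column positions like `10:11` or `-5` work, but they silently point at different variables if the column order ever changes. As a hedged alternative, here is a sketch that selects by name only, using a name range and the `starts_with()` helper. The `toy` data.frame is invented and only borrows a few `midus` column names.

```r
library(dplyr)

# Hypothetical data borrowing some midus column names
toy <- data.frame(
  ID           = 1:3,
  sex          = c("Male", "Female", "Male"),
  age          = c(61, 52, 49),
  BMI          = c(26.3, 26.0, 22.5),
  heart_self   = c("No", "No", "Yes"),
  heart_father = c("No", "Yes", "No")
)

# A range of names: everything from ID through age
byRange <- toy %>%
  select(ID:age)

# BMI plus every column whose name starts with "heart"
byHelper <- toy %>%
  select(BMI, starts_with("heart"))

# Drop columns by name instead of by position
dropped <- toy %>%
  select(-sex, -BMI)

names(byRange)
names(byHelper)
names(dropped)
```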
---
name: mutate

# `mutate()` function
`mutate()` is kind of tricky. On its own, it simply adds a new variable to the end of your data.frame, computed from something you specify.

For example, if we wanted to get the square root of BMI...

```r
sqrtMidus <- midus %>% 
  mutate(BMI_sqrt = sqrt(BMI))

head(sqrtMidus)
```

```
##      ID    sex age    BMI physical_health_self mental_health_self self_esteem
## 1 10001   Male  61 26.263                    2                  4          42
## 2 10002   Male  69 24.077                    5                  5          34
## 3 10011 Female  52 25.991                    5                  4          41
## 4 10015 Female  53 32.121                    3                  3          31
## 5 10018   Male  49 22.499                    4                  4          41
## 6 10019   Male  51 29.987                    4                  5          38
##   life_satisfaction hostility heart_self heart_father BMI_sqrt
## 1             7.750       5.5         No           No 5.124744
## 2             8.250       6.0         No          Yes 4.906832
## 3             7.000       5.5         No           No 5.098137
## 4             7.375       6.0         No          Yes 5.667539
## 5             8.500       6.0         No           No 4.743311
## 6             7.625       4.5         No           No 5.476039
```
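
You are not limited to one new variable per `mutate()`. In this sketch (made-up `toy` data, hypothetical variable names), the second new variable is built from the first one, which works because the variables are created in order.

```r
library(dplyr)

# Hypothetical BMI values chosen so the square roots are whole numbers
toy <- data.frame(
  ID  = 1:3,
  BMI = c(16, 25, 36)
)

twoNew <- toy %>%
  mutate(BMI_sqrt = sqrt(BMI),                           # first new variable
         BMI_sqrt_centered = BMI_sqrt - mean(BMI_sqrt))  # uses the one above

names(twoNew)
twoNew$BMI_sqrt_centered
```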
---
# `mutate()` function
BUT, you can add different endings (suffixes) to it
  - `mutate_at()`
  - `mutate_all()`
  - `mutate_if()`

I find `mutate_at()` to be the most useful, personally. It is especially nice for making sure the variables you need to be factors are actually factors!

.small[Note: you can add suffixes `_at`, `_all`, and `_if` to many `tidyverse` functions! `mutate()` happens to be the one where I find this most useful, so I'm using it as an example.]
---
# `mutate()` function
For example, to set up the `midus` data.frame, you were asked to make sure that `sex`, `heart_self`, and `heart_father` were all considered factors. Your code probably looked something like:

```r
midus$sex <- factor(midus$sex)
midus$heart_self <- factor(midus$heart_self)
midus$heart_father <- factor(midus$heart_father)
```
--
When instead, it could look something like this:

```r
midus <- midus %>% 
  mutate_at(vars(2, 10, 11), list(factor))
```

- `vars(2, 10, 11)` says "OK, I'm going to mutate some variables. Which ones?" 
- `list(factor)` says, "give me a list of functions you want me to apply to each of the variables you fed me"

.tiny[Note: I have found that the help documentation for some of these functions has not always been updated to match. Search the internet and pay attention to your package version number.]
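
On the version-number point: in `dplyr` 1.0.0 and later, the `_at`, `_all`, and `_if` variants still run but are marked as superseded in favor of `across()`. The sketch below assumes you have `dplyr` 1.0.0 or newer installed and uses a made-up `toy` data.frame; it is one possible rewrite, not a replacement for the code in these slides.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Hypothetical character columns that should be factors
toy <- data.frame(
  ID           = 1:3,
  sex          = c("Male", "Female", "Male"),
  heart_self   = c("No", "No", "Yes"),
  heart_father = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# across() takes the columns first, then the function to apply to each
toy <- toy %>%
  mutate(across(c(sex, heart_self, heart_father), factor))

sapply(toy, class)
```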
---
# THERE IS NO RIGHT WAY TO CODE!
Whether you used this...

```r
midus$sex <- factor(midus$sex)
midus$heart_self <- factor(midus$heart_self)
midus$heart_father <- factor(midus$heart_father)
```

...or this...

```r
midus <- midus %>% 
  mutate_at(vars(2, 10, 11), list(factor))
```

...**doesn't matter at all!** The only things that count are:
  - Were you able to do what you wanted to? 
  - Can YOU read the code and know what it's doing?
  - Can SOMEONE ELSE read the code and know what it's doing?
  
---

# A `filter()` & `mutate_at()` example

Let's say we `filter()` so that we only have females in our data.frame.

```r
femalesOnly <- midus %>% 
  filter(sex == "Female")
```

In our new data.frame, the variable `sex` should only have 1 level for "Female". That is, all the "Male" participants have been removed. So as a factor, there should only be 1 category or 1 level. Let's check:

```r
levels(femalesOnly$sex)
```

```
## [1] "Female" "Male"
```

Uh oh! That's not quite right.

---

# A `filter()` & `mutate_at()` example

Let's tell R to make `sex` into a factor again (kind of like re-populating the variable).

```r
femalesOnly <- midus %>% 
  filter(sex == "Female") %>% 
  mutate_at(vars(sex), list(factor))

# check the levels again
levels(femalesOnly$sex)
```

```
## [1] "Female"
```

Now we got it! You could have first done the `filter()` function, ended the code chunk/paragraph, and then typed:
`femalesOnly$sex <- factor(femalesOnly$sex)`. The downside is that your steps end up split across separate statements; it's nicer to keep all your functions (verbs/actions) in one place, if you can.
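
Another option, offered here as a sketch rather than the approach used in these slides, is base R's `droplevels()`, which removes unused levels from every factor in a data.frame. It can sit at the end of the same code paragraph. The `toy` data.frame is made up.

```r
library(dplyr)

# Hypothetical data with sex stored as a factor
toy <- data.frame(
  ID  = 1:4,
  sex = factor(c("Female", "Male", "Female", "Male"))
)

femalesOnly <- toy %>%
  filter(sex == "Female") %>%
  droplevels()   # drop the now-empty "Male" level

levels(femalesOnly$sex)
```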
---
name: summary

# `summarize()` function

This is great for summarizing your data *(shocking, I know 😮)*

Remember that awfulness for making bar plots? This is how we can do it easily!

```r
midus %>% 
  summarize(meanAge = mean(age))
```

```
##    meanAge
## 1 56.09118
```

---
# `summarize()` function
You can go crazy with this!

```r
midus %>% 
  summarize(meanAge = mean(age), # mean
            sdAge = sd(age), # standard deviation
            varAge = var(age), # variance
            medianAge = median(age)) # median
```

```
##    meanAge    sdAge   varAge medianAge
## 1 56.09118 12.30031 151.2976        55
```

.box-inv-4.small[Fun fact: the person who wrote many of the `tidyverse` packages is from New Zealand, where they use British spellings. Therefore, `summarise()` is the exact same thing as `summarize()`. Your tab-complete might fill in the British versions!]
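
The examples above work cleanly because `midus` already had its `NA` values removed. As a sketch of what happens otherwise, the made-up `toy` data below contains one missing age. It also uses `n()`, a `dplyr` helper that counts rows.

```r
library(dplyr)

# Hypothetical ages, one of them missing
toy <- data.frame(age = c(50, 60, NA, 70))

result <- toy %>%
  summarize(n = n(),                             # number of rows
            meanAge = mean(age),                 # NA, because of the missing value
            meanAgeNoNA = mean(age, na.rm = TRUE))  # ignores the missing value

result
```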

---
name: group

# `group_by()` function
We can make `summarize()` even more powerful by adding the `group_by()` function.

You will NOT see anything directly change in your data.frame if you were to just run this function. However, on the back end (behind the scenes), it tells R to do something *for each level of a categorical variable*.

If we want the mean age of those with and without heart problems:

```r
midus %>% 
  group_by(heart_self) %>% 
  summarize(meanAge = mean(age))
```

```
## # A tibble: 2 x 2
##   heart_self meanAge
##   <fct>        <dbl>
## 1 No            54.6
## 2 Yes           63.0
```
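
To see the "behind the scenes" part for yourself, here is a sketch on made-up `toy` data: the rows and columns are untouched after `group_by()`, but the data.frame now remembers its grouping.

```r
library(dplyr)

# Hypothetical data with midus-like column names
toy <- data.frame(
  heart_self = c("No", "No", "Yes", "Yes"),
  age        = c(50, 60, 70, 80)
)

grouped <- toy %>%
  group_by(heart_self)

# Same rows, same columns, same values...
nrow(grouped)
names(grouped)

# ...but the grouping is stored alongside the data
is_grouped_df(grouped)
group_vars(grouped)
```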
---
# `group_by()` function
We can go crazy with this too!

```r
midus %>% 
  group_by(heart_self, sex) %>% 
  summarize(meanAge = mean(age),
            sdAge = sd(age),
            meanBMI = mean(BMI),
            sdBMI = sd(BMI))
```

```
## # A tibble: 4 x 6
## # Groups:   heart_self [2]
##   heart_self sex    meanAge sdAge meanBMI sdBMI
##   <fct>      <fct>    <dbl> <dbl>   <dbl> <dbl>
## 1 No         Female    54.9  12.3    27.5  6.42
## 2 No         Male      54.3  11.5    28.2  4.74
## 3 Yes        Female    61.2  11.8    28.0  6.60
## 4 Yes        Male      64.6  11.1    28.9  4.92
```
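
Look at the `# Groups:   heart_self [2]` line in that output: after grouping by two variables, `summarize()` removes only the last one, so the result is still grouped by `heart_self`. If that would surprise a later step, `ungroup()` clears it. A sketch with made-up `toy` data:

```r
library(dplyr)

# Hypothetical data with midus-like column names
toy <- data.frame(
  heart_self = c("No", "No", "Yes", "Yes", "No", "Yes"),
  sex        = c("Female", "Male", "Female", "Male", "Female", "Male"),
  age        = c(50, 60, 70, 80, 54, 64)
)

# Still grouped by heart_self after summarizing
stillGrouped <- toy %>%
  group_by(heart_self, sex) %>%
  summarize(meanAge = mean(age))
group_vars(stillGrouped)

# Add ungroup() to get a plain, ungrouped result
flat <- toy %>%
  group_by(heart_self, sex) %>%
  summarize(meanAge = mean(age)) %>%
  ungroup()
group_vars(flat)
```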

---
name: rand

# Pro Tips
As you can see, the suite of `tidyverse` packages can be really, really helpful! Some things to keep in mind:
  - You can put a non-tidyverse function into one of these code chunks (paragraphs)
    - If you do this, you sometimes need to give the function an input argument. Use the `.` for this. 
    - Ex: `midus %>% na.omit(.)`
  - You can have as many functions in each paragraph as you want. Just remember that everything is *sequential*!
    - If the output of your paragraph isn't what you think it should be, go line by line until you find the problem. Do NOT include the `%>%` when you run the line of code, though! R will wait for you to finish your sentence...
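
Here is a sketch of the `.` placeholder in action, using made-up `toy` data. The second example uses `lm()`, chosen only because its first argument is a formula rather than the data, so the `.` tells the pipe where the data.frame should go.

```r
library(dplyr)

# Hypothetical data with some missing values
toy <- data.frame(
  age = c(50, 60, NA, 70),
  BMI = c(25, 27, 30, NA)
)

# The . stands for "whatever came down the pipe"
cleaned <- toy %>%
  na.omit(.)
nrow(cleaned)

# When the data is NOT the first argument, the . says where it belongs
fit <- toy %>%
  lm(BMI ~ age, data = .)
coef(fit)
```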
---

# Other useful `dplyr` functions

- `recode()` is great for recoding variables. I especially like this for when you have something like `1` and `2` reflecting categorical variables. Recode them into something more meaningful! This is often nested within a `mutate()` or `mutate_at()` function.
- `rename()` for renaming columns.
- `arrange()` will order the rows of a data.frame by some column.
- `n_distinct()` finds the number of unique entries. For example, if you have "male" and "female", the result of `n_distinct()` should be 2, even if there are thousands of rows. If one of those rows has a spelling error (e.g., "feemale"), the result will be 3, which should let you know there's a problem.
- lots & lots of others...
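
A quick sketch that strings several of these together on a made-up `toy` data.frame. The coding `1` = male and `2` = female is an assumption for illustration only.

```r
library(dplyr)

# Hypothetical data where sex is stored as 1s and 2s
toy <- data.frame(
  id  = 1:4,
  sex = c(1, 2, 2, 1),
  age = c(61, 52, 78, 49)
)

tidied <- toy %>%
  mutate(sex = recode(sex, `1` = "male", `2` = "female")) %>%  # numbers become labels
  rename(ID = id) %>%                                          # new name = old name
  arrange(age)                                                 # youngest first

tidied

# Two categories, as expected
n_distinct(tidied$sex)

# A typo shows up as an extra category
n_distinct(c("male", "female", "feemale"))
```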