Practice Set: Regression

This time we will work with the palmer penguins dataset.

If you want to install this on your own computer, you can run the following code. It is already installed for the purposes of this website:

install.packages("palmerpenguins")

Let’s take a look at our dataset!

# store it as an object so we can work with it
penguins <- palmerpenguins::penguins
kable(head(penguins))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Continuous Predictor

You try!

Write a regression model and print the output for body mass in penguins predicted by bill length

Let’s interpret those coefficients!

  • What does the estimate for the intercept mean?
  • What does the bill length estimate mean?
  • How could you calculate the \(t\)-statistic if it was not part of this output?

You try!

Now we’re going to do the following:

  • Make the output prettier
  • Add on extra stuff, like predicted values
  • Look at fit statistics

What if our units are hard to interpret? Maybe we want to examine things in the context of standard deviation. No problem! We can standardize things!

mod2 <- lm(scale(body_mass_g) ~ scale(bill_length_mm),
           data = penguins)
summary(mod2)
## 
## Call:
## lm(formula = scale(body_mass_g) ~ scale(bill_length_mm), data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19724 -0.55736  0.04064  0.57648  2.04109 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.793e-16  4.352e-02    0.00        1    
## scale(bill_length_mm) 5.951e-01  4.358e-02   13.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8048 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.3542, Adjusted R-squared:  0.3523 
## F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

For a 1-SD increase in bill length, we expect weight to increase by \(b1\).

Categorical Predictor

Now what happens when we enter a categorical variable as a predictor?

You try!

Create a new model called mod3 that looks at how body mass differs by whether penguins are male and female

Let’s interpret those coefficients! - What does the intercept mean? - What does the slope mean?

Let’s get more familiar with OLS estimation and play a fun guessing game

https://sophieehill.shinyapps.io/eyeball-regression/

Reporting Regression

There are many ways of reporting your models - R conveniently has some packages that make publishable tables for you. Here is an example

#install.packages("apaTables")
mod1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
apaTables::apa.reg.table(mod1)
## 
## 
## Regression results using body_mass_g as the criterion
##  
## 
##       Predictor       b          b_95%_CI beta  beta_95%_CI sr2 sr2_95%_CI
##     (Intercept)  362.31 [-195.02, 919.64]                                 
##  bill_length_mm 87.42**   [74.82, 100.01] 0.60 [0.51, 0.68] .35 [.28, .42]
##                                                                           
##                                                                           
##                                                                           
##      r             Fit
##                       
##  .60**                
##            R2 = .354**
##        95% CI[.28,.42]
##                       
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 
sjPlot::tab_model(mod1)
  body_mass_g
Predictors Estimates CI p
(Intercept) 362.31 -195.02 – 919.64 0.202
bill length mm 87.42 74.82 – 100.01 <0.001
Observations 342
R2 / R2 adjusted 0.354 / 0.352

Something that was missing from the lm output – Confidence Intervals. Journals often want them reported so here is a quick way of getting them (if not using a table package that gets them for you automatically)

confint(mod1)
##                     2.5 %   97.5 %
## (Intercept)    -195.02364 919.6371
## bill_length_mm   74.82279 100.0078

When plotting using ggplot2 you’ll also get a confidence band

ggplot(penguins, aes(y = body_mass_g, x = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Confidence Band",
       x = "Bill Length (mm)",
       y = "Body Mass (g)")

What do you notice in this plot?

Lastly, confidence bands are a little different from prediction bands. A confidence band is basically your confidence interval around the regression line. A prediction band:

  • We are predicting and individual score, not the \(\hat{Y}\) for a particular level of \(X\)
  • Because there is greater variation in predicting an individual value rather than a collection of individual values (i.e., the mean) the prediction band is greater
penguinsSmall = penguins %>% 
  select(body_mass_g, bill_length_mm) %>% 
  drop_na()

temp_var <- predict(mod1, interval = "prediction")
new_df <- cbind(penguinsSmall, temp_var)
ggplot(new_df, aes(y = body_mass_g,
                           x = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = "blue", alpha = 0.1) +
  theme_classic() +
  labs(title = "Prediction Band vs. Confidence Band",
       x = "Bill Length (mm)",
       y = "Body Mass (g)")

Massive shout out to the Sping 2022 AI Tabea Springstein for creating this excellent Practice Set!