Practice Set: Regression

Continuous Predictor
Categorical Predictor
Reporting Regression

This time we will work with the palmer penguins dataset.

If you want to install this on your own computer, you can run the following code. It is already installed for the purposes of this website:

install.packages("palmerpenguins")

Let’s take a look at our dataset!

# store it as an object so we can work with it
penguins <- palmerpenguins::penguins
kable(head(penguins))

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

Continuous Predictor

You try!

Write a regression model and print the output for body mass in penguins predicted by bill length

Let’s interpret those coefficients!

What does the estimate for the intercept mean?
What does the bill length estimate mean?
How could you calculate the \(t\)-statistic if it was not part of this output?

You try!

Now we’re going to do the following:

Make the output prettier
Add on extra stuff, like predicted values
Look at fit statistics

What if our units are hard to interpret? Maybe we want to examine things in the context of standard deviation. No problem! We can standardize things!

mod2 <- lm(scale(body_mass_g) ~ scale(bill_length_mm),
           data = penguins)
summary(mod2)

## 
## Call:
## lm(formula = scale(body_mass_g) ~ scale(bill_length_mm), data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19724 -0.55736  0.04064  0.57648  2.04109 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.793e-16  4.352e-02    0.00        1    
## scale(bill_length_mm) 5.951e-01  4.358e-02   13.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8048 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.3542, Adjusted R-squared:  0.3523 
## F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

For a 1-SD increase in bill length, we expect weight to increase by \(b1\).

Categorical Predictor

Now what happens when we enter a categorical variable as a predictor?

You try!

Create a new model called mod3 that looks at how body mass differs by whether penguins are male and female

Let’s interpret those coefficients! - What does the intercept mean? - What does the slope mean?

Let’s get more familiar with OLS estimation and play a fun guessing game

https://sophieehill.shinyapps.io/eyeball-regression/

Reporting Regression

There are many ways of reporting your models - R conveniently has some packages that make publishable tables for you. Here is an example

#install.packages("apaTables")
mod1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
apaTables::apa.reg.table(mod1)

## 
## 
## Regression results using body_mass_g as the criterion
##  
## 
##       Predictor       b          b_95%_CI beta  beta_95%_CI sr2 sr2_95%_CI
##     (Intercept)  362.31 [-195.02, 919.64]                                 
##  bill_length_mm 87.42**   [74.82, 100.01] 0.60 [0.51, 0.68] .35 [.28, .42]
##                                                                           
##                                                                           
##                                                                           
##      r             Fit
##                       
##  .60**                
##            R2 = .354**
##        95% CI[.28,.42]
##                       
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
##

sjPlot::tab_model(mod1)

	body_mass_g
Predictors	Estimates	CI	p
(Intercept)	362.31	-195.02 – 919.64	0.202
bill length mm	87.42	74.82 – 100.01	<0.001
Observations	342
R² / R² adjusted	0.354 / 0.352

Something that was missing from the lm output – Confidence Intervals. Journals often want them reported so here is a quick way of getting them (if not using a table package that gets them for you automatically)

confint(mod1)

##                     2.5 %   97.5 %
## (Intercept)    -195.02364 919.6371
## bill_length_mm   74.82279 100.0078

When plotting using ggplot2 you’ll also get a confidence band

ggplot(penguins, aes(y = body_mass_g, x = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Confidence Band",
       x = "Bill Length (mm)",
       y = "Body Mass (g)")

What do you notice in this plot?

Lastly, confidence bands are a little different from prediction bands. A confidence band is basically your confidence interval around the regression line. A prediction band:

We are predicting and individual score, not the \(\hat{Y}\) for a particular level of \(X\)
Because there is greater variation in predicting an individual value rather than a collection of individual values (i.e., the mean) the prediction band is greater

penguinsSmall = penguins %>% 
  select(body_mass_g, bill_length_mm) %>% 
  drop_na()

temp_var <- predict(mod1, interval = "prediction")
new_df <- cbind(penguinsSmall, temp_var)
ggplot(new_df, aes(y = body_mass_g,
                           x = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = "blue", alpha = 0.1) +
  theme_classic() +
  labs(title = "Prediction Band vs. Confidence Band",
       x = "Bill Length (mm)",
       y = "Body Mass (g)")

Massive shout out to the Sping 2022 AI Tabea Springstein for creating this excellent Practice Set!