Let’s begin by loading the packages we’ll need to get started.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
options(scipen = 4)  # bias printed output away from scientific notation
We’ll begin by doing all the same data processing as in previous lectures.
# Load data from MASS into a tibble
# Rename variables
# Recode categorical variables
birthwt <- as_tibble(MASS::birthwt) %>%
  rename(birthwt.below.2500 = low,
         mother.age = age,
         mother.weight = lwt,
         mother.smokes = smoke,
         previous.prem.labor = ptl,
         hypertension = ht,
         uterine.irr = ui,
         physician.visits = ftv,
         birthwt.grams = bwt) %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
  mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))
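As a quick sanity check (an addition to the original processing steps), we can cross-tabulate a couple of the recoded factors to confirm the recoding worked:
# Cross-tabulate two of the recoded factor variables
birthwt %>% count(race, mother.smokes)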
In the case of quantitative predictors, we’re more or less comfortable with the interpretation of the linear model coefficient as a “slope” or a “unit increase in outcome per unit increase in the covariate”. This isn’t the right interpretation for factor variables. In particular, the notion of a slope or unit change no longer makes sense when talking about a categorical variable. E.g., what does it even mean to say “unit increase in major” when studying the effect of college major on future earnings?
To understand what the coefficients really mean, let’s go back to the birthwt data and try regressing birthweight on mother’s race and mother’s age.
# Fit regression model
birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt)
# Regression model summary
summary(birthwt.lm)
##
## Call:
## lm(formula = birthwt.grams ~ race + mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2131.57 -488.02 -1.16 521.87 1757.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2949.979 255.352 11.553 <2e-16 ***
## raceblack -365.715 160.636 -2.277 0.0240 *
## raceother -285.466 115.531 -2.471 0.0144 *
## mother.age 6.288 10.073 0.624 0.5332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 715.7 on 185 degrees of freedom
## Multiple R-squared: 0.05217, Adjusted R-squared: 0.0368
## F-statistic: 3.394 on 3 and 185 DF, p-value: 0.01909
Note that there are two coefficients estimated for the race variable (raceblack and raceother). What’s happening here?
When you put a factor variable into a regression, you’re allowing a different intercept at every level of the factor. In the present example, you’re saying that you want to model birthwt.grams as
\[
y = \beta_0 + \beta_{\text{black}} \cdot \mathbb{1}(\text{race} = \text{black}) + \beta_{\text{other}} \cdot \mathbb{1}(\text{race} = \text{other}) + \beta_{\text{age}} \times \text{age}
\]
We can rewrite this more succinctly as: \[
y = \text{Intercept}_{race} + \beta \times \text{age}
\]
Essentially you’re saying that your data is broken down into 3 racial groups, and you want to model your data as having the same slope governing how birthweight changes with mother’s age, but potentially different intercepts. Here’s a picture of what’s happening.
# Calculate race-specific intercepts
# (levels(birthwt$race) is c("white", "black", "other"), so white is the baseline)
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = rep(coef(birthwt.lm)["mother.age"], 3),
                       race = levels(birthwt$race))

qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts,
                  slope = slopes,
                  color = race), data = lines.df)
How do we interpret the 2 race coefficients? For categorical variables, the interpretation is relative to the given baseline. The baseline is just whatever level comes first (here, “white”). E.g., the estimate of raceother means that the estimated intercept is 285.5g lower among “other” race mothers compared to white mothers. Similarly, the estimated intercept is 365.7g lower for black mothers than for white mothers.
Another way of putting it: Among mothers of the same age, babies of black mothers are born on average weighing 365.7g less than babies of white mothers.
As you’ve already noticed, there is no coefficient called “racewhite” in the estimated model. This is because this coefficient gets absorbed into the overall (Intercept) term.
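If you’d rather report effects relative to a different baseline, you can relevel the factor before fitting. Here’s a minimal sketch (the object name birthwt.releveled is ours) that refits the same model with “black” as the reference level:
# Refit the model with "black" as the baseline race level
birthwt.releveled <- birthwt %>%
  mutate(race = relevel(race, ref = "black"))
coef(lm(birthwt.grams ~ race + mother.age, data = birthwt.releveled))
The fitted lines are identical; only the parameterization (which level gets absorbed into the intercept) changes.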
Let’s peek under the hood. Using the model.matrix() function on our linear model object, we can get the data matrix that underlies our regression. Here are the first 20 rows.
head(model.matrix(birthwt.lm), 20)
## (Intercept) raceblack raceother mother.age
## 1 1 1 0 19
## 2 1 0 1 33
## 3 1 0 0 20
## 4 1 0 0 21
## 5 1 0 0 18
## 6 1 0 1 21
## 7 1 0 0 22
## 8 1 0 1 17
## 9 1 0 0 29
## 10 1 0 0 26
## 11 1 0 1 19
## 12 1 0 1 19
## 13 1 0 1 22
## 14 1 0 1 30
## 15 1 0 0 18
## 16 1 0 0 18
## 17 1 1 0 15
## 18 1 0 0 25
## 19 1 0 1 20
## 20 1 0 0 28
Even though we think of the regression birthwt.grams ~ race + mother.age as being a regression on two variables (and an intercept), it’s actually a regression on 3 variables (and an intercept). This is because the race variable gets represented as two dummy variables: one for race == black and the other for race == other.
Why isn’t there a column representing the indicator of race == white? This gets back to our collinearity issue. By definition, we have that
\[
\mathbb{1}(\text{race} = \text{white}) + \mathbb{1}(\text{race} = \text{black}) + \mathbb{1}(\text{race} = \text{other}) = 1 = \text{(Intercept)}
\]
This is because for every observation, one and only one of the race dummy variables will equal 1. Thus the group of 4 variables {racewhite, raceblack, raceother, (Intercept)} is perfectly collinear, and we can’t include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, racewhite), and to keep the rest.
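We can check this identity numerically from the model matrix. A quick sketch (the X and racewhite objects below are ours, not part of the original notes): the dropped dummy equals the intercept column minus the two retained dummies.
# Reconstruct the dropped dummy and verify the collinearity
X <- model.matrix(birthwt.lm)
racewhite <- as.numeric(birthwt$race == "white")
all(racewhite == X[, "(Intercept)"] - X[, "raceblack"] - X[, "raceother"])  # TRUE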
Let’s go back to the regression line plot we generated above.
qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts,
                  slope = slopes,
                  color = race), data = lines.df)
We have seen similar plots before by using the geom_smooth or stat_smooth commands in ggplot. Compare the plot above to the following.
qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
In this case we have not only race-specific intercepts, but also race-specific slopes. The plot above corresponds to the model:
\[
y = \beta_0 + \beta_{\text{black}} \cdot \mathbb{1}(\text{race} = \text{black}) + \beta_{\text{other}} \cdot \mathbb{1}(\text{race} = \text{other}) + \beta_{\text{age}} \times \text{age} + \beta_{\text{black:age}} \cdot \mathbb{1}(\text{race} = \text{black}) \times \text{age} + \beta_{\text{other:age}} \cdot \mathbb{1}(\text{race} = \text{other}) \times \text{age}
\]
We can rewrite this more succinctly as: \[ y = \text{Intercept}_{race} + \beta_{race}\times \text{age} \]
To specify this interaction model in R, we use the following syntax:
birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt)
summary(birthwt.lm.interact)
##
## Call:
## lm(formula = birthwt.grams ~ race * mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.35 -474.23 13.48 523.86 1496.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2583.54 321.52 8.035 1.11e-13 ***
## raceblack 1022.79 694.21 1.473 0.1424
## raceother 326.05 545.30 0.598 0.5506
## mother.age 21.37 12.89 1.658 0.0991 .
## raceblack:mother.age -62.54 30.67 -2.039 0.0429 *
## raceother:mother.age -26.03 23.20 -1.122 0.2633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.7 on 183 degrees of freedom
## Multiple R-squared: 0.07541, Adjusted R-squared: 0.05015
## F-statistic: 2.985 on 5 and 183 DF, p-value: 0.01291
We now have new terms appearing. Terms like raceblack:mother.age are deviations from the baseline slope (the coefficient of mother.age in the model) in the same way that terms like raceblack are deviations from the baseline intercept. This model says that:
On average among white mothers (the baseline group), every additional year of age is associated with a 21.4g increase in the birthweight of the baby.
To get the slope for black mothers, we need to add the interaction term to the baseline slope.
\[ \begin{aligned} \text{slope}_{black} &= \beta_{mother.age} + \beta_{raceblack:mother.age} \\ &= 21.37 + (-62.54) \\ &= -41.17 \end{aligned} \]
This slope estimate is negative, which agrees with the regression plot above.
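Rather than typing in the rounded numbers, we can compute this slope directly from the fitted coefficients (a convenience sketch; the coefs object is ours):
# Slope for black mothers = baseline slope + interaction deviation
coefs <- coef(birthwt.lm.interact)
coefs["mother.age"] + coefs["raceblack:mother.age"]  # approximately -41.2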
Last class we considered modeling birthweight as a linear function of mother’s age, allowing for a race-specific intercept for each of the three race categories. This model is fit again below.
birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt)
Here’s a visualization of the model fit that we wound up with. Note that while there are 3 lines shown, this is a visualization of just one model: birthwt.grams ~ race + mother.age. This model produces 3 lines because the coefficients of the race variable result in different intercepts.
# Calculate race-specific intercepts
# (levels(birthwt$race) is c("white", "black", "other"), so white is the baseline)
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = rep(coef(birthwt.lm)["mother.age"], 3),
                       race = levels(birthwt$race))

qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts,
                  slope = slopes,
                  color = race), data = lines.df)
At this stage we may be interested in assessing whether the race variable is statistically significant. That is, does including the race variable significantly improve the fit of our model, or is the simpler model birthwt.grams ~ mother.age just as good?
Essentially, we want to know if the race-specific intercepts capture significantly more variation in the outcome (birthweight) than the single intercept model, or if allowing for different intercepts isn’t doing much more than capturing random fluctuations in the data.
Here’s a picture of the two models we’re comparing:
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot.complex <- qplot(x = mother.age, y = birthwt.grams,
                      color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts,
                  slope = slopes,
                  color = race), data = lines.df)

# Single intercept model (birthwt.grams ~ mother.age)
p <- ggplot(birthwt, aes(x = mother.age, y = birthwt.grams))
plot.simple <- p + geom_point(aes(colour = race)) + stat_smooth(method = "lm")

grid.arrange(plot.complex, plot.simple, ncol = 2)
To test this hypothesis, we use the anova function (not to be confused with the aov function). This function compares two nested models, accounting for their residual sums of squares (how well they fit the data) and their complexity (how many more variables are in the larger model) to assess statistical significance.
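Concretely, the F statistic that anova computes for two nested models is
\[
F = \frac{(\mathrm{RSS}_{simple} - \mathrm{RSS}_{complex}) / (df_{simple} - df_{complex})}{\mathrm{RSS}_{complex} / df_{complex}},
\]
where RSS is the residual sum of squares and df is the residual degrees of freedom of each model.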
# Fit the simpler model with mother.age as the only predictor
birthwt.lm.simple <- lm(birthwt.grams ~ mother.age, data = birthwt)
# Compare to more complex model
anova(birthwt.lm.simple, birthwt.lm)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ mother.age
## Model 2: birthwt.grams ~ race + mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 187 99154173
## 2 185 94754346 2 4399826 4.2951 0.01502 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output tells us that the race variable is statistically significant: it is unlikely that the improvement in fit when we add the race variable is simply due to random fluctuations in the data. Thus it is important to consider race when modeling how birthweight depends on the mother’s age.
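As a quick sanity check (an addition to the original analysis), we can reproduce the F statistic in the table by hand from the two residual sums of squares:
# Recompute the F statistic from the two fitted models
rss.simple  <- sum(resid(birthwt.lm.simple)^2)  # 187 residual df
rss.complex <- sum(resid(birthwt.lm)^2)         # 185 residual df
((rss.simple - rss.complex) / (187 - 185)) / (rss.complex / 185)  # 4.2951, matching anova()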
Assessing significance of interaction terms operates on the same principle. We once again ask whether the improvement in model fit is worth the increased complexity of our model. For instance, consider the example we saw last class, where we allowed for a race-specific slope in addition to the race-specific intercept from before.
birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt)
summary(birthwt.lm.interact)
##
## Call:
## lm(formula = birthwt.grams ~ race * mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.35 -474.23 13.48 523.86 1496.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2583.54 321.52 8.035 1.11e-13 ***
## raceblack 1022.79 694.21 1.473 0.1424
## raceother 326.05 545.30 0.598 0.5506
## mother.age 21.37 12.89 1.658 0.0991 .
## raceblack:mother.age -62.54 30.67 -2.039 0.0429 *
## raceother:mother.age -26.03 23.20 -1.122 0.2633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.7 on 183 degrees of freedom
## Multiple R-squared: 0.07541, Adjusted R-squared: 0.05015
## F-statistic: 2.985 on 5 and 183 DF, p-value: 0.01291
Here’s a side-by-side visual comparison of the race + mother.age model and the race + mother.age + race:mother.age interaction model.
plot.interact <- qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
grid.arrange(plot.interact, plot.complex, ncol = 2)
So, do the lines with different slopes fit the data significantly better than the common slope model? Let’s compare the two with the anova() function.
anova(birthwt.lm, birthwt.lm.interact)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ race + mother.age
## Model 2: birthwt.grams ~ race * mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 185 94754346
## 2 183 92431148 2 2323199 2.2998 0.1032
This p-value turns out not to be statistically significant. So even though the estimated slopes in the interaction model look very different, our estimates are quite variable: we don’t have enough evidence to conclude that the interaction terms (different slopes) provide significant additional explanatory power over the simpler race + mother.age model.
The testing strategy above applies to any two nested models. Here’s an example where we add in a few more variables and see how the result compares to the race + mother.age model from earlier.
birthwt.lm.complex <- lm(birthwt.grams ~ mother.smokes + physician.visits + race + mother.age, data = birthwt)
summary(birthwt.lm.complex)
##
## Call:
## lm(formula = birthwt.grams ~ mother.smokes + physician.visits +
## race + mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2335.06 -455.16 31.74 499.29 1623.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3282.407 261.326 12.561 < 2e-16 ***
## mother.smokesyes -424.651 110.371 -3.847 0.000165 ***
## physician.visits 14.391 48.953 0.294 0.769102
## raceblack -444.340 156.586 -2.838 0.005057 **
## raceother -445.161 119.666 -3.720 0.000265 ***
## mother.age 1.547 9.996 0.155 0.877155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 691.7 on 183 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: 0.1001
## F-statistic: 5.184 on 5 and 183 DF, p-value: 0.000179
Let’s compare to our earlier model:
anova(birthwt.lm, birthwt.lm.complex)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ race + mother.age
## Model 2: birthwt.grams ~ mother.smokes + physician.visits + race + mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 185 94754346
## 2 183 87567280 2 7187067 7.5098 0.0007336 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highly significant! This is probably due to the fact that mother’s smoking status has a very strong association with birthweight.
gapminder <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt",
                        delim = "\t")  # Load data
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## pop = col_double(),
## continent = col_character(),
## lifeExp = col_double(),
## gdpPercap = col_double()
## )
gapminder # A look at the data
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## 7 Afghanistan 1982 12881816 Asia 39.9 978.
## 8 Afghanistan 1987 13867957 Asia 40.8 852.
## 9 Afghanistan 1992 16317921 Asia 41.7 649.
## 10 Afghanistan 1997 22227415 Asia 41.8 635.
## # … with 1,694 more rows
Before diving into a regression, let’s look at some data summaries that use some familiar functions. Here’s how we can form a table showing the maximum life expectancy on each continent each year, and the country that attained that maximum.
gapminder %>%
  group_by(continent, year) %>%
  summarize(max.life.exp = max(lifeExp),
            country = country[which.max(lifeExp)])
## (output truncated: one row per continent and year, 60 rows in all, with columns
##  continent, year, max.life.exp, country; e.g., the overall maximum life
##  expectancy of 82.603 is attained by Japan)
We’re now going to go through an example where we get the life expectancy in 1952 and the rate of change in life expectancy over time for each country. The rate of change will be obtained by regressing lifeExp on year.
Let’s start with the data for a single country.
country.name <- "Ireland" # Pick a country
gapminder.sub <- filter(gapminder, country == country.name) # Pull data for this country
gapminder.sub
## # A tibble: 12 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Ireland 1952 2952156 Europe 66.9 5210.
## 2 Ireland 1957 2878220 Europe 68.9 5599.
## 3 Ireland 1962 2830000 Europe 70.3 6632.
## 4 Ireland 1967 2900100 Europe 71.1 7656.
## 5 Ireland 1972 3024400 Europe 71.3 9531.
## 6 Ireland 1977 3271900 Europe 72.0 11151.
## 7 Ireland 1982 3480000 Europe 73.1 12618.
## 8 Ireland 1987 3539900 Europe 74.4 13873.
## 9 Ireland 1992 3557761 Europe 75.5 17559.
## 10 Ireland 1997 3667233 Europe 76.1 24522.
## 11 Ireland 2002 3879155 Europe 77.8 34077.
## 12 Ireland 2007 4109086 Europe 78.9 40676.
# Scatterplot of life exp vs year
# with a regression line overlaid
qplot(year, lifeExp, data = gapminder.sub,
      main = paste("Life expectancy in", country.name)) +
  geom_smooth(method = "lm")
We can confirm that it’s a pretty good model for other countries as well, though not for all of them:
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
  facet_wrap(~ country) +
  geom_point() +
  geom_smooth(method = "lm")
Now let’s fit a regression and extract the slope.
life.exp.lm <- lm(lifeExp ~ year, data = gapminder.sub) # Fit model
coef(life.exp.lm) # Get coefficients
## (Intercept) year
## -321.1399594 0.1991196
coef(life.exp.lm)["year"] # The slope that we wanted
## year
## 0.1991196
Here’s how we can do everything in one line:
coef(lm(lifeExp ~ year, data = gapminder.sub))["year"]
## year
## 0.1991196
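Since year is measured in calendar years, multiplying the slope by 10 gives a per-decade rate (a small convenience, not in the original notes):
# Life expectancy gained per decade in Ireland
coef(life.exp.lm)["year"] * 10  # roughly 2 years per decade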
Here we’ll extract the slope in the way we practiced above, and we’ll also extract the “origin” life expectancy: the given country’s life expectancy in 1952, the first year of our data. For the purpose of plotting we’ll also want the continent information, so we’ll capture that in this call too.
progress.df <- gapminder %>%
  group_by(country) %>%
  summarize(continent = continent[1],
            origin = lifeExp[year == 1952],
            slope = lm(lifeExp ~ year)$coef["year"])
progress.df
## # A tibble: 142 x 4
## country continent origin slope
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan Asia 28.8 0.275
## 2 Albania Europe 55.2 0.335
## 3 Algeria Africa 43.1 0.569
## 4 Angola Africa 30.0 0.209
## 5 Argentina Americas 62.5 0.232
## 6 Australia Oceania 69.1 0.228
## 7 Austria Europe 66.8 0.242
## 8 Bahrain Asia 50.9 0.468
## 9 Bangladesh Asia 37.5 0.498
## 10 Belgium Europe 68 0.209
## # … with 132 more rows
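An equivalent way to build this table is to nest the per-country data and map a model over it. Here’s a sketch using tidyr::nest() and purrr::map_dbl(), both loaded with the tidyverse (the name progress.df.nested is ours):
# Nest everything except country/continent, then fit one lm per nested tibble
progress.df.nested <- gapminder %>%
  nest(data = -c(country, continent)) %>%
  mutate(origin = map_dbl(data, ~ .x$lifeExp[.x$year == 1952]),
         slope = map_dbl(data, ~ coef(lm(lifeExp ~ year, data = .x))["year"]))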
Let’s summarize our findings by creating a bar chart for the origin and slope, with the bars colored by continent. We’ll do this in ggplot. This code is analogous to that presented at the end of Lecture 6.
# Reorder country factor by origin
# Construct bar chart
progress.df %>%
  mutate(country = reorder(country, origin)) %>%
  ggplot(aes(x = country, y = origin, fill = continent)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
# Reorder country factor by slope
# Construct bar chart
progress.df %>%
  mutate(country = reorder(country, slope)) %>%
  ggplot(aes(x = country, y = slope, fill = continent)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
These are very interesting plots. What can you tell from looking at them?
Let’s start by looking at some plots of how GDP per capita varied by year.
# Use qplot from ggplot2 to generate plots
qplot(year, gdpPercap, facets = ~ country, data = gapminder, colour = continent) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
What if we want to rearrange the plots by continent? This can be done by changing the order of the levels of the country factor.
# First step: reorder the countries by continent
# Produce a data frame that has the countries ordered alphabetically within continent
# Arrange sorts the data according to the variable(s) provided
country.df <- gapminder %>%
  group_by(country) %>%
  summarize(continent = continent[1]) %>%
  arrange(continent)

gapminder.ordered <- gapminder %>%
  mutate(country = factor(country, levels = country.df$country))
# Let's make sure that things are now ordered correctly...
levels(gapminder.ordered$country)
## [1] "Algeria" "Angola" "Benin" "Botswana" ...
## (142 levels in total, ordered alphabetically within continent; output truncated)
# Use qplot from ggplot2 to generate plots
qplot(year, gdpPercap, facets = ~ country, data = gapminder.ordered, colour = continent) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  stat_smooth(method = "lm")