In today’s Lab you will gain practice with the following concepts from today’s class:
- Interpreting linear regression coefficients of numeric covariates
- Interpreting linear regression coefficients of categorical variables
- Applying the “2 standard error rule” to construct approximate 95% confidence intervals for regression coefficients
- Using the
confint
command to construct confidence intervals for regression coefficients- Using
pairs
plots to diagnose collinearity- Using the
update
command to update a linear regression model object- Diagnosing violations of linear model assumptions using
plot
We’ll begin by loading some packages.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.3.2 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
Cars93 <- as_tibble(MASS::Cars93)
# If you want to experiment with the ggpairs command,
# you'll want to run the following code:
# install.packages("GGally")
# library(GGally)
(a) Use the lm()
function to regress Price on: EngineSize, Origin, MPG.highway, MPG.city and Horsepower.
cars.lm <- lm(Price ~ EngineSize + Origin + MPG.highway + MPG.city + Horsepower,
data = Cars93)
(b) Use the kable()
command to produce a nicely formatted coefficients table. Ensure that values are rounded to an appropriate number of decimal places.
kable(summary(cars.lm)$coef, digits = c(3, 3, 3, 4), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 3.470 | 6.224 | 0.558 | 0.5786 |
EngineSize | 1.458 | 1.043 | 1.398 | 0.1656 |
Originnon-USA | 4.714 | 1.393 | 3.385 | 0.0011 |
MPG.highway | -0.006 | 0.346 | -0.017 | 0.9868 |
MPG.city | -0.252 | 0.367 | -0.687 | 0.4939 |
Horsepower | 0.109 | 0.019 | 5.808 | 0.0000 |
(c) Interpret the coefficient of Originnon-USA
. Is it statistically significant?
The coefficient of
Originnon-USA
is 4.71. This indicates that, all else in the model held constant, vehicles manufactured outside of the USA carry a price tag that is on average $4.71 thousand dollars higher than vehicles manufactered in the US. The coefficient is statistically significant at the 0.05 level.
(d) Interpret the coefficient of MPG.highway
. Is it statistically significant?
The coefficient of
MPG.highway
is -0.006, which is close to 0 numerically and is not statistically significant. Holding all else in the model constant, MPG.highway does not appear to have much association with Price.
(d) Use the “2 standard error rule” to construct an approximate 95% confidence interval for the coefficient of MPG.highway
. Compare this to the 95% CI obtained by using the confint
command.
est <- coef(cars.lm)["MPG.highway"]
se <- summary(cars.lm)$coef["MPG.highway", "Std. Error"]
# 2se rule confidence interval:
c(est - 2 *se, est + 2 * se)
## MPG.highway MPG.highway
## -0.6983730 0.6868413
# confint approach
confint(cars.lm, parm = "MPG.highway")
## 2.5 % 97.5 %
## MPG.highway -0.6940817 0.68255
We see that the confidence intervals obtained via the two different approaches are essentially identical.
(e) Run the pairs
command on the following set of variables: EngineSize, MPG.highway, MPG.city and Horsepower. Display correlations in the Do you observe any collinearities?
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.4/strwidth(txt)
text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r))
}
pairs(Cars93[,c("EngineSize", "MPG.highway", "MPG.city", "Horsepower")],
lower.panel = panel.cor)
The MPG.highway and MPG.city variables are very highly correlated.
(f) Use the update
command to update your regression model to exclude EngineSize
and MPG.city
. Display the resulting coefficients table nicely using the kable()
command.
cars.lm2 <- update(cars.lm, . ~ . - EngineSize - MPG.city)
kable(summary(cars.lm2)$coef, digits = c(3, 3, 3, 4), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 7.513 | 5.705 | 1.317 | 0.1913 |
Originnon-USA | 3.484 | 1.208 | 2.885 | 0.0049 |
MPG.highway | -0.287 | 0.145 | -1.986 | 0.0501 |
Horsepower | 0.130 | 0.015 | 8.946 | 0.0000 |
(g) Does the coefficient of MPG.highway
change much from the original model? Calculate a 95% confidence interval and compare your answer to part (d). Does the CI change much from before? Explain.
# old
confint(cars.lm, parm = "MPG.highway")
## 2.5 % 97.5 %
## MPG.highway -0.6940817 0.68255
# new
confint(cars.lm2, parm = "MPG.highway")
## 2.5 % 97.5 %
## MPG.highway -0.5741476 0.0001540153
Both the estimate and the confidence interval change greatly. When we remove the highly collinear MPG.city variable from the model, the coefficient of MPG.highway increases (in magnitude). We also get a much narrower confidence interval, indicating that we are able to more precisely estimate the coefficient of MPG.highway.
(h) Run the plot
command on the linear model you constructed in part (f). Do you notice any issues?
par(mfrow = c(2, 2))
plot(cars.lm2)