library(ggplot2) # graphics library
library(MASS) # contains data sets
library(ISLR) # contains code and data from the textbook
library(knitr) # contains kable() function
library(boot) # contains cross-validation functions
library(gam) # needed for additive models
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.14
options(scipen = 4) # Suppresses scientific notation
You will need the Auto data set from the ISLR library in order to complete this exercise.
Please run all of the code indicated in §5.3.1 of ISLR, even if I don’t explicitly ask you to do so in this document.
Use the View() command on the Auto data to see what the data set looks like.
#View(Auto)
Use qplot to construct a scatterplot of mpg vs horsepower. Use stat_smooth() to overlay a linear, quadratic, and cubic polynomial fit to the data.
qplot(data = Auto, x = horsepower, y = mpg) +
  stat_smooth(method = "lm", aes(colour = "Linear regression")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), aes(colour = "Quadratic fit")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 3), aes(colour = "Cubic fit")) +
  scale_colour_discrete("Model") +
  theme_bw()
Use set.seed(1) to set the seed of the random number generator. This will ensure that your answers match those in the text.
# Edit me
set.seed(1)
Use the sample() command to construct train, a vector of observation indexes to be used for the purpose of training your model.
# Edit me
train <- sample(392, 196)
With the sample(n, size) syntax, the sample function produces a random sample of size size from 1:n. Sampling is done without replacement.
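For example (a quick illustration, not part of the assignment), sample(10, 5) draws 5 distinct values from 1:10:
# Example: 5 indexes drawn without replacement from 1:10
sample(10, 5)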
Fit a linear regression of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit.
# Edit me
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
Why do we specify subset = train? The train indexes form the training set. We want to train on just the train data, not the full data.
Calculate the test MSE of lm.fit on the test set (i.e., all points that are not in train).
mse1 <- with(Auto, mean((mpg - predict(lm.fit, Auto))[-train]^2))
mse1
## [1] 26.14142
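Equivalently (a sketch, not part of the original code), the same test MSE can be computed by subsetting to the held-out rows before predicting:
# Same quantity, computed by predicting only on the held-out rows
mean((Auto$mpg[-train] - predict(lm.fit, Auto[-train, ]))^2)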
Use the poly() command to fit a quadratic regression model of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit2.
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
# Pull the p-value
coef(summary(lm.fit2))["poly(horsepower, 2)2", "Pr(>|t|)"]
## [1] 6.449276e-09
Calculate the test MSE of lm.fit2. How does it compare to the linear regression fit?
mse2 <- with(Auto, mean((mpg - predict(lm.fit2, Auto))[-train]^2))
mse2
## [1] 19.82259
The quadratic fit’s estimated test error (about 19.8) is considerably lower than the linear fit’s (about 26.1).
Use the poly() command to fit a cubic regression model of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit3.
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
# Pull the p-value
coef(summary(lm.fit3))["poly(horsepower, 3)3", "Pr(>|t|)"]
## [1] 0.6791491
mse3 <- with(Auto, mean((mpg - predict(lm.fit3, Auto))[-train]^2))
mse3
## [1] 19.78252
mse1 # Linear test error
## [1] 26.14142
mse2 # Quadratic test error
## [1] 19.82259
mse3 # Cubic test error
## [1] 19.78252
Repeat the analysis above, this time using set.seed(5). You do not have to retype all of the code. You can just change the initial set.seed() command and see what happens.
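A minimal sketch of the re-run (the same steps as above, assuming only the seed changes; your exact numbers depend on the random split):
# Re-run the validation-set comparison with a different random split
set.seed(5)
train <- sample(392, 196)
lm.fit  <- lm(mpg ~ horsepower, data = Auto, subset = train)
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
with(Auto, c(mean((mpg - predict(lm.fit, Auto))[-train]^2),
             mean((mpg - predict(lm.fit2, Auto))[-train]^2),
             mean((mpg - predict(lm.fit3, Auto))[-train]^2)))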
With set.seed(5), you are getting a different random split of the data between the training and validation sets. The estimates of test error also change, because we now have slightly different model estimates, and a different random set to test the models on. We now get an error of 22.2 for the linear model, 15.2 for the quadratic, and 15.2 for the cubic. These numbers are quite different from our estimates based on the first split we tried. However, the conclusion is essentially the same: the cubic model doesn’t seem to add much, but the quadratic is much better than the linear. We should use the quadratic model.
This exercise introduces you to the cv.glm() command from the boot library, which automates K-fold cross-validation for Generalized Linear Models (GLMs). Linear regression is one example of a GLM. Logistic regression is another. GLMs are not the same as Generalized Additive Models (GAMs).
Please run all of the code indicated in §5.3.2 of ISLR, even if I don’t explicitly ask you to do so in this document.
Use the glm command to fit a linear regression of mpg on horsepower. Call the resulting model glm.fit. Confirm that this gives the same coefficient estimates as a linear model fit with the lm command.
Note: You should fit the model to the entire data, not just to the training data.
glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
coef(lm(mpg ~ horsepower, data = Auto))
## (Intercept) horsepower
## 39.9358610 -0.1578447
# Are they the same? Yes.
identical(coef(glm.fit), coef(lm(mpg ~ horsepower, data = Auto)))
## [1] TRUE
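The two agree because glm() fits a Gaussian GLM by default, which is exactly ordinary least squares. A quick check making the family explicit (an illustrative sketch, not part of the original code):
# glm() with family = gaussian (the default) fits the same model as lm()
coef(glm(mpg ~ horsepower, data = Auto, family = gaussian))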
Construct cv.error, the vector of LOOCV error estimates for polynomials of degree 1-5.
Note: The computations take some time to run, so I’ve set cache = TRUE in the code chunk header to make sure that the code below doesn’t re-execute at knit time unless this particular chunk has changed.
cv.err <- cv.glm(Auto, glm.fit)
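# delta[1] is the raw cross-validation estimate of prediction error;
# delta[2] is the adjusted (bias-corrected) estimate (see ?cv.glm)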
cv.err$delta
## [1] 24.23151 24.23114
cv.error <- rep(0, 5)
for (i in 1:5) {
glm.fit <- glm(mpg ~ poly(horsepower, i), data=Auto)
cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
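As a sanity check (a sketch, not part of the assignment): for a least-squares fit, the LOOCV error can also be computed in closed form from the leverage values, and for the degree-1 model this should reproduce cv.error[1] (about 24.23):
# Closed-form LOOCV for a linear model: each residual is inflated by 1 / (1 - leverage)
lm.deg1 <- lm(mpg ~ horsepower, data = Auto)
mean((residuals(lm.deg1) / (1 - hatvalues(lm.deg1)))^2)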
# Form a little data frame for plotting
cv.df <- data.frame(degree = 1:5,
cv.error = cv.error)
qplot(data = cv.df, x = degree, y = cv.error, geom = "line",
ylab = "LOOCV error estimate") + geom_point()
which.min(cv.error)
## [1] 5
The degree-5 polynomial (degree = which.min(cv.error)) has the lowest LOOCV estimate of test error.
Please run all of the code indicated in §5.3.3 of ISLR.
set.seed(17)
cv.error.10 <- rep(0,10)
for (i in 1:10){
glm.fit <- glm(mpg ~ poly(horsepower, i), data=Auto)
cv.error.10[i] <- cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cv.error.10
## [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
## [8] 19.71201 18.95140 19.50196
# Form a little data frame for plotting
cv.df <- data.frame(degree = 1:10,
cv.error = cv.error.10)
qplot(data = cv.df, x = degree, y = cv.error, geom = "line",
ylab = "10-fold CV error estimate") + geom_point()
which.min(cv.error.10)
## [1] 5
Please run all of the code indicated in §7.8.3 of ISLR, up until the loading of the akima package. We have not yet studied logistic regression, so you are not asked to do the logistic regression analysis that starts at the bottom of p. 297. The material on ANOVA testing may also be unfamiliar to you. You may skip it.
# Use the lm function to fit an additive model
# (using df = 4 and df = 5 natural cubic splines)
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data=Wage)
# Use the gam function to fit an additive model with smoothing splines (df = 4, 5)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage)
# Construct a figure with 1 row, 3 columns
par(mfrow=c(1,3))
# Use the plot.gam function to get plots of the estimated f_j
# Note: plot() on a gam object is an alias for plot.gam()
plot(gam.m3, se=TRUE, col="blue")
# See above.
plot.gam(gam1, se=TRUE, col="red")
# Consider model where year doesn't appear
gam.m1 <- gam(wage ~ s(age,5) + education, data=Wage)
# Consider model where year appears linearly
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data=Wage)
# Run an analysis of variance (ANOVA) test to see if
# m2 is statistically significantly better than m1,
# and if m3 is statistically significantly better than m2
# Note: m3, m2 and m1 are nested models
anova(gam.m1, gam.m2, gam.m3, test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print-out summary of additive model
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 0.000002877 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use gam.m2 to get fitted values
preds <- predict(gam.m2, newdata=Wage)
# Try local regression for age instead of smoothing spline
gam.lo <- gam(wage ~ s(year, df=4) + lo(age, span=0.7) + education, data=Wage)
# Show plots of estimated fits
plot.gam(gam.lo, se=TRUE, col="green")
# Interaction model, where local regression is used to fit the joint effect of
# (year, age)
gam.lo.i <- gam(wage ~ lo(year, age, span=0.5) + education, data=Wage)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame,
## bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame,
## bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame,
## bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame,
## bf.maxit, : lv too small. (Discovered by lowesd)
par(mfrow=c(1,3))
# Use the plot.gam function to get plots of the estimated f_j
# Note: plot() on a gam object is an alias for plot.gam()
plot(gam.m3, se=TRUE, col="blue")
The age plot clearly indicates that the fit is not consistent with being linear. The year plot, however, shows error bands that easily accommodate a linear fit between the response and year. Thus we should use a non-linear model for age, but would be well-served by a linear model for year.
anova(gam.m1, gam.m2, gam.m3, test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The improvement in going from the ~ year + ... model to the ~ s(year, 4) + ... model is not statistically significant. The smoothing spline fit doesn’t add sufficient additional explanatory power to warrant the increase in model complexity. This agrees with our observation from part (a).
R’s smooth.spline() command (from the base stats package) has built-in cross-validated smoothness selection. We will now give an example of using this command.
agelims <- range(Wage$age)
# Start a scatterplot, to be overlaid with smoothing spline fits
par(mfrow = c(1,1))
with(Wage, plot(age, wage, xlim = agelims, cex=0.5, col = "darkgrey"))
title("Smoothing Spline")
# Fit smoothing spline model with 16 effective degrees of freedom
fit <- with(Wage, smooth.spline(age, wage, df=16))
# Fit smoothing spline, using LOOCV to figure out the optimal choice of df
fit2 <- with(Wage, smooth.spline(age, wage, cv=TRUE))
## Warning in smooth.spline(age, wage, cv = TRUE): cross-validation with non-
## unique 'x' values seems doubtful
# CV error minimizing choice of df
fit2$df
## [1] 6.794596
# Overlay df = 16 and CV error minimizing df smoothing spline fits
lines(fit, col="red", lwd=2)
lines(fit2, col="blue", lwd=2)
By looking at the documentation for the smooth.spline function, can you figure out what kind of cross-validation is done when cv = TRUE? i.e., it’s \(K\)-fold CV for what choice of \(K\)?
From the documentation for smooth.spline: “leave-one-out cross-validation (ordinary or ‘generalized’ as determined by cv) is used”. So cv = TRUE gives ordinary leave-one-out cross-validation; that is, \(K\)-fold CV with \(K = n\).
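As a quick check (a sketch, not part of the original code), the cross-validation score that smooth.spline minimized is stored in the fitted object:
# Ordinary leave-one-out CV score for the cv = TRUE fit (see ?smooth.spline)
fit2$cv.crit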