Lab 3

1. Changing the author field and file name.
2. Best Subset Selection
3. Forward and Backward Stepwise Selection
4. Choosing Among Models using the Validation Set Approach and Cross-Validation
5. The Lasso

library(ggplot2) # graphics library
library(ISLR)    # contains code and data from the textbook

## Warning: package 'ISLR' was built under R version 3.4.2

library(knitr)   # contains kable() function
library(leaps)   # for regsubsets() function
library(boot)    # for cv.glm
library(gam)

## Warning: package 'gam' was built under R version 3.4.4

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.16

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 2.0-10

options(scipen = 4)  # Suppresses scientific notation

1. Changing the author field and file name.

(a) Change the `author:` field on the Rmd document from Your Name Here to your own name.

(b) Rename this file to “lab03_YourNameHere.Rmd”, where YourNameHere is changed to your own name.

2. Best Subset Selection

This portion of the lab gets you to carry out the Lab in §6.5.1 of ISLR (Pages 244 - 247). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

You will need the Hitters data set from the ISLR library in order to complete this exercise.

Please run all of the code indicated in §6.5.1 of ISLR, even if I don’t explicitly ask you to do so in this document.

Run the `View()` command on the `Hitters` data to see what the data set looks like.

#View(Hitters)

(a) Use `qplot` to construct a histogram of the `Salary` variable. Does `Salary` appear to be normally distributed, or is the distribution skewed? What units is `Salary` recorded in?

qplot(data = Hitters, x = Salary) + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 59 rows containing non-finite values (stat_bin).

The distribution of Salary is right-skewed rather than normal. Salary denotes a player’s 1987 annual salary as recorded on the opening day of the season. This variable is measured in thousands of dollars, i.e., Salary = 1500 corresponds to a salary of $1.5 million.

(b) Below is a modified `panel.cor` function that properly handles missing values. Use the `pairs` command to construct a pairs plot for the Hitters data, displaying correlations in the lower panel and plots in the upper panel. Your pairs plot should include the variables: `Salary`, `AtBat`, `Hits`, `HmRun`, `CRBI`, `RBI`, `Errors`. Read the `?Hitters` documentation to understand what these variables mean.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y, use = "complete.obs"))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(Hitters[, c("Salary", "AtBat", "Hits", "HmRun", "CRBI", "RBI", "Errors")],
      lower.panel = panel.cor)

(c) Looking at the results from part (b), do you detect any highly correlated predictors? Based on the definitions of the variables, can you come up with an explanation for why the variables you identified wind up being highly correlated? How might highly-correlated predictors make model selection difficult?

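Hits and AtBat are highly correlated, as are RBI and HmRun; RBI is also strongly correlated with Hits and AtBat. This is not surprising: a player with more at-bats has more chances to get hits, and home runs and hits are what drive in runs. When predictors are highly correlated, models that swap one of them for another fit almost equally well, so the selected subset can change with small changes in the data, and the individual coefficients become hard to interpret.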
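
The instability caused by correlated predictors can be illustrated with a small simulation. This is only a sketch on made-up data (not the Hitters data): `x1`, `x2`, and the helper `pick_best` are hypothetical names, and only `x1` truly drives the response. Across bootstrap resamples, the "best" single predictor can flip between the two nearly collinear variables.

```r
# Simulated illustration (not the Hitters data): two nearly collinear
# predictors, of which only x1 truly drives y.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is x1 plus a little noise
y  <- 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

# For one data set, report which single predictor gives the lower RSS.
pick_best <- function(d) {
  rss1 <- deviance(lm(y ~ x1, data = d))   # deviance() of an lm is its RSS
  rss2 <- deviance(lm(y ~ x2, data = d))
  if (rss1 <= rss2) "x1" else "x2"
}

# Repeat the choice on 200 bootstrap resamples of the rows.
picks <- replicate(200, pick_best(dat[sample(n, replace = TRUE), ]))
table(picks)
cor(dat$x1, dat$x2)
```

If both names show up in the table, the selection is sensitive to which rows happen to be in the sample, which is the difficulty the question is pointing at.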

(d) Follow the ISL example of removing `NA` values using the `na.omit()` command. Then, use the `regsubsets` command to perform best subset selection. You should go up to models of size `15`.

Hitters <- na.omit(Hitters)
regfit.full <- regsubsets(Salary ~ ., Hitters, nvmax = 15)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters, nvmax = 15)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 15
## Selection Algorithm: exhaustive
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  " "  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "

reg.summary <- summary(regfit.full)

What is the best 4-variable model according to the best subset search?

coef(regfit.full, 4)

##  (Intercept)         Hits         CRBI    DivisionW      PutOuts 
##   13.9231044    2.6757978    0.6817790 -139.9538855    0.2735002
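
To make the idea of an exhaustive search concrete, here is a minimal base-R sketch on simulated data. It is not how `regsubsets` is implemented; the data, the variable names `x1` to `x5`, and the helper `best_of_size` are assumptions made for illustration. For each model size it fits every possible subset and keeps the one with the lowest RSS.

```r
# Simulated data with p = 5 predictors; only x1 and x3 matter.
set.seed(1)
n <- 80
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(X) <- paste0("x", 1:5)
X$y <- 3 * X$x1 - 2 * X$x3 + rnorm(n)

predictors <- paste0("x", 1:5)

# For a given size k, fit every k-variable model and keep the lowest-RSS one.
best_of_size <- function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  rss <- sapply(subsets, function(vars) {
    deviance(lm(reformulate(vars, response = "y"), data = X))
  })
  list(vars = subsets[[which.min(rss)]], rss = min(rss))
}

best <- lapply(1:5, best_of_size)
for (k in 1:5) {
  cat(k, "variables:", paste(best[[k]]$vars, collapse = ", "),
      " RSS =", round(best[[k]]$rss, 1), "\n")
}
```

With 5 predictors there are only 31 non-empty subsets, so brute force is cheap; with 19 predictors, as in Hitters, there are over half a million, which is why a dedicated routine is used.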

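(e) Construct plots of $R^2$, RSS, AIC and BIC for the sequence of models you obtained in the previous problem. Use the points() approach outlined in the text to also mark the criterion-optimal model on each curve. (The code below follows the text in plotting adjusted $R^2$ and $C_p$; for least squares models, $C_p$ and AIC are proportional to each other.)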

par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(reg.summary$adjr2)

## [1] 11

points(11,reg.summary$adjr2[11], col="red",cex=2,pch=20)
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp)

## [1] 10

points(10,reg.summary$cp[10],col="red",cex=2,pch=20)
which.min(reg.summary$bic)

## [1] 6

plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
points(6,reg.summary$bic[6],col="red",cex=2,pch=20)
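
As a rough check on what these criteria measure, the sketch below computes adjusted $R^2$, $C_p$, and BIC by hand for a single model, using the formulas as given in ISLR §6.1.3. It uses the built-in `mtcars` data purely for illustration, and the variable names are assumptions. The `leaps` package may report $C_p$ and BIC on a different scale, so the values are not meant to be compared with `reg.summary` directly.

```r
# Built-in mtcars data, used only for illustration.
fit  <- lm(mpg ~ wt + hp, data = mtcars)   # candidate model with d = 2 predictors
full <- lm(mpg ~ ., data = mtcars)         # full model, used to estimate sigma^2

n   <- nrow(mtcars)
d   <- length(coef(fit)) - 1               # number of predictors, excluding intercept
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
sigma2 <- summary(full)$sigma^2            # error variance estimate from the full model

adjr2 <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))
cp    <- (rss + 2 * d * sigma2) / n        # ISLR's form of Mallows' Cp
bic   <- (rss + log(n) * d * sigma2) / n   # ISLR's form of BIC

c(adjr2 = adjr2, cp = cp, bic = bic)

# The hand-computed adjusted R^2 should agree with what lm() reports.
all.equal(adjr2, summary(fit)$adj.r.squared)
```

Because `log(n)` exceeds 2 once `n` is larger than 7, BIC penalizes each extra variable more heavily than $C_p$, which is consistent with BIC picking the smallest model above.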

(f) Explain what the plots resulting from the commands below are showing.

plot(regfit.full,scale="r2")

plot(regfit.full,scale="adjr2")

plot(regfit.full,scale="Cp")

plot(regfit.full,scale="bic")

As per the documentation for plot.regsubsets: Plots a table of models showing which variables are in each model. The models are ordered by the specified model selection statistic. This plot is particularly useful when there are more than ten or so models and the simple table produced by summary.regsubsets is too big to read.

(g) Which variables are selected in the BIC-optimal model? What are the values of their coefficients?

coef(regfit.full,6)

##  (Intercept)        AtBat         Hits        Walks         CRBI 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 
##    DivisionW      PutOuts 
## -122.9515338    0.2643076

3. Forward and Backward Stepwise Selection

The next portion of the lab gets you to carry out the Lab in §6.5.2 of ISLR (Page 247). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

(a) Apply Forward and Backward stepwise selection to the `Hitters` data.

regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
summary(regfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
summary(regfit.bwd)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 5  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 6  ( 1 )  " "  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

(b) Compare the best 3-variable model identified by best subset, forward, and backward selection. Are they the same or different?

coef(regfit.full,3)

## (Intercept)        Hits        CRBI     PutOuts 
## -71.4592204   2.8038162   0.6825275   0.2735814

coef(regfit.fwd,3)

## (Intercept)        Hits        CRBI     PutOuts 
## -71.4592204   2.8038162   0.6825275   0.2735814

coef(regfit.bwd,3)

## (Intercept)        Hits       CRuns     PutOuts 
## -79.3969049   2.6431989   0.6648928   0.3100558

(c) Compare the models selected by BIC using best subset, forward, and backward selection. Are these models the same or different?

# BIC selected best subset model
coef(regfit.full, which.min(summary(regfit.full)$bic))

##  (Intercept)        AtBat         Hits        Walks         CRBI 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 
##    DivisionW      PutOuts 
## -122.9515338    0.2643076

# BIC selected forward stepwise model
coef(regfit.fwd, which.min(summary(regfit.fwd)$bic))

##  (Intercept)        AtBat         Hits        Walks         CRBI 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 
##    DivisionW      PutOuts 
## -122.9515338    0.2643076

# BIC selected backward model
coef(regfit.bwd, which.min(summary(regfit.bwd)$bic))

##  (Intercept)        AtBat         Hits        Walks        CRuns 
##  117.1520434   -2.0339209    6.8549136    6.4406642    0.7045391 
##         CRBI       CWalks    DivisionW      PutOuts 
##    0.5273238   -0.8066062 -123.7798366    0.2753892

The best subset and forward stepwise models wind up being the same, but the backward stepwise BIC-selected model has more variables.
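
Forward stepwise selection is a greedy search: at each step it adds the one variable that most reduces the RSS and never removes a variable once it is in. The base-R sketch below shows the idea on simulated data. The data and the helper `forward_path` are hypothetical, and this is not the algorithm `regsubsets` uses internally.

```r
# Simulated data with 5 predictors; only x1 and x3 matter.
set.seed(1)
n <- 80
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- 3 * dat$x1 - 2 * dat$x3 + rnorm(n)

forward_path <- function(data, response = "y") {
  remaining <- setdiff(names(data), response)
  chosen <- character(0)
  path <- list()
  while (length(remaining) > 0) {
    # RSS of the current model plus each candidate variable
    rss <- sapply(remaining, function(v) {
      deviance(lm(reformulate(c(chosen, v), response = response), data = data))
    })
    best <- names(rss)[which.min(rss)]
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
    path[[length(chosen)]] <- chosen       # model of size length(chosen)
  }
  path
}

path <- forward_path(dat)
path[[3]]   # variables in the 3-variable forward stepwise model
```

Because each model in the path contains the previous one, forward (and likewise backward) selection can miss the best subset of a given size, which is why the three methods above need not agree.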

4. Choosing Among Models using the Validation Set Approach and Cross-Validation

The next portion of the lab gets you to carry out the Lab in §6.5.3 of ISLR (Page 248 - 250). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

All of the code needed to carry out this component of the lab is provided for you below. As you’re going through the textbook explanation and running the code, answer the following conceptual questions:

When we use the Validation Set Approach or Cross-Validation, do we first run best subset/forward/backward selection to get a sequence of models, or do we perform these steps on the training set of each validation method?

It is critical that the selection procedure be conducted inside the cross-validation routine. If you first get a sequence of models through, say, Forward Stepwise (using the full data) and then you apply Cross-Validation to estimate the test error of each model, your estimates will be overly optimistic. As a general rule, any part of your model fitting that ever uses the outcome variable Y needs to be put inside the cross-validation loop in order to get accurate test error estimates.

True or False: (In the validation set approach…) In the end, the variables we wind up using in our final model (the one fit to the full data) will be the same as those selected on the training set.

FALSE. The validation set approach allows us to pick a model size. The final model is obtained by running, say, Forward Stepwise Selection on the full data, and choosing the model of the size selected by the Validation set approach. E.g., your Validation set approach may tell you that the 10 predictor model has the lowest estimated test error. But the best 10 variable model on your full data may differ from the best 10 variable model from your Training split.

True or False: When we use Cross-Validation with Best subset/Forward/Backward selection, we need to apply the variable selection method on each set of training data.

TRUE. This is a rephrasing of the first question.
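
The optimism from selecting variables outside the cross-validation loop can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the data are pure noise, so no model can truly predict `y`, and a simple correlation screen (the hypothetical helper `top_vars`) stands in for stepwise selection.

```r
set.seed(1)
n <- 50; p <- 500; k <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                        # y is independent of every column of x
folds <- sample(rep(1:k, length.out = n))

# Pick the m predictors most correlated with y (this step uses y).
top_vars <- function(x, y, m = 5) {
  order(abs(cor(x, y)), decreasing = TRUE)[1:m]
}

cv_mse <- function(select_inside) {
  vars_all <- top_vars(x, y)         # selection on the full data (the wrong way)
  errs <- numeric(k)
  for (j in 1:k) {
    tr <- folds != j
    # Right way: redo the selection using only the training folds.
    vars <- if (select_inside) top_vars(x[tr, ], y[tr]) else vars_all
    fit  <- lm(y ~ ., data = data.frame(y = y[tr], x[tr, vars]))
    pred <- predict(fit, newdata = data.frame(x[!tr, vars]))
    errs[j] <- mean((y[!tr] - pred)^2)
  }
  mean(errs)
}

c(outside = cv_mse(FALSE), inside = cv_mse(TRUE))
```

Since `y` is unrelated to `x`, an honest error estimate should be no better than predicting the mean. Selecting on the full data first lets the held-out rows influence which variables are chosen, so the "outside" estimate looks better than it should.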

# You'll need to set eval = TRUE in the code chunk header
# in order for this code to run
set.seed(1)
train=sample(c(TRUE,FALSE), nrow(Hitters),replace=TRUE)
test=(!train)
regfit.best=regsubsets(Salary~.,data=Hitters[train,],nvmax=19)
test.mat=model.matrix(Salary~.,data=Hitters[test,])
val.errors=rep(NA,19)
for(i in 1:19){
   coefi=coef(regfit.best,id=i)
   pred=test.mat[,names(coefi)]%*%coefi
   val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}
val.errors
which.min(val.errors)
coef(regfit.best,10)
predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  xvars=names(coefi)
  mat[,xvars]%*%coefi
  }
regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19)
coef(regfit.best,10)
k=10
set.seed(1)
folds=sample(1:k,nrow(Hitters),replace=TRUE)
cv.errors=matrix(NA,k,19, dimnames=list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19)
  for(i in 1:19){
    pred=predict(best.fit,Hitters[folds==j,],id=i)
    cv.errors[j,i]=mean( (Hitters$Salary[folds==j]-pred)^2)
    }
  }
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
coef(reg.best,11)
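
The prediction step in the loop and in `predict.regsubsets` relies on one idea: build the design matrix for the new rows, keep only the columns named in the coefficient vector, and multiply. The sketch below checks that idea against `predict()` for an ordinary `lm` fit on the built-in `mtcars` data. The row split and variable names are chosen only for illustration.

```r
# Built-in mtcars data, used only for illustration.
fit  <- lm(mpg ~ wt + hp, data = mtcars[1:20, ])
beta <- coef(fit)                      # named: "(Intercept)", "wt", "hp"

new_rows <- mtcars[21:32, ]

# Design matrix with a column for every predictor, as in test.mat above.
mat <- model.matrix(mpg ~ ., data = new_rows)

# Select columns by coefficient name, then multiply to get predictions.
pred_manual <- drop(mat[, names(beta)] %*% beta)

all.equal(unname(pred_manual), unname(predict(fit, newdata = new_rows)))
```

Matching columns by name is what lets the same design matrix serve models of every size, since each model's coefficient vector names exactly the columns it needs.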

5. The Lasso

The next portion of the lab gets you to carry out the Lab in §6.6.2 of ISLR (Page 255). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

# Define x matrix and y vector for use with glmnet
x <- model.matrix(Salary~.,Hitters)[,-1]
y <- Hitters$Salary

# Split data into test and train
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]

# Predefined grid of lambda values:
grid=10^seq(10,-2, length =100)

(a) Use the `glmnet` command to fit a lasso model to the `train` subset of the `Hitters` data. You’ll want to specify `lambda=grid` to use the predefined sequence of $\lambda$ values constructed above. Use the `plot` command to produce a regularization plot of your model fit.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)

(b) Apply cross-validation on the training data using `cv.glmnet`. Use the `plot` command to construct a CV error curve.

set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

(c) What value of $\lambda$ minimizes the CV error? What is the test set prediction error for the model at this choice of lambda? Is this error similar to the CV error?

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)

## [1] 100743.4

(d) What is the 1-SE rule choice of $\lambda$? What is the test set prediction error for the model at this choice of lambda? Is this error similar to the CV error?

bestlam.1se <- cv.out$lambda.1se
lasso.pred.1se <- predict(lasso.mod,s=bestlam.1se,newx=x[test,])
mean((lasso.pred.1se-y.test)^2)

## [1] 142495.4
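
The 1-SE rule picks the largest $\lambda$ (the most regularized model) whose CV error is within one standard error of the minimum. The sketch below applies the rule by hand to made-up numbers. The vectors are hypothetical, named after the `lambda`, `cvm`, and `cvsd` components of a `cv.glmnet` object, and are not output from the fit above.

```r
# Hypothetical CV results, ordered from largest to smallest lambda.
lambda <- c(100, 50, 20, 10, 5, 1)
cvm    <- c(150, 120, 105, 100, 102, 110)   # mean CV error at each lambda
cvsd   <- c(10, 9, 8, 8, 8, 9)              # standard error of each mean

i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# Largest lambda whose CV error is within one SE of the minimum.
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
# lambda_min = 10, lambda_1se = 20
```

The 1-SE choice trades a little estimated accuracy for a simpler model, which is consistent with the 1-SE model above having a larger test error than the $\lambda$-min model.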

(e) How many non-zero coefficients are there in the $\lambda$-min model? How about the 1-SE model?

out <- glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,]
lasso.coef

##  (Intercept)        AtBat         Hits        HmRun         Runs 
##   18.5394844    0.0000000    1.8735390    0.0000000    0.0000000 
##          RBI        Walks        Years       CAtBat        CHits 
##    0.0000000    2.2178444    0.0000000    0.0000000    0.0000000 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##    0.0000000    0.2071252    0.4130132    0.0000000    3.2666677 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -103.4845458    0.2204284    0.0000000    0.0000000    0.0000000

lasso.coef[lasso.coef!=0]

##  (Intercept)         Hits        Walks        CRuns         CRBI 
##   18.5394844    1.8735390    2.2178444    0.2071252    0.4130132 
##      LeagueN    DivisionW      PutOuts 
##    3.2666677 -103.4845458    0.2204284

# Another approach:
nrow(predict(out, s = bestlam, type = "nonzero")) # nonzero in lambda min

## [1] 7

nrow(predict(out, s = bestlam.1se, type = "nonzero")) # nonzero in 1se

## [1] 5
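
One way to see why the lasso produces exact zeros: in the special case discussed in ISLR §6.2.2, where the design matrix is the identity, the lasso solution is the least squares estimate pulled toward zero by a fixed amount and set to zero if it is smaller than that amount. The sketch below implements this soft-thresholding operator. The function name, the `threshold` argument, and the input values are assumptions, and this is not the general algorithm `glmnet` runs.

```r
# Soft-thresholding: shrink each value toward zero by `threshold`,
# and set it to exactly zero if its magnitude is below `threshold`.
soft_threshold <- function(z, threshold) {
  sign(z) * pmax(abs(z) - threshold, 0)
}

z <- c(-3, -0.5, 0.2, 1.5, 4)   # hypothetical least squares coefficients
shrunk <- soft_threshold(z, threshold = 1)
shrunk
# -2.0  0.0  0.0  0.5  3.0

sum(shrunk != 0)                # number of non-zero coefficients: 3
```

Raising the threshold zeroes out more coefficients, which mirrors the drop from 7 to 5 non-zero coefficients when moving from the $\lambda$-min model to the 1-SE model.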
