library(ggplot2) # graphics library
library(ISLR) # contains code and data from the textbook
library(knitr) # contains knitting control
library(tree) # For the tree-fitting 'tree' function
library(MASS) # For Boston data
library(randomForest) # For random forests and bagging
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm) # For boosting
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
options(scipen = 4) # Suppresses scientific notation
You will need the Carseats data set from the ISLR library in order to complete this exercise. Please run all of the code indicated in §8.3.3-8.3.4 of ISLR, even if I don’t explicitly ask you to do so in this document.
Use the View() command on the Carseats data to see what the data set looks like.
#View(Carseats)
Next we create a binary High variable for the purpose of classification. Our goal will be to classify whether Carseat sales in a store are high or not.
High <- with(Carseats, ifelse(Sales <= 8, "No", "Yes"))
Carseats <- data.frame(Carseats, High)
# Split into training and testing
set.seed(2)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train,]
High.test <- High[-train]
By setting mtry = p (here p = 13 predictors in the Boston data) in the randomForest command, we can use the randomForest function to fit a bagged tree model. Here’s what we get.
# Define training set
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston$medv[-train]
set.seed(1)
bag.boston <- randomForest(medv ~ .,data=Boston, subset=train, mtry=13, importance=TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.02509
## % Var explained: 86.65
You can run the plot command on a fitted model object to get some kind of plot of the model. This plot shows the out-of-bag MSE on the y-axis. What is the plot showing on the x-axis?
plot(bag.boston)
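Note: the same information is stored in the fitted object itself. For a regression forest, bag.boston$mse holds the running out-of-bag MSE, one entry per tree grown, which is what plot() draws against the number of trees. Here is a quick sketch of drawing that curve by hand (just an illustration, not part of the assignment):
# Out-of-bag MSE of the ensemble truncated at each tree count
qplot(1:length(bag.boston$mse), bag.boston$mse, geom = "line",
      xlab = "Number of trees", ylab = "Out-of-bag MSE")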
The default is mtry = p/3 for regression and mtry = sqrt(p) for classification. Try fitting a random forest to the training data using mtry = 6. Call your fit rf.boston. Calculate its test set MSE.
rf.boston <- randomForest(medv~., data=Boston, subset=train, mtry=6, importance=TRUE)
# MSE
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
## [1] 11.22316
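Note: the assignment only asks for mtry = 6, but if you’re curious how sensitive the forest is to this choice, a rough sketch like the following (the grid of values is arbitrary) loops over several mtry settings and records each test MSE:
# Sketch: test-set MSE for a few candidate values of mtry
set.seed(1)
mtry.grid <- c(2, 4, 6, 9, 13)
mtry.mse <- sapply(mtry.grid, function(m) {
  fit <- randomForest(medv ~ ., data = Boston, subset = train, mtry = m)
  mean((predict(fit, newdata = Boston[-train, ]) - boston.test)^2)
})
data.frame(mtry = mtry.grid, test.mse = mtry.mse)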
Now run importance and varImpPlot on your random forest fit. How many plots are produced by the varImpPlot command? What are these plots showing?
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 12.988089 1078.43531
## zn 2.714889 76.50506
## indus 10.545100 1012.00217
## chas 3.263686 52.61111
## nox 12.906528 1156.33584
## rm 29.407391 5989.54048
## age 9.113592 521.17351
## dis 12.933480 1293.35669
## rad 3.594655 100.35282
## tax 8.819588 414.65202
## ptratio 12.224736 888.90254
## black 6.358499 336.69694
## lstat 31.387814 7645.22905
varImpPlot(rf.boston)
The varImpPlot command produces two plots, one for each importance measure. Here are the details for what these plots are showing:
The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by the residual sum of squares.
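To make the first (permutation-based) measure concrete, here is a rough sketch of the same idea computed on the test set for a single variable. The built-in measure permutes the out-of-bag samples within each tree, so the numbers won’t match exactly, but the logic is the same:
# Sketch: permutation-style importance for lstat, computed on the test set
set.seed(1)
X.test <- Boston[-train, ]
mse.orig <- mean((predict(rf.boston, newdata = X.test) - boston.test)^2)
X.perm <- X.test
X.perm$lstat <- sample(X.perm$lstat)   # break the association between lstat and medv
mse.perm <- mean((predict(rf.boston, newdata = X.perm) - boston.test)^2)
mse.perm - mse.orig                    # increase in MSE after scrambling lstat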
Use the partialPlot command to get partial dependence plots for the two most important variables according to the %IncMSE importance. Describe what you see.
partialPlot(rf.boston, Boston, x.var = "lstat")
partialPlot(rf.boston, Boston, x.var = "rm")
Using the gbm function, apply boosting with 5000 trees, allowing for trees up to depth 4. Fit to just the training data. You’ll need to specify distribution = "gaussian". Call your object boost.boston, and run the summary() command on it.
set.seed(1)
boost.boston=gbm(medv~., data=Boston[train,], distribution="gaussian",
n.trees=5000, interaction.depth=4)
summary(boost.boston)
## var rel.inf
## lstat lstat 45.9627334
## rm rm 31.2238187
## dis dis 6.8087398
## crim crim 4.0743784
## nox nox 2.5605001
## ptratio ptratio 2.2748652
## black black 1.7971159
## age age 1.6488532
## tax tax 1.3595005
## indus indus 1.2705924
## chas chas 0.8014323
## rad rad 0.2026619
## zn zn 0.0148083
Note: The summary() command will produce a plot of each variable’s relative influence. This should be interpreted just like the second importance measure for random forests: it’s the decrease in error from splitting on the variable, averaged over all the trees.
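If you want the relative influence numbers without the barplot, summary.gbm has a plotit argument you can switch off. A small sketch:
# Relative influence table only, no plot
rel.inf <- summary(boost.boston, plotit = FALSE)
rel.inf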
Use the plot command to get partial dependence plots for rm and lstat. How do these compare to the partial dependence plots from the random forest fit?
par(mfrow=c(1,2))
plot(boost.boston,i="rm")
plot(boost.boston,i="lstat")
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 11.84434
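Aside: predict.gbm accepts a vector of n.trees values and returns one column of predictions per value, so you can trace out the test MSE as the ensemble grows. A rough sketch (the grid of ensemble sizes is arbitrary):
# Sketch: test MSE of the boosted model at several ensemble sizes
n.grid <- seq(100, 5000, by = 100)
yhat.mat <- predict(boost.boston, newdata = Boston[-train, ], n.trees = n.grid)
mse.by.size <- apply(yhat.mat, 2, function(yhat) mean((yhat - boston.test)^2))
qplot(n.grid, mse.by.size, xlab = "Number of trees", ylab = "Test MSE")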
The boosting model above used the default shrinkage of shrinkage = 0.001. What happens if we take shrinkage = 0.2? Fit this model, keeping all other settings the same. Calculate the test MSE. Compare to the test MSE from the earlier model.
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 11.51109
Now try shrinkage = 0.7. Calculate the test MSE, and compare to the previous two versions.
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.7,verbose=F)
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 22.0092
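To see all three shrinkage settings side by side, a sketch like the following (assuming the first model used the default shrinkage of 0.001, as noted above) refits each one and collects the test MSEs in a single table:
# Sketch: test MSE for each shrinkage value tried above
set.seed(1)
shrink.grid <- c(0.001, 0.2, 0.7)
shrink.mse <- sapply(shrink.grid, function(s) {
  fit <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
             n.trees = 5000, interaction.depth = 4, shrinkage = s, verbose = FALSE)
  mean((predict(fit, newdata = Boston[-train, ], n.trees = 5000) - boston.test)^2)
})
data.frame(shrinkage = shrink.grid, test.mse = shrink.mse)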
gbm has built-in cross-validation functionality. Set the argument cv.folds = 10 to perform cross-validation. Call your object boost.boston.cv. Use the default shrinkage settings.
boost.boston.cv <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,verbose=F, cv.folds = 10)
boost.boston.cv now has an attribute called cv.error. This attribute gives the 10-fold CV error for each tree size. Plot the CV error against tree size. What number of trees gives the lowest CV error? What is the minimum CV error achieved by boosting?
qplot(1:5000, boost.boston.cv$cv.error, xlab = "Number of trees")
# minimum error
min(boost.boston.cv$cv.error)
## [1] 15.29775
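Note: gbm also provides the gbm.perf helper, which plots the training and CV error curves and returns the CV-optimal number of trees. A quick sketch:
# Number of trees minimizing the 10-fold CV error
best.n.trees <- gbm.perf(boost.boston.cv, method = "cv")
best.n.trees
which.min(boost.boston.cv$cv.error)   # should give the same tree count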
Now repeat the cross-validated fit with shrinkage = 0.2. Comment on your findings.
boost.boston.cv <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,verbose=F, shrinkage = 0.2, cv.folds = 10)
qplot(1:5000, boost.boston.cv$cv.error, xlab = "Number of trees")
# Minimum CV error attained
min(boost.boston.cv$cv.error)
## [1] 14.21935
# Number of trees used in min CV model
which.min(boost.boston.cv$cv.error)
## [1] 110
Here’s the small data set that’s generated in the book. This is a very stylized example, but it’s good enough for the purpose of introducing you to the basic clustering functions in R.
set.seed(2)
# generate data
x <- matrix(rnorm(50*2), ncol=2)
# shift the first 25 points to have mean (3, -4) instead of (0, 0)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4
class.lbl <- as.factor(c(rep(1, 25), rep(2, 25)))
qplot(x = x[,1], y = x[,2], colour = class.lbl, size = I(3)) + theme_bw()
Use the kmeans command to cluster the data into 2 classes. Specify nstart = 10. What does the nstart parameter mean?
km.out <- kmeans(x, 2, nstart=20)
table(km.out$cluster, class.lbl)
## class.lbl
## 1 2
## 1 0 25
## 2 25 0
set.seed(4)
km.out <- kmeans(x, 3, nstart=20)
km.out
## K-means clustering with 3 clusters of sizes 10, 23, 17
##
## Cluster means:
## [,1] [,2]
## 1 2.3001545 -2.69622023
## 2 -0.3820397 -0.08740753
## 3 3.7789567 -4.56200798
##
## Clustering vector:
## [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 19.56137 52.67700 25.74089
## (between_SS / total_SS = 79.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=3", xlab="", ylab="", pch=20, cex=2)
Now rerun the k = 3 clustering with nstart = 1. Try it a bunch of times. Do you always get the same answer?
set.seed(12)
km.out <- kmeans(x, 3, nstart = 1)
km.out
## K-means clustering with 3 clusters of sizes 22, 17, 11
##
## Cluster means:
## [,1] [,2]
## 1 3.4288025 -4.3144733
## 2 -0.5901963 0.3953919
## 3 1.1869615 -1.6663602
##
## Clustering vector:
## [1] 1 3 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 2 3 3 3 3
## [36] 2 2 2 2 2 2 2 2 3 3 3 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 49.89617 31.02061 19.97828
## (between_SS / total_SS = 78.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=3", xlab="", ylab="", pch=20, cex=2)
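One way to see what nstart buys you: run kmeans repeatedly with nstart = 1 and look at the spread of the objective value (total within-cluster sum of squares), then compare with a multi-start run. A rough sketch (the seed and number of repeats are arbitrary):
# Objective value over 10 single-start runs...
set.seed(3)
replicate(10, kmeans(x, 3, nstart = 1)$tot.withinss)
# ...versus the best of 20 random starts
kmeans(x, 3, nstart = 20)$tot.withinss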
We’re going to continue using the same simple data set as in Problem 3. But this time we’re going to learn about the hclust command for hierarchical clustering.
Use the hclust command with different choices of method to perform complete, average, and single linkage clustering of the data. Note that the hclust function requires a distance matrix as the first argument. You will want to pass in dist(x), not x, to the hclust command.
hc.complete <- hclust(dist(x), method="complete")
hc.average <- hclust(dist(x), method="average")
hc.single <- hclust(dist(x), method="single")
par(mfrow=c(1,3))
plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
Use the cutree command to cut each dendrogram into k = 2 clusters. You will notice from the documentation ?cutree that you can cut a dendrogram either by specifying the number of clusters you want, or the height you wish to cut at. Comment on the quality of the clusters.
cutree(hc.complete, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc.average, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
cutree(hc.single, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
par(mfrow = c(1,3))
plot(x, col=(cutree(hc.complete, 2)+1), main="Complete linkage clustering with K=2", xlab="", ylab="", pch=20, cex=2)
plot(x, col=(cutree(hc.average, 2)+1), main="Average linkage clustering with K=2", xlab="", ylab="", pch=20, cex=2)
plot(x, col=(cutree(hc.single, 2)+1), main="Single linkage clustering with K=2", xlab="", ylab="", pch=20, cex=2)
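It can also help to cross-tabulate each cut against the true class labels, just as we did for kmeans. A small sketch:
# Cross-tabulate each k = 2 cut against the known class labels
table(complete = cutree(hc.complete, 2), truth = class.lbl)
table(average = cutree(hc.average, 2), truth = class.lbl)
table(single = cutree(hc.single, 2), truth = class.lbl)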
Does cutting into k = 4 clusters with single linkage do any better? Plot your clustering.
cutree(hc.single, 4)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3
par(mfrow = c(1,2))
plot(x, col=(cutree(hc.single, 2)+1), main="Single linkage clustering with K=2", xlab="", ylab="", pch=20, cex=2)
plot(x, col=(cutree(hc.single, 4)+1), main="Single linkage clustering with K=4", xlab="", ylab="", pch=20, cex=2)
With k = 4, single linkage does a better job of recovering the original two classes. Two of the 4 clusters have just one point each, and the other two clusters are essentially the two classes we originally started with.
Note: Single linkage clustering commonly results in a mix of very small and very large clusters. It is common to take single linkage clustering and follow it up with a pruning step, where all small clusters are discarded. Single linkage can successfully capture irregularly shaped clusters in ways that the other linkages cannot, so this single-linkage-plus-prune approach can sometimes be quite successful.
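Here is a rough sketch of that prune step on our toy data, keeping only clusters with more than one point (the size threshold is arbitrary):
# Single linkage with k = 4, then drop the singleton clusters
cl.single <- cutree(hc.single, 4)
keep <- cl.single %in% names(which(table(cl.single) > 1))
table(pruned = cl.single[keep], truth = class.lbl[keep])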
par(mfrow = c(1,1))
# Finally, an example of clustering with a correlation-based distance:
# 3-dimensional data, with dissimilarity defined as 1 minus the correlation
# between observations (rows), converted to a distance object for hclust
x <- matrix(rnorm(30*3), ncol=3)
dd <- as.dist(1-cor(t(x)))
plot(hclust(dd, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")