library(ggplot2)
library(knitr)
# Generate random 5 x 2 matrix
set.seed(12345)
matrix(rpois(10, 100), nrow = 5)
##      [,1] [,2]
## [1,]  105  105
## [2,]  107   97
## [3,]   98  116
## [4,]   95   95
## [5,]  123  103
Many of the problems you’ll come across in the future are classification problems (and if they’re not, you can typically turn them into classification problems). There are also notable cases where a good prediction of a quantitative outcome is desired: e.g., stock prices, home values, crop yields.
Stock prices are really hard to predict (if I could do this with any accuracy, I’d be wealthy and retired by now). Instead, we’ll look at a problem with lower stakes: predicting students’ final course grades.
set.seed(12345)
grades <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/student-mat.csv", sep = ";", header = TRUE)
grades <- grades[,-c(31, 32)] # Remove mid-year grades (columns 31 and 32)
# Set aside 95 of the students for a test set
test.indexes <- sample(1:nrow(grades), 95)
train.indexes <- setdiff(1:nrow(grades), test.indexes)
grades.train <- grades[train.indexes, ]
grades.test <- grades[test.indexes, ]
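Before moving on, here’s a quick sanity check on the split (a small sketch; it only uses the objects just created above):
# Sizes of the two pieces should add up to nrow(grades),
# and the train/test index sets should not overlap
c(train = nrow(grades.train), test = nrow(grades.test), total = nrow(grades))
length(intersect(train.indexes, test.indexes))  # should be 0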
Our outcome of interest is the student’s final math grade (a score between 0 and 20). This is represented by column G3 in the data.
All of the other variables in the dataset correspond to information that’s available on the first day of class, and have no direct connection to math ability. We’re essentially looking to see if we can predict final math grades based on the students’ age, gender, family situation, and some reported aspects of recreational time use. Tall order, but let’s see how we do.
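To get a feel for what we’re working with, we can peek at the available predictors and at the distribution of the outcome (a brief sketch; all column names come from the data read in above):
# Names of the available predictors (everything except the outcome G3)
setdiff(names(grades.train), "G3")
# Distribution of the final grade we're trying to predict
summary(grades.train$G3)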
Once again, we’re taking the viewpoint that what goes into the regression doesn’t really matter. We’re not even going to look at the coefficients. We just want predictions.
grades.lm <- lm(G3 ~ ., data = grades.train)
OK, we fit a model. How well did we do?
First, what does it mean to do well? If we didn’t have any information on the students whatsoever, but we knew what the average grade in the class tends to be, our “best guess” of everyone’s grade would be that average (or the median, if you prefer). The average course grade winds up being:
mean.grade <- mean(grades.train$G3)
mean.grade
## [1] 10.39333
How well does guessing the average work? We can calculate the mean squared error, root mean squared error, and mean absolute error of this prediction.
\[ MSE = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat y_i)^2 \]
\[ RMSE = \sqrt{MSE} \]
\[ MAE = \frac{1}{n} \sum_{i = 1}^n |y_i - \hat y_i| \]
We’ll use vectorized operations to compute all these quantities (instead of loops).
# MSE (mean squared error)
mean((grades.train$G3 - mean.grade)^2)
## [1] 20.74529
# RMSE (root mean squared error)
sqrt(mean((grades.train$G3 - mean.grade)^2))
## [1] 4.5547
# MAE (mean absolute error)
mean(abs(grades.train$G3 - mean.grade))
## [1] 3.384489
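For comparison, here is what the MSE calculation would look like with an explicit loop (shown only for contrast; the vectorized one-liner above is the version we’ll actually use):
# Loop version of the MSE calculation
n <- nrow(grades.train)
squared.errors <- numeric(n)
for(i in 1:n) {
  squared.errors[i] <- (grades.train$G3[i] - mean.grade)^2
}
mean(squared.errors)  # same value as the vectorized MSE above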
We’ll be calculating these quantities a lot, so let’s wrap up the entire calculation into a function.
# Function that calculates MSE, RMSE and MAE of a predictor
# Inputs:
#   fitted - fitted or predicted values
#   true - vector of observed outcomes
calcErrorMetrics <- function(fitted, true) {
  mse <- mean((fitted - true)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(fitted - true))
  return(c(mse = mse, rmse = rmse, mae = mae))
}
metrics.ave <- calcErrorMetrics(mean.grade, grades.train$G3)
metrics.ave
##       mse      rmse       mae
## 20.745289  4.554700  3.384489
Now that we have a benchmark to compare to, let’s see how well our linear regression does.
metrics.lm <- calcErrorMetrics(grades.lm$fitted, grades.train$G3)
metrics.lm
##       mse      rmse       mae
## 14.843056  3.852669  2.959920
How did we do?
metrics.lm / metrics.ave
##       mse      rmse       mae
## 0.7154904 0.8458667 0.8745545
We improved some, but not much. In particular, these values are based on the training data. Let’s see how we do on the test data.
Let’s repeat all of the above calculations, this time on the test data.
# Note: we're using the average grade calculated on the *training data*
metrics.ave.test <- calcErrorMetrics(mean.grade, grades.test$G3)
metrics.ave.test
##       mse      rmse       mae
## 21.542220  4.641360  3.574807
And now with our regression model:
grades.predicted <- predict(grades.lm, grades.test)
metrics.lm.test <- calcErrorMetrics(grades.predicted, grades.test$G3)
metrics.lm.test
##       mse      rmse       mae
## 18.956138  4.353865  3.434196
How did we do?
metrics.lm.test / metrics.ave.test
##       mse      rmse       mae
## 0.8799528 0.9380580 0.9606661
Not well at all. If we look at MAE, we barely made a dent!
Indeed, let’s take a look at a scatterplot of the students’ actual score vs their predicted score.
qplot(x = grades.test$G3, y = grades.predicted)
Our predictions don’t seem very close to the real outcomes. This happens. A lot. Not everything is easily predictable.
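One way to make the comparison easier to read is to add the line y = x, where perfect predictions would fall (a sketch using ggplot(); the variables are the ones defined above):
# Actual vs. predicted final grades, with the y = x line for reference
plot.df <- data.frame(actual = grades.test$G3, predicted = grades.predicted)
ggplot(plot.df, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Actual final grade (G3)", y = "Predicted final grade")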
Our performance on the grades data isn’t great. Evaluating our model on the test set allowed us to see this. If we don’t validate our model in this way, we can easily run into trouble. Here’s an example.
We’ll generate 200 variables that have nothing to do with the students’ math scores. These variables will just be random numbers. We’ll then regress math scores on these variables and look at our performance on the training data.
random.data <- data.frame(G3 = grades.train$G3, matrix(rnorm(200 * nrow(grades.train)), nrow = nrow(grades.train)))
random.lm <- lm(G3 ~ ., data = random.data)
metrics.random <- calcErrorMetrics(random.lm$fitted, grades.train$G3)
round(rbind(metrics.ave, metrics.lm, metrics.random), 2)
##                  mse rmse  mae
## metrics.ave    20.75 4.55 3.38
## metrics.lm     14.84 3.85 2.96
## metrics.random  6.17 2.48 1.91
Our (training set) prediction error is much lower when we use a bunch of completely random predictors that have nothing to do with end-of-term grades than when we regress on data that’s actually collected from the students. Let’s look at how we do on the test data.
# Generate another batch of random numbers for the test data
random.test.data <- data.frame(matrix(rnorm(200 * nrow(grades.test)), nrow = nrow(grades.test)))
metrics.random.test <- calcErrorMetrics(predict(random.lm, random.test.data), grades.test$G3)
# prediction error on test data
round(rbind(metrics.ave.test, metrics.lm.test, metrics.random.test), 2)
##                       mse rmse  mae
## metrics.ave.test    21.54 4.64 3.57
## metrics.lm.test     18.96 4.35 3.43
## metrics.random.test 61.91 7.87 6.41
On the test data, we see that our random variable model does horribly. Its average prediction error is nearly twice as high as just predicting with the mean.
Takeaway: When you have a large number of covariates, it’s easy to overfit the data. When you overfit the training data, you get a model that describes the training data really well, but which doesn’t give good predictions on unseen data.
(Source: http://pingax.com/regularization-implementation-r/)
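One quick way to see how flattering the in-sample fit can look is the model’s R-squared on the training data (a sketch; random.lm is the random-predictor fit from above):
# In-sample R-squared of the random-predictor model: high despite the
# predictors carrying no real information about G3
summary(random.lm)$r.squared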
In this class we’ve made considerable use of the ggplot2 and plyr packages. Here’s a list of packages that we haven’t seen in this class, but which you will likely find useful in your analytics career.
foreign - Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, …
xlsx - Read/write Excel data
RSQLite - SQLite Interface for R
RMySQL - MySQL Interface for R
dplyr - Tools for efficiently subsetting, summarizing, rearranging, and joining together data sets (see the short example after this list)
tidyr - Tools for reshaping your data into “tidy” formatting
Rcpp - Call C++ functions from R.
RPython - Call Python functions from R.
shiny - A web application framework for R
ggvis - Interactive web-based graphics
htmlwidgets - “Bring the best of JavaScript data visualization to R”
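As a small taste of dplyr, here is a grouped summary of the training data (a sketch; it assumes the sex column from the student data above and that dplyr is installed):
library(dplyr)
# Average final grade and group size, split by the sex column
grades.train %>%
  group_by(sex) %>%
  summarize(mean.G3 = mean(G3), n.students = n())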