Just like every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:
install.packages("ggplot2")
install.packages("MASS")
install.packages("ISLR")
install.packages("knitr")
You only need to install packages once. Once they’re installed, you may use them by loading the libraries with the library() command. For today’s lab, you’ll want to run the following code:
library(ggplot2) # graphics library
library(MASS) # contains data sets, including Boston
library(ISLR) # contains code and data from the textbook
library(knitr) # contains kable() function
options(scipen = 4) # Suppresses scientific notation
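If you re-run this document, re-installing the packages on every run is wasteful. A common guard (a sketch, not part of the lab instructions) is to install only the packages that aren't already available:

```r
# Install only the packages that aren't already installed
pkgs <- c("ggplot2", "MASS", "ISLR", "knitr")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```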
This portion of the lab gets you to carry out the Lab in §3.6 of ISLR (Pages 109 - 118). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.
Please run all of the code indicated in §3.6 of ISLR, even if I don’t explicitly ask you to do so in this document.
Note: You may want to use the View(Boston) command instead of fix(Boston).
Use the dim() command to figure out the number of rows and columns in the Boston housing data.
# View(Boston)
dim(Boston)
## [1] 506 14
Use the nrow() and ncol() commands to figure out the number of rows and columns in the Boston housing data.
# Edit me
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Use the names() command to see which variables exist in the data. Which of these variables is our response variable? What does this response variable refer to? How many input variables do we have?
# Edit me
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The response variable is medv, which represents the median house value for various neighbourhoods in Boston.
There are a total of 14 columns in the data. One of these is the response variable medv, which leaves 13 “predictor” or “input” variables.
Use the lm() function to fit a linear regression of medv on lstat. Save the output of your linear regression in a variable called lm.fit.
# Edit me
lm.fit <- lm(medv ~ lstat, data = Boston)
Use the summary() command on your lm.fit object to get a printout of your regression results.
# Edit me
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
kable(coef(summary(lm.fit)), digits = c(4, 5, 2, 4))
|             | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|----------|------------|---------|----------|
| (Intercept) | 34.5538  | 0.56263    | 61.42   | 0        |
| lstat       | -0.9500  | 0.03873    | -24.53  | 0        |
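The values printed by summary() can also be pulled out of the summary object programmatically, which is handy for reports. A small illustration (assuming MASS is loaded so Boston is available):

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
s <- summary(lm.fit)

s$r.squared     # Multiple R-squared, about 0.544
s$sigma         # residual standard error, about 6.216
s$coefficients  # the full coefficient matrix: estimates, SEs, t values, p-values
```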
Use names() on lm.fit to explore what values this linear model object contains.
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Use the coef() function to get the estimated coefficients. What is the estimated Intercept? What is the coefficient of lstat in the model? Interpret this coefficient.
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
coef(lm.fit)["(Intercept)"]
## (Intercept)
## 34.55384
coef(lm.fit)["lstat"]
## lstat
## -0.9500494
The intercept in the model is 34.6.
The coefficient of lstat in the model is -0.95. This means that for each one percentage point increase in the % of low socioeconomic status individuals residing in the neighbourhood, median home values on average decrease by $950.
Use qplot() to construct a scatterplot of medv vs. lstat. Edit the xlab and ylab arguments to produce more meaningful axis labels. Does the linear model appear to fit the data well? Explain.
qplot(data = Boston, x = lstat, y = medv,
      xlab = "% of individuals of low socioeconomic status",
      ylab = "Median home value ($1000's)") +
  stat_smooth(method = "lm")
The linear model appears to be a pretty good fit to the data in the lstat range of 10 - 25. However, the overall relationship between median home value and the % of low socioeconomic status individuals in the neighbourhood appears to be non-linear.
Here’s a plot showing a local regression fit to the data. The local regression model appears to do a better job of capturing the trends in the data.
qplot(data = Boston, x = lstat, y = medv,
ylab = "Median home value ($1000's)",
xlab = "% of individuals of low socioeconomic status") +
stat_smooth(method = "loess")
# Confidence intervals
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Running confint on a regression model in the above way simply produces 95% confidence intervals for the parameters. The above output gives us a 95% CI for the Intercept \(\beta_0\) and for the coefficient of lstat.
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
Here is the signature of the predict.lm command:
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
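One argument worth noting is se.fit: setting it to TRUE returns the standard error of each fitted mean, which is what the confidence intervals above are built from. A quick sketch:

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
new.points <- data.frame(lstat = c(5, 10, 15))

pred <- predict(lm.fit, new.points, se.fit = TRUE)
pred$fit     # the same point predictions as above: 29.80, 25.05, 20.30
pred$se.fit  # standard errors of the estimated mean at each point
```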
Here we used predict on lm.fit to produce predicted values and confidence intervals for the expected value of medv at the new data points lstat = 5, 10, 15. These intervals match up exactly with the upper and lower endpoints of the shaded “confidence region” that you get as part of a linear model overlay. It’s a bit hard to see the values here because the confidence bands are so narrow:
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm") +
geom_vline(xintercept = c(5, 10, 15), lty = 2)
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
Notice that the interval type is now "prediction", not "confidence". Prediction intervals are confidence intervals for the actual value of \(Y\), not just its mean or “expected value”. They are wider because they are trying to contain not just the average value of \(Y\), but the actual value of \(Y\).
If you are not familiar with prediction intervals, there is some discussion on page 82 of ISLR. You may also find it helpful to Ctrl + F for mentions of “prediction interval” in ISLR, which will pop up a few other helpful hits.
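To make the confidence vs. prediction distinction concrete, we can compare the widths of the two kinds of intervals at the same points (a small check, not part of the ISLR lab):

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
new.points <- data.frame(lstat = c(5, 10, 15))

ci <- predict(lm.fit, new.points, interval = "confidence")
pi <- predict(lm.fit, new.points, interval = "prediction")

ci[, "upr"] - ci[, "lwr"]  # confidence widths: roughly 1.6, 1.2, 1.1
pi[, "upr"] - pi[, "lwr"]  # prediction widths: roughly 24.5 at every point
```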
Use ?Boston to figure out what the age variable means. What does age mean in the Boston Housing data?
The age variable gives the proportion of homes in the neighbourhood built prior to 1940.
Use the qplot() command to construct a scatterplot of medv versus age. Make sure to specify meaningful x and y axis names. Overlay a linear regression line. Does a linear relationship appear to hold between the two variables?
qplot(data = Boston, x = age, y = medv,
xlab = "Proportion of owner-occupied units built prior to 1940",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm")
There appears to be at best a weak negative linear association between medv and the age variable.
Use the lm() command to fit a linear regression of medv on lstat and age. Save your regression model in a variable called lm.fit.
lm.fit <- lm(medv ~ lstat + age, data = Boston)
What is the coefficient of age in your model? Interpret this coefficient.
coef(lm.fit)["age"]
## age
## 0.03454434
Holding lstat constant, every additional percentage point increase in pre-1940 homes in a neighbourhood is associated with an average increase of about $35 in median home value.
Use the medv ~ . syntax to fit a model regressing medv on all the other variables. Use the summary() and kable() functions to produce a coefficients table in nice formatting.
lm.fit <- lm(medv ~ ., data = Boston)
kable(coef(summary(lm.fit)), digits = c(3, 3, 1, 4))
|             | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|----------|------------|---------|----------|
| (Intercept) | 36.459   | 5.103      | 7.1     | 0.0000   |
| crim        | -0.108   | 0.033      | -3.3    | 0.0011   |
| zn          | 0.046    | 0.014      | 3.4     | 0.0008   |
| indus       | 0.021    | 0.061      | 0.3     | 0.7383   |
| chas        | 2.687    | 0.862      | 3.1     | 0.0019   |
| nox         | -17.767  | 3.820      | -4.7    | 0.0000   |
| rm          | 3.810    | 0.418      | 9.1     | 0.0000   |
| age         | 0.001    | 0.013      | 0.1     | 0.9582   |
| dis         | -1.476   | 0.199      | -7.4    | 0.0000   |
| rad         | 0.306    | 0.066      | 4.6     | 0.0000   |
| tax         | -0.012   | 0.004      | -3.3    | 0.0011   |
| ptratio     | -0.953   | 0.131      | -7.3    | 0.0000   |
| black       | 0.009    | 0.003      | 3.5     | 0.0006   |
| lstat       | -0.525   | 0.051      | -10.3   | 0.0000   |
# Variables with positive signs
names(which(coef(lm.fit) > 0))
## [1] "(Intercept)" "zn" "indus" "chas" "rm"
## [6] "age" "rad" "black"
# Variables with negative signs
names(which(coef(lm.fit) < 0))
## [1] "crim" "nox" "dis" "tax" "ptratio" "lstat"
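The same which() trick works on p-values. The sketch below (not asked for in the lab) lists the terms that fail to reach significance at the 0.05 level; it matches the table above, where only indus and age have large p-values:

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ ., data = Boston)

# p-values are the "Pr(>|t|)" column of the coefficient matrix
pvals <- coef(summary(lm.fit))[, "Pr(>|t|)"]
names(which(pvals >= 0.05))  # the non-significant terms: indus and age
```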
Variables with positive coefficients:
- zn: proportion of residential land zoned for lots over 25,000 sq.ft. It makes sense that neighbourhoods with larger lots have higher home values.
- indus: proportion of non-retail business acres per town. This term isn’t statistically significant, so we shouldn’t worry about whether its sign makes sense.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). It’s possible that the neighbourhoods bordering the Charles River are more desirable.
- rm: average number of rooms per dwelling. More rooms, higher price. Makes total sense.
- age: proportion of owner-occupied units built prior to 1940. So-called “pre-war” buildings on the East coast tend to fetch higher prices. The coefficient is not statistically significant (perhaps due to collinearity with other variables).
- rad: index of accessibility to radial highways. Highway access = easier commutes, which can make neighbourhoods more desirable/expensive.
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town. This variable is hard to interpret. Why are we looking at (Bk - 0.63)^2?

Variables with negative coefficients:
- crim: per capita crime rate by town. Neighbourhoods with high crime tend to be less desirable.
- nox: nitrogen oxides concentration (parts per 10 million). Neighbourhoods with less air pollution tend to be more expensive.
- dis: weighted mean of distances to five Boston employment centres. People will pay more to live close to the main employment centres.
- tax: full-value property-tax rate per $10,000. Neighbourhoods with higher tax rates tend to have lower housing prices, all else being equal. This also makes sense.
- ptratio: pupil-teacher ratio by town. People will pay more to send their kids to schools where there are fewer students for every teacher. Makes sense.
- lstat: % of the population with low socioeconomic status. Makes sense. Less affluent neighbourhoods have lower home values.
Regress medv onto a quadratic polynomial of lstat by using the formula medv ~ lstat + I(lstat^2). Use the summary() function to display the estimated coefficients. Is the coefficient of the squared term statistically significant?
summary(lm(medv ~ lstat + I(lstat^2), data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The coefficient of I(lstat^2) is highly statistically significant.
Try medv ~ lstat + lstat^2 instead. What happens?
summary(lm(medv ~ lstat + lstat^2, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + lstat^2, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
This gives the same fit as medv ~ lstat. Inside a formula, ^ is interpreted as crossing syntax rather than arithmetic, so lstat^2 reduces to just lstat; the I() function is important because it protects the arithmetic meaning of ^.
Try medv ~ poly(lstat, 2). Compare your results to part (a).
summary(lm(medv ~ poly(lstat, 2), data = Boston))
##
## Call:
## lm(formula = medv ~ poly(lstat, 2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2456 91.76 <2e-16 ***
## poly(lstat, 2)1 -152.4595 5.5237 -27.60 <2e-16 ***
## poly(lstat, 2)2 64.2272 5.5237 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The poly(lstat, 2) function uses an orthogonal polynomial representation of the data, so the printed coefficients differ from part (a) even though the fitted model is the same. To get exactly the coefficients from part (a), we can specify raw = TRUE:
summary(lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston))
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.332821 0.123803 -18.84 <2e-16 ***
## poly(lstat, 2, raw = TRUE)2 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
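Although the two parameterizations print different coefficients, they describe the same fitted curve. A quick check (assuming MASS is loaded so Boston is available):

```r
library(MASS)  # for the Boston data set

fit.raw  <- lm(medv ~ lstat + I(lstat^2), data = Boston)
fit.orth <- lm(medv ~ poly(lstat, 2), data = Boston)

# Different coefficients, identical fitted values
all.equal(fitted(fit.raw), fitted(fit.orth))  # TRUE
```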
ggplot’s stat_smooth command allows us to visualize simple regression models in a really easy way. This set of problems helps you get accustomed to specifying polynomial and step function formulas for the purpose of visualization.
For this problem, please refer to the code posted here: Week 1 R code
Use ggplot graphics to construct a scatterplot of medv vs. lstat, overlaying a 2nd degree polynomial. Does this appear to be a good model of the data? Construct plots with higher degree polynomial fits. Do any of them appear to describe the data particularly well?
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 2)) +
ggtitle("medv ~ poly(lstat, 2)")
The quadratic model fits the data OK. It has poor behavior for lstat above 25%: the data does not indicate that home values increase as poverty increases.
Here’s a 4th degree polynomial, which seems to do a better job of fitting the data in the <30% lstat range. The behavior out at lstat > 30% is still questionable.
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 4)) +
ggtitle("medv ~ poly(lstat, 4)")
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm",
formula = y ~ cut(x, breaks = c(-Inf, 5, 10, 15, 20, Inf))) +
ggtitle("Step functions")
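The step functions come from cut(), which bins a numeric variable into a factor; stat_smooth then fits a separate constant within each bin. A tiny illustration of what cut() produces:

```r
# Each value is assigned to the half-open interval that contains it
x <- c(3, 7, 12, 18, 25)
bins <- cut(x, breaks = c(-Inf, 5, 10, 15, 20, Inf))
bins  # a factor with five levels, one per interval; each x falls in a different bin
```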
Repeat the above, but with ptratio as the x-axis variable, and medv still as the y-axis variable.
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 2)) +
ggtitle("medv ~ poly(ptratio, 2)")
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 1)) +
ggtitle("Linear model: medv ~ ptratio")
Now repeat the step function plot, using ptratio instead of lstat.
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm",
formula = y ~ cut(x, breaks = c(-Inf, 13.5, 15.5, 18.5, 20, Inf))) +
ggtitle("Step functions")