Just like every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:
install.packages("ggplot2")
install.packages("MASS")
install.packages("ISLR")
install.packages("knitr")
You only need to install packages once. Once they’re installed, you may use them by loading the libraries with the library() command. For today’s lab, you’ll want to run the following code:
library(ggplot2) # graphics library
library(MASS) # contains data sets, including Boston
library(ISLR) # contains code and data from the textbook
library(knitr) # contains kable() function
options(scipen = 4) # Suppresses scientific notation
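If you re-run this document, re-installing the packages on every run is wasteful. A common guard (a sketch, not part of the lab instructions) is to install only the packages that aren't already available:

```r
# Install only the packages that aren't already installed
pkgs <- c("ggplot2", "MASS", "ISLR", "knitr")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```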
This portion of the lab gets you to carry out the Lab in §3.6 of ISLR (Pages 109 - 118). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.
Please run all of the code indicated in §3.6 of ISLR, even if I don’t explicitly ask you to do so in this document.
Note: You may want to use the View(Boston) command instead of fix(Boston).
Use the dim() command to figure out the number of rows and columns in the Boston housing data.
# View(Boston)
dim(Boston)
## [1] 506 14
Use the nrow() and ncol() commands to figure out the number of rows and columns in the Boston housing data.
# Edit me
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Use the names() command to see which variables exist in the data. Which of these variables is our response variable? What does this response variable refer to? How many input variables do we have?
# Edit me
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The response variable is medv, which represents the median house value for various neighbourhoods in Boston.
There are a total of 14 columns in the data. One of these is the response variable medv, which leaves 13 “predictor” or “input” variables.
Use the lm() function to fit a linear regression of medv on lstat. Save the output of your linear regression in a variable called lm.fit.
# Edit me
lm.fit <- lm(medv ~ lstat, data = Boston)
Use the summary() command on your lm.fit object to get a printout of your regression results.
# Edit me
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
kable(coef(summary(lm.fit)), digits = c(4, 5, 2, 4))
|             | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|----------|------------|---------|----------|
| (Intercept) | 34.5538  | 0.56263    | 61.42   | 0        |
| lstat       | -0.9500  | 0.03873    | -24.53  | 0        |
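The values printed by summary() can also be pulled out of the summary object programmatically, which is handy for reports. A small illustration (assuming MASS is loaded so Boston is available):

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
s <- summary(lm.fit)

s$r.squared     # Multiple R-squared, about 0.544
s$sigma         # residual standard error, about 6.216
s$coefficients  # the full coefficient matrix: estimates, SEs, t values, p-values
```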
Use names() on lm.fit to explore what values this linear model object contains.
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Use the coef() function to get the estimated coefficients. What is the estimated Intercept? What is the coefficient of lstat in the model? Interpret this coefficient.
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
coef(lm.fit)["(Intercept)"]
## (Intercept)
## 34.55384
coef(lm.fit)["lstat"]
## lstat
## -0.9500494
The intercept in the model is 34.6.
The coefficient of lstat in the model is -0.95. This means that for each one percentage point increase in the % of low socioeconomic status individuals residing in the neighbourhood, median home values on average decrease by $950.
Use qplot() to construct a scatterplot of medv vs. lstat. Edit the xlab and ylab arguments to produce more meaningful axis labels. Does the linear model appear to fit the data well? Explain.
qplot(data = Boston, x = lstat, y = medv,
      xlab = "% of individuals of low socioeconomic status",
      ylab = "Median home value ($1000's)") +
  stat_smooth(method = "lm")
The linear model appears to be a pretty good fit to the data in the lstat range of 10 - 25. However, the overall relationship between median home value and the % of low socioeconomic status individuals in the neighbourhood appears to be non-linear.
Here’s a plot showing a local regression fit to the data. The local regression model appears to do a better job of capturing the trends in the data.
qplot(data = Boston, x = lstat, y = medv,
ylab = "Median home value ($1000's)",
xlab = "% of individuals of low socioeconomic status") +
stat_smooth(method = "loess")
# Confidence intervals
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Running confint on a regression model in the above way simply produces 95% confidence intervals for the parameters. The above output gives us a 95% CI for the Intercept \(\beta_0\) and for the coefficient of lstat.
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
Here is the signature of the predict.lm command:
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
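One argument worth noting is se.fit: setting it to TRUE returns the standard error of each fitted mean, which is what the confidence intervals above are built from. A quick sketch:

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
new.points <- data.frame(lstat = c(5, 10, 15))

pred <- predict(lm.fit, new.points, se.fit = TRUE)
pred$fit     # the same point predictions as above: 29.80, 25.05, 20.30
pred$se.fit  # standard errors of the estimated mean at each point
```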
Here we used predict on lm.fit to produce predicted values and confidence intervals for the expected value of medv at the new data points lstat = 5, 10, 15. These intervals match up exactly with the upper and lower endpoints of the shaded “confidence region” that you get as part of a linear model overlay. It’s a bit hard to see the values here because the confidence bands are so narrow:
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm") +
geom_vline(xintercept = c(5, 10, 15), lty = 2)
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
Notice that the interval type is now "prediction", not "confidence". Prediction intervals are confidence intervals for the actual value of \(Y\), not just its mean or “expected value”. They are wider because they are trying to contain not just the average value of \(Y\), but the actual value of \(Y\).
If you are not familiar with prediction intervals, there is some discussion on page 82 of ISLR. You may also find it helpful to Ctrl + F for mentions of “prediction interval” in ISLR, which will pop up a few other helpful hits.
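To make the confidence vs. prediction distinction concrete, we can compare the widths of the two kinds of intervals at the same points (a small check, not part of the ISLR lab):

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
new.points <- data.frame(lstat = c(5, 10, 15))

ci <- predict(lm.fit, new.points, interval = "confidence")
pi <- predict(lm.fit, new.points, interval = "prediction")

ci[, "upr"] - ci[, "lwr"]  # confidence widths: roughly 1.6, 1.2, 1.1
pi[, "upr"] - pi[, "lwr"]  # prediction widths: roughly 24.5 at every point
```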
Use ?Boston to figure out what the age variable means. What does age mean in the Boston Housing data?
The age variable gives the proportion of homes in the neighbourhood built prior to 1940.
Use the qplot() command to construct a scatterplot of medv versus age. Make sure to specify meaningful x and y axis names. Overlay a linear regression line. Does a linear relationship appear to hold between the two variables?
qplot(data = Boston, x = age, y = medv,
xlab = "Proportion of owner-occupied units built prior to 1940",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm")
There appears to be at best a weak negative linear association between medv and the age variable.
Use the lm() command to fit a linear regression of medv on lstat and age. Save your regression model in a variable called lm.fit.
lm.fit <- lm(medv ~ lstat + age, data = Boston)
What is the coefficient of age in your model? Interpret this coefficient.
coef(lm.fit)["age"]
## age
## 0.03454434
Holding lstat constant, every additional percentage point increase in pre-1940 homes in a neighbourhood is associated with an average increase of about $35 in median home value.
Use the medv ~ . syntax to fit a model regressing medv on all the other variables. Use the summary() and kable() functions to produce a coefficients table in nice formatting.
lm.fit <- lm(medv ~ ., data = Boston)
kable(coef(summary(lm.fit)), digits = c(3, 3, 1, 4))
|             | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|----------|------------|---------|----------|
| (Intercept) | 36.459   | 5.103      | 7.1     | 0.0000   |
| crim        | -0.108   | 0.033      | -3.3    | 0.0011   |
| zn          | 0.046    | 0.014      | 3.4     | 0.0008   |
| indus       | 0.021    | 0.061      | 0.3     | 0.7383   |
| chas        | 2.687    | 0.862      | 3.1     | 0.0019   |
| nox         | -17.767  | 3.820      | -4.7    | 0.0000   |
| rm          | 3.810    | 0.418      | 9.1     | 0.0000   |
| age         | 0.001    | 0.013      | 0.1     | 0.9582   |
| dis         | -1.476   | 0.199      | -7.4    | 0.0000   |
| rad         | 0.306    | 0.066      | 4.6     | 0.0000   |
| tax         | -0.012   | 0.004      | -3.3    | 0.0011   |
| ptratio     | -0.953   | 0.131      | -7.3    | 0.0000   |
| black       | 0.009    | 0.003      | 3.5     | 0.0006   |
| lstat       | -0.525   | 0.051      | -10.3   | 0.0000   |
# Variables with positive signs
names(which(coef(lm.fit) > 0))
## [1] "(Intercept)" "zn" "indus" "chas" "rm"
## [6] "age" "rad" "black"
# Variables with negative signs
names(which(coef(lm.fit) < 0))
## [1] "crim" "nox" "dis" "tax" "ptratio" "lstat"
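The same which() trick works on p-values. The sketch below (not asked for in the lab) lists the terms that fail to reach significance at the 0.05 level; it matches the table above, where only indus and age have large p-values:

```r
library(MASS)  # for the Boston data set

lm.fit <- lm(medv ~ ., data = Boston)

# p-values are the "Pr(>|t|)" column of the coefficient matrix
pvals <- coef(summary(lm.fit))[, "Pr(>|t|)"]
names(which(pvals >= 0.05))  # the non-significant terms: indus and age
```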
Variables with positive coefficients:
- zn: proportion of residential land zoned for lots over 25,000 sq.ft. It makes sense that neighbourhoods with larger lots have higher home values.
- indus: proportion of non-retail business acres per town. This term isn’t statistically significant, so we shouldn’t worry about whether its sign makes sense.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). It’s possible that the neighbourhoods bordering the Charles River are more desirable.
- rm: average number of rooms per dwelling. More rooms, higher price. Makes total sense.
- age: proportion of owner-occupied units built prior to 1940. So-called “pre-war” buildings on the East coast tend to fetch higher prices. The coefficient is not statistically significant (perhaps due to collinearity with other variables).
- rad: index of accessibility to radial highways. Highway access = easier commutes, which can make neighbourhoods more desirable/expensive.
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town. This variable is hard to interpret. Why are we looking at (Bk - 0.63)^2?

Variables with negative coefficients:
- crim: per capita crime rate by town. Neighbourhoods with high crime tend to be less desirable.
- nox: nitrogen oxides concentration (parts per 10 million). Neighbourhoods with less air pollution tend to be more expensive.
- dis: weighted mean of distances to five Boston employment centres. People will pay more to live close to the main employment centres.
- tax: full-value property-tax rate per $10,000. Neighbourhoods with higher tax rates tend to have lower housing prices, all else being equal. This also makes sense.
- ptratio: pupil-teacher ratio by town. People will pay more to send their kids to schools where there are fewer students for every teacher. Makes sense.
- lstat: % of the population with low socioeconomic status. Makes sense. Less affluent neighbourhoods have lower home values.
Regress medv onto a quadratic polynomial of lstat by using the formula medv ~ lstat + I(lstat^2). Use the summary() function to display the estimated coefficients. Is the coefficient of the squared term statistically significant?
summary(lm(medv ~ lstat + I(lstat^2), data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The coefficient of I(lstat^2) is highly statistically significant.
Try medv ~ lstat + lstat^2 instead. What happens?
summary(lm(medv ~ lstat + lstat^2, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + lstat^2, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
This gives the same fit as medv ~ lstat. Inside a formula, ^ is interpreted as crossing syntax rather than arithmetic, so lstat^2 reduces to just lstat; the I() function is important because it protects the arithmetic meaning of ^.
Try medv ~ poly(lstat, 2). Compare your results to part (a).
summary(lm(medv ~ poly(lstat, 2), data = Boston))
##
## Call:
## lm(formula = medv ~ poly(lstat, 2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2456 91.76 <2e-16 ***
## poly(lstat, 2)1 -152.4595 5.5237 -27.60 <2e-16 ***
## poly(lstat, 2)2 64.2272 5.5237 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The poly(lstat, 2) function uses an orthogonal polynomial representation of the data, so the printed coefficients differ from part (a) even though the fitted model is the same. To get exactly the coefficients from part (a), we can specify raw = TRUE:
summary(lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston))
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.332821 0.123803 -18.84 <2e-16 ***
## poly(lstat, 2, raw = TRUE)2 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
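Although the two parameterizations print different coefficients, they describe the same fitted curve. A quick check (assuming MASS is loaded so Boston is available):

```r
library(MASS)  # for the Boston data set

fit.raw  <- lm(medv ~ lstat + I(lstat^2), data = Boston)
fit.orth <- lm(medv ~ poly(lstat, 2), data = Boston)

# Different coefficients, identical fitted values
all.equal(fitted(fit.raw), fitted(fit.orth))  # TRUE
```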
ggplot’s stat_smooth command allows us to visualize simple regression models in a really easy way. This set of problems helps you get accustomed to specifying polynomial and step function formulas for the purpose of visualization.
For this problem, please refer to the code posted here: Week 1 R code
Use ggplot graphics to construct a scatterplot of medv vs. lstat, overlaying a 2nd degree polynomial. Does this appear to be a good model of the data? Construct plots with higher degree polynomial fits. Do any of them appear to describe the data particularly well?
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 2)) +
ggtitle("medv ~ poly(lstat, 2)")
The quadratic model fits the data OK. It has poor behavior for lstat above 25%: the data does not indicate that home values increase as poverty increases.
Here’s a 4th degree polynomial, which seems to do a better job of fitting the data in the <30% lstat range. The behavior out at lstat > 30% is still questionable.
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 4)) +
ggtitle("medv ~ poly(lstat, 4)")
qplot(data = Boston, x = lstat, y = medv,
xlab = "% of individuals of low socioeconomic status",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm",
formula = y ~ cut(x, breaks = c(-Inf, 5, 10, 15, 20, Inf))) +
ggtitle("Step functions")
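The step functions come from cut(), which bins a numeric variable into a factor; stat_smooth then fits a separate constant within each bin. A tiny illustration of what cut() produces:

```r
# Each value is assigned to the half-open interval that contains it
x <- c(3, 7, 12, 18, 25)
bins <- cut(x, breaks = c(-Inf, 5, 10, 15, 20, Inf))
bins  # a factor with five levels, one per interval; each x falls in a different bin
```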
Repeat the above, but with ptratio as the x-axis variable, and medv still as the y-axis variable.
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 2)) +
ggtitle("medv ~ poly(ptratio, 2)")
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm", formula = y ~ poly(x, 1)) +
ggtitle("Linear model: medv ~ ptratio")
Now repeat the step function plot, using ptratio instead of lstat.
qplot(data = Boston, x = ptratio, y = medv,
xlab = "pupil : teacher ratio",
ylab = "Median home value ($1000's)") +
stat_smooth(method = "lm",
formula = y ~ cut(x, breaks = c(-Inf, 13.5, 15.5, 18.5, 20, Inf))) +
ggtitle("Step functions")