This problem once again uses the bikes data.
(a) Use the transform and mapvalues functions to map the values of the season and weathersit variables to something more interpretable. Your new weathersit variable should have levels: Clear, Cloud, Light.precip, Heavy.precip.
# Edit me
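One possible sketch (assuming the bikes data frame is already loaded, the plyr package is attached, and that season is coded 1-4 as Winter, Spring, Summer, Fall; check the data dictionary before relying on these labels):
# Sketch: recode season and weathersit using transform() and mapvalues()
bikes <- transform(bikes,
                   season = mapvalues(season, from = 1:4,
                                      to = c("Winter", "Spring", "Summer", "Fall")),
                   weathersit = mapvalues(weathersit, from = 1:4,
                                          to = c("Clear", "Cloud", "Light.precip", "Heavy.precip")))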
(b) Fit a linear regression model with cnt as the outcome, and season, workingday, weathersit, temp, atemp and hum as covariates. Use the summary function to print a summary of the model showing the model coefficients.
Note that you may wish to code the workingday variable as a factor, though this will not change the estimated coefficient or its interpretation (workingday is already a 0-1 indicator).
# Edit me
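A minimal sketch, assuming the recoded bikes data frame from part (a):
# Sketch: regression of cnt on the requested covariates
bikes.lm <- lm(cnt ~ season + workingday + weathersit + temp + atemp + hum,
               data = bikes)
summary(bikes.lm)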
(c) How do you interpret the coefficient of workingday in the model? What is its p-value? Is it statistically significant?
# Edit me
(d) How do you interpret the coefficient corresponding to the Light.precip weather situation? What is its p-value? Is it statistically significant?
# Edit me
(e) Use the pairs function to construct a pairs plot of temp, atemp and hum. The bottom panel of your plot should show correlations (see example in Lecture 9).
# Edit me
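One way to get correlations into a panel (a sketch; panel.cor here is a small helper along the lines of the examples in ?pairs, not necessarily the exact Lecture 9 code):
# Sketch: pairs plot with correlations shown in the lower panel
panel.cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr"); on.exit(par(usr))   # save and restore plot coordinates
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, format(cor(x, y), digits = digits), cex = 1.5)
}
pairs(bikes[, c("temp", "atemp", "hum")], lower.panel = panel.cor)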
Do you observe any strong collinearities between the variables?
# Edit me
(f) Use the update function to update your linear model by removing the temp variable. Has the coefficient of atemp changed? Has it changed a lot? Explain.
# Edit me
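A sketch using update() on the model from part (b) (bikes.lm is the name assumed in the earlier sketch):
# Sketch: drop temp and re-examine the atemp coefficient
bikes.lm.2 <- update(bikes.lm, . ~ . - temp)
summary(bikes.lm.2)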
(g) How do you interpret the coefficient of atemp in the model from part (f)?
(h) Use the anova() function on your model from part (f) to assess whether weathersit is a statistically significant factor in the model. Interpret your findings.
# Edit me
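A sketch, assuming the part (f) model is named bikes.lm.2 as above; anova() on a single lm gives a sequential ANOVA table, whose weathersit row carries the relevant F-test:
# Sketch: sequential ANOVA table for the updated model
anova(bikes.lm.2)
# An explicit nested-model comparison gives the same kind of test:
anova(update(bikes.lm.2, . ~ . - weathersit), bikes.lm.2)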
This problem will guide you through the process of conducting a simulation to investigate statistical significance in regression.
In this setup, we’ll repeatedly simulate n observations of 3 variables: an outcome y, a covariate x1 that’s associated with y, and a covariate x2 that is unassociated with y. Our model is:
\[ y_i = 0.5 x_{i1} + \epsilon_i \]
where the \(\epsilon_i\) are independent normal noise variables having standard deviation 5 (i.e., Normal random variables with mean 0 and standard deviation 5).
Here’s the setup.
set.seed(12345) # Set random number generator seed
n <- 200 # Number of observations
x1 <- runif(n, min = 0, max = 10) # Random covariate
x2 <- rnorm(n, 0, 10) # Another random covariate
To generate a random realization of the outcome y, use the following command.
# Random realization of y
y <- 0.5 * x1 + rnorm(n, mean = 0, sd = 5)
Here are plots of that random realization of the outcome y, plotted against x1 and x2.
qplot(x = x1, y = y)
qplot(x = x2, y = y)
(a) Write code that implements the following simulation (you’ll want to use a for loop):
for 2000 simulations:
    generate a random realization of y
    fit regression of y on x1 and x2
    record the coefficient estimates, standard errors and p-values for x1 and x2
At the end you should have 2000 instances of estimated slopes, standard errors, and corresponding p-values for both x1 and x2. It’s most convenient to store these in a data frame.
# Note the cache = TRUE header here. This tells R Markdown to store the output of this code chunk and only re-run the code when code in this chunk changes. By caching you won't wind up re-running this code every time you knit.
# Edit me
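One way to structure the loop (a sketch; n, x1 and x2 come from the setup chunk above, and the sim.results column names are my own):
# Sketch: run 2000 simulations, storing estimates, SEs, and p-values
n.sim <- 2000
sim.results <- data.frame(coef.x1 = numeric(n.sim), se.x1 = numeric(n.sim),
                          pval.x1 = numeric(n.sim), coef.x2 = numeric(n.sim),
                          se.x2 = numeric(n.sim), pval.x2 = numeric(n.sim))
for (i in 1:n.sim) {
  y <- 0.5 * x1 + rnorm(n, mean = 0, sd = 5)  # new realization of y
  coefs <- summary(lm(y ~ x1 + x2))$coefficients
  # Columns 1, 2, 4 of the coefficient table: Estimate, Std. Error, Pr(>|t|)
  sim.results[i, ] <- c(coefs["x1", c(1, 2, 4)], coefs["x2", c(1, 2, 4)])
}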
(b) This problem has multiple parts.
- Construct a histogram of the coefficient estimates for x1.
- Calculate the average of the coefficient estimates for x1. Is the average close to the true value?
- Calculate the average of the standard errors for x1. Calculate the standard deviation of the coefficient estimates for x1. Are these numbers similar?
# Edit me
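A sketch of the summaries, using the sim.results names assumed above:
# Sketch: distribution, center, and spread of the x1 estimates
hist(sim.results$coef.x1, breaks = 30,
     main = "Coefficient estimates for x1", xlab = "Estimate")
mean(sim.results$coef.x1)  # compare to the true value, 0.5
mean(sim.results$se.x1)    # average estimated standard error
sd(sim.results$coef.x1)    # actual spread of the coefficient estimates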
Take-away from this problem: the Std. Error value in the linear model summary is an estimate of the standard deviation of the coefficient estimates.
(c) Repeat part (b) for x2.
# Edit me
(d) Construct a histogram of the p-values for the coefficient of x1. What do you see? What % of the time is the p-value significant at the 0.05 level?
# Edit me
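A sketch, again using the assumed sim.results names:
# Sketch: p-value distribution for x1 across the simulations
hist(sim.results$pval.x1, breaks = 20,
     main = "p-values for x1", xlab = "p-value")
mean(sim.results$pval.x1 < 0.05)  # proportion significant at the 0.05 level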
(e) Repeat part (d) with x2. What % of the time is the p-value significant at the 0.05 level?
# Edit me
(f) Given a coefficient estimate \(\hat \beta\) and a standard error estimate \(\hat{se}(\hat\beta)\), we can construct an approximate 95% confidence interval using the “2 standard error rule”, i.e., \[ [\hat \beta - 2 \hat{se}, \, \hat \beta + 2 \hat{se}] \] is an approximate 95% confidence interval for the true unknown coefficient.
As part of your simulation you stored \(\hat \beta\) and \(\hat{se}\) values for 2000 simulation instances. Use these estimates to construct approximate confidence intervals and answer the following questions.
- What fraction of the approximate confidence intervals for x1 actually contain the true value of the coefficient (\(\beta_1 = 0.5\))?
Replace this text with your answer. (do not delete the html tags)
- What fraction of the approximate confidence intervals for x2 actually contain the true value of the coefficient (\(\beta_2 = 0\))?
Replace this text with your answer. (do not delete the html tags)
On homework 4 you completed several exercises that involved using ddply to calculate error bars via the t.test command. When the x-axis variable has one or more categories that have low counts, it’s possible for this approach to fail. Here’s an example of this issue.
# Data import & transformation
# DO NOT CHANGE THIS CODE
income.data <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/homework/nlsy97_income.csv", header=TRUE)
income.data <- subset(income.data,
select = c("R0536300", "T7545600", "T6656900"),
subset = T6656900 %in% c(1:9, 11))
income.data[income.data < 0] <- NA
income.data <- na.omit(income.data)
colnames(income.data) <- c("gender", "income", "household.size")
income.data <- transform(income.data,
gender = mapvalues(gender, 1:2, c("Male", "Female")))
Let’s look at how many observations we have at every level of household size:
with(income.data, table(household.size, gender))
## gender
## household.size Female Male
## 1 252 404
## 2 686 757
## 3 586 646
## 4 489 547
## 5 277 263
## 6 121 106
## 7 52 42
## 8 22 16
## 9 17 5
## 11 3 1
We can see that counts are very low in some of the categories. For instance, we have just 5 men in households of size 9.
(a) Construct a modified t-test function that operates in the following way (CI below stands for “confidence interval”):
Inputs:
- x: numeric vector
- group: 2-level factor vector of the same length as x
- k: integer, giving minimum group size
- flip.sign: logical. Should the CI and mean be flipped?
Outputs:
- lower: the lower value of the error bar (CI)
- upper: the upper value of the error bar (CI)
- is.signif: for groups of size >= k, this value should be 0 if the CI overlaps 0, and 1 if it doesn’t. For groups of size < k, this value should always be 0.
Behavior:
- If group has at least k observations in each level, return the lower and upper confidence interval values from running t.test with formula x ~ group.
- If group has fewer than k observations in one or both levels of this factor, lower and upper should both be set equal to the mean difference.
- If flip.sign = TRUE, the difference and confidence interval should be calculated as group2 - group1 instead of group1 - group2.
# Add comments to this function
betterTTestErrorBars <- function(x, groups, k = 5, flip.sign = TRUE) {
# Edit me
}
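Here is one possible implementation sketch (an illustration of the spec above, not the official solution; note the skeleton names its second argument groups even though the spec calls it group):
betterTTestErrorBars <- function(x, groups, k = 5, flip.sign = TRUE) {
  counts <- table(groups)                 # observations in each group level
  group.means <- tapply(x, groups, mean)
  mean.diff <- unname(group.means[1] - group.means[2])  # group1 - group2
  if (all(counts >= k)) {
    # Both groups large enough: use the t-test confidence interval
    ci <- t.test(x ~ groups)$conf.int
    lower <- ci[1]
    upper <- ci[2]
  } else {
    # Too few observations: collapse the interval to the mean difference
    lower <- upper <- mean.diff
  }
  if (flip.sign) {
    # Report group2 - group1 instead of group1 - group2
    flipped.lower <- -upper
    upper <- -lower
    lower <- flipped.lower
  }
  # Significant only when both groups have >= k observations and the CI excludes 0
  is.signif <- as.numeric(all(counts >= k) && (lower > 0 || upper < 0))
  c(lower = lower, upper = upper, is.signif = is.signif)
}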
(b) Modify the code below to use the betterTTestErrorBars command instead of the t.test command to produce a bar chart with error bars showing how the income gap varies with household size. Specify that the bars should be filled based on is.significant. Note: you’ll need to set eval = TRUE in the code chunk header for the code to run.
# Calculate income gaps (male - female) and 95% confidence intervals
# for the gap
# MODIFY THIS CODE
gap.data.conf <- ddply(nlsy, ~ household.size, summarize,
income.gap = mean(income[gender == "Male"], na.rm = TRUE) - mean(income[gender == "Female"], na.rm = TRUE),
upper = -t.test(income ~ gender)$conf.int[1],
lower = -t.test(income ~ gender)$conf.int[2],
is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))
# Re-order the race factor according to gap size
gap.data.conf <- transform(gap.data.conf,
race = reorder(race, income.gap))
# Plot, with error bars
# MODIFY THIS CODE TO WORK FOR household.size INSTEAD OF race
ggplot(data = gap.data.conf, aes(x = race, y = income.gap)) +
geom_bar(stat = "identity") +
xlab("Race") +
ylab("Income gap($)") +
ggtitle("Income gap between men and women, by race") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
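A sketch of how the modified chunk might look, assuming the betterTTestErrorBars implementation sketched above (calling the helper three times per group is wasteful but keeps the summarize call simple):
# Sketch: income gaps (male - female) with betterTTestErrorBars
gap.data.conf <- ddply(income.data, ~ household.size, summarize,
                       income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
                         mean(income[gender == "Female"], na.rm = TRUE),
                       lower = betterTTestErrorBars(income, gender)["lower"],
                       upper = betterTTestErrorBars(income, gender)["upper"],
                       is.significant = betterTTestErrorBars(income, gender)["is.signif"])
# Plot, filling bars by the significance indicator
ggplot(data = gap.data.conf,
       aes(x = factor(household.size), y = income.gap,
           fill = factor(is.significant))) +
  geom_bar(stat = "identity") +
  xlab("Household size") +
  ylab("Income gap($)") +
  ggtitle("Income gap between men and women, by household size") +
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size = 12))
Note that with the default flip.sign = TRUE the interval is Male minus Female (the gender levels sort alphabetically), which matches the sign of income.gap.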
For this problem we’ll return to the gapminder data set from Lecture 10.
gapminder <- read.delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt")
(a) Use dlply to produce a list of linear regression models, one for each country, regressing gdpPercap on year.
# Edit me
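A sketch (assumes plyr is attached and gapminder is loaded as above):
# Sketch: one lm fit of gdpPercap on year per country
country.models <- dlply(gapminder, ~ country,
                        function(df) lm(gdpPercap ~ year, data = df))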
(b) Use ldply on your list from part (a) to produce a data frame displaying for each country whether the coefficient of year was significant at the 0.05 level. Your output should be a two-column data frame: the first column gives the country, and the second column displays a 0 if the coefficient of year was not significant, and a 1 if it was significant.
# Edit me
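A sketch building on country.models from part (a); ldply restores the country as a column when applied to dlply output:
# Sketch: 0/1 indicator of whether the year coefficient is significant
year.signif <- ldply(country.models, function(model) {
  p.val <- summary(model)$coefficients["year", "Pr(>|t|)"]
  c(is.signif = as.numeric(p.val < 0.05))
})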
(c) The following code produces a data frame giving the continent for each country.
summary.continent <- ddply(gapminder, ~ country, summarize, continent = unique(continent))
Use the merge function to merge your data frame from part (b) with summary.continent to produce a data frame that shows both the slope significance indicator and also the continent. (See Lecture 10 notes for examples of how to do this.)
# Edit me
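A sketch, assuming year.signif from part (b):
# Sketch: merge the significance indicators with the continent lookup
signif.continent <- merge(year.signif, summary.continent, by = "country")
head(signif.continent)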
(d) Using your data frame from part (c), produce a contingency table showing counts for each combination of slope significance and continent.
# Edit me
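A sketch using the merged data frame from part (c):
# Sketch: counts for each combination of significance and continent
with(signif.continent, table(is.signif, continent))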
(e) What do you observe in the table from part (d)? What does a non-significant slope (coefficient of year) suggest about a country’s economic growth?
Replace this text with your answer. (do not delete the html tags)