Final project practice: Analysis of a different NLSY data set
For the purpose of this analysis we’re going to look at a different version of the NLSY data set. Here I’m loading in data from the 1979 cohort. Note that for your final project you’re working with a more contemporary cohort, the 1997 cohort.
nlsy <- read_csv("http://www.andrew.cmu.edu/user/achoulde/94842/final_project/nlsy79/nlsy79_income.csv")
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
# Change column names (mostly) to question name abbreviations
colnames(nlsy) <- c("VERSION_R25_2012",
"CASEID_1979",
"FAM-2A_1979",
"FAM-POB_1979",
"FAM-3_1979",
"FAM-3A_1979",
"FAM-RES_1979",
"FAM-6_1979",
"R_REL-1_COL_1979",
"SCHOOL-31_1979",
"MIL-6_1979",
"WOMENS-ROLES_000001_1979",
"WOMENS-ROLES_000002_1979",
"WOMENS-ROLES_000003_1979",
"WOMENS-ROLES_000004_1979",
"WOMENS-ROLES_000006_1979",
"WOMENS-ROLES_000007_1979",
"WOMENS-ROLES_000008_1979",
"EXP-OCC_1979",
"EXP-9_1979",
"race",
"gender",
"MARSTAT-KEY_1979",
"FAMSIZE_1979",
"POVSTATUS_1979",
"POLICE-1_1980",
"POLIC-1C_1980",
"POLICE-2_1980",
"ALCH-2_1983",
"DS-8_1984",
"DS-9_1984",
"Q13-5_TRUNC_REVISED_1990",
"POVSTATUS_1990",
"HGCREV90_1990",
"jobs.num",
"NUMCH90_1990",
"AGEYCH90_1990",
"DS-12_1998",
"DS-13_1998",
"INDALL-EMP.01_2000",
"CPSOCC80.01_2000",
"OCCSP-55I_CODE_2000",
"Q2-15B_2000",
"Q10-2_2000",
"Q13-5_TRUNC_REVISED_2000",
"FAMSIZE_2000",
"TNFI_TRUNC_2000",
"POVSTATUS_2000",
"MARSTAT-COL_2000",
"MARSTAT-KEY_2000",
"MO1M1B_XRND",
"Q2-10B~Y_2012",
"INDALL-EMP.01_2012",
"OCCALL-EMP.01_2012",
"OCCSP-55I_CODE_2012",
"Q2-15A_2012",
"Q12-6_2012",
"income",
"Q13-5_SR000001_2012",
"Q13-5_SR000002_2012",
"Q13-18_TRUNC_2012",
"Q13-18_SR000001_TRUNC_2012",
"FAMSIZE_2012",
"REGION_2012",
"HGC_2012",
"URBAN-RURAL_2012",
"JOBSNUM_2012")
# Map all negative values to missing (NA)
# DO NOT DO THIS IN YOUR PROJECT WITHOUT CAREFUL JUSTIFICATION
nlsy[nlsy < 0] <- NA
We’ll now walk through the following tasks: - Running a regression model - Using anova() to compare two nested regression models to assess whether a categorical variable is statistically significant - Updating models to include interaction terms - Interpreting coefficients of categorical x categorical interaction terms - Testing for statistial significance of an interaction term in a model - Getting a similar but simplified analysis with figures and summary tables
(a) Run a linear regression modeling income
as a linear function of gender
, race
and jobs.num
.
# Transform and relabel gender and race variables
nlsy <- mutate(nlsy,
gender = recode_factor(gender,
`2` = "Female",
`1` = "Male"),
race = recode_factor(race,
`2` = "Black",
`3` = "Other",
`1` = "Hispanic"))
Here we’re setting “Female” as the baseline category for the gender variable and “Black” as the baseline for the race category. For race, “Other” (non-Black, non-Hispanic, a category that includes “White”) would be a more natural baseline for interpretability purposes. In your final project you’ll want to think carefully about what baseline categories you select. I am not picking the best baselines here because I want to show you how to interpret coefficients in general, even if the baseline isn’t ideally chosen.
# Run regression
nlsy.lm <- lm(income ~ gender + race + jobs.num, data = nlsy)
# Output summary
summary(nlsy.lm)
##
## Call:
## lm(formula = income ~ gender + race + jobs.num, data = nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66845 -29250 -11189 14642 319529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19079.7 1632.6 11.686 < 2e-16 ***
## genderMale 24895.5 1316.5 18.910 < 2e-16 ***
## raceOther 22870.1 1507.0 15.176 < 2e-16 ***
## raceHispanic 8929.8 1899.9 4.700 0.00000265 ***
## jobs.num -370.8 146.4 -2.533 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53620 on 6738 degrees of freedom
## (5943 observations deleted due to missingness)
## Multiple R-squared: 0.08067, Adjusted R-squared: 0.08012
## F-statistic: 147.8 on 4 and 6738 DF, p-value: < 2.2e-16
(b) Use the anova()
function to assess whether race is a statistically significant predictor of income by comparing your model from part (a) to a model with just gender
and jobs.num
.
Hint: You can use the update()
function to drop predictors from a model.
anova(update(nlsy.lm, . ~ . - race), nlsy.lm)
## Analysis of Variance Table
##
## Model 1: income ~ gender + jobs.num
## Model 2: income ~ gender + race + jobs.num
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6740 20059555089013
## 2 6738 19369215650204 2 690339438809 120.07 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Race turns out to be a highly statistically signficant predictor of income in the model. You can see this also by looking at the coefficient estimates and standard errors — the estimated coefficients are very large, with relatively small standard errors.
(c) Update your linear regression model from part (a) to also include an interaction term between race
and gender
.
nlsy.lm.interact <- update(nlsy.lm, . ~ . + race*gender)
summary(nlsy.lm.interact)
##
## Call:
## lm(formula = income ~ gender + race + jobs.num + gender:race,
## data = nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71941 -29710 -10500 14716 316605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25837.0 1859.0 13.898 < 2e-16 ***
## genderMale 9551.9 2361.5 4.045 0.00005293 ***
## raceOther 9924.3 2084.9 4.760 0.00000198 ***
## raceHispanic 4149.5 2620.5 1.583 0.1134
## jobs.num -276.2 146.0 -1.892 0.0585 .
## genderMale:raceOther 26627.8 2985.7 8.918 < 2e-16 ***
## genderMale:raceHispanic 9788.9 3778.6 2.591 0.0096 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53290 on 6736 degrees of freedom
## (5943 observations deleted due to missingness)
## Multiple R-squared: 0.09194, Adjusted R-squared: 0.09113
## F-statistic: 113.7 on 6 and 6736 DF, p-value: < 2.2e-16
(d) How many coefficients are estimated for this interaction term?
There are 2 estimated coefficients.
(e) What do you think these coefficients mean? How would you interpret them?
coef.race.other <- round(coef(nlsy.lm.interact)["genderMale:raceOther"], 0)
coef.race.other
## genderMale:raceOther
## 26628
These coefficients correspond to differences in earnings across gender and race that aren’t accounted for by a model that assumes independent effects for race and gender. That is, the income difference between men and women appears to depend strongly on the individual’s race. For instance, the income difference among mixed-race men and mixed-race women is $26628 more than the difference between black men and black women. Race is a factor that is highly associated with the income gap between men and women.
Note: This is different from saying that race is associated with income. The interaction terms are estimates of how the income gap differs across the race categories.
Main effects vs. interactions
The final project asks you to explore the question of whether there are any factors that exacerbate or mitigate the income gap between men and women. It’s important to note that this is different from asking whether there are factors that affect income. While it is interesting to investigate factors that affect income, there may be factors that affect income but do not affect the income gap. This is the difference between significant main effects and significant interactions.
Let’s look at the two linear models again.
# Note how digits is specified here to round each column to a different number of decimal values
kable(coef(summary(nlsy.lm)), digits = c(0, 0, 2, 4))
(Intercept) |
19080 |
1633 |
11.69 |
0.0000 |
genderMale |
24896 |
1317 |
18.91 |
0.0000 |
raceOther |
22870 |
1507 |
15.18 |
0.0000 |
raceHispanic |
8930 |
1900 |
4.70 |
0.0000 |
jobs.num |
-371 |
146 |
-2.53 |
0.0113 |
kable(coef(summary(nlsy.lm.interact)), digits = c(0, 0, 2, 4))
(Intercept) |
25837 |
1859 |
13.90 |
0.0000 |
genderMale |
9552 |
2361 |
4.04 |
0.0001 |
raceOther |
9924 |
2085 |
4.76 |
0.0000 |
raceHispanic |
4150 |
2621 |
1.58 |
0.1134 |
jobs.num |
-276 |
146 |
-1.89 |
0.0585 |
genderMale:raceOther |
26628 |
2986 |
8.92 |
0.0000 |
genderMale:raceHispanic |
9789 |
3779 |
2.59 |
0.0096 |
According to the first model, the average income difference between an Hispanic male and a Hispanic female (both of whom have held, say, 3 jobs) is just:
Estimated.income(Male, Hispanic, 3 jobs) - Estimated.income(Female, Hispanic, 3 jobs)
= ( (Intercept)
+ genderMale
+ raceHispanic
+ 3 * jobs.num
) - ((Intercept)
+ raceHispanic
+ 3 * jobs.num
)
= gendermale
= 24896
Indeed, this is the average difference in income between men and women of the same jobs.num and same race, regardless of their specific jobs.num and race.
According to the second model, the one that contains race-gender interactions, the average difference between the same two individuals is given by:
Estimated.income(Male, Hispanic, 3 jobs) - Estimated.income(Female, Hispanic,3 jobs)
= ( (Intercept)
+ genderMale
+ raceHispanic
+ 3 * jobs.num
+ gendermale:raceHispanic
) - ((Intercept)
+ raceHispanic
+ 3 * jobs.num
)
= genderMale
+ genderMale:raceHispanic
= 19341
This difference doesn’t depend on the number of jobs that we assumed both individuals have held—we would get the same difference if we assumed that both individuals held 1 prior job, or 10 prior jobs—but it does depend on the assumed race. Running the same calculation to compare a non-Black, non-Hispanic race man and a non-Black non-Hispanic race woman both of whom held x
prior jobs, we would get that the average income difference is:
Estimated.income(Male, Other, x jobs) - Estimated.income(Female, Other, x jobs)
= gendermale
+ genderMale:raceOther
= 36180
The baseline level here is race = ‘Black’. So the model also tells us that the average income difference between Black men and women with the same number of prior jobs:
Estimated.income(Male, Black, x jobs) - Estimated.income(Female, Black, x jobs)
= gendermale
+ 0
= 9552
By looking at the interaction effect between race
and gender
, we do find evidence that race is a factor affecting the income gap between men and women. E.g., There is a significantly larger gap (in an absolute $ sense) between Hispanic men and women race compared to Black men and women.
Testing significance of the interaction term with anova
By looking at the individual p-values for the interaction term coefficients, we can answer the question of whether the difference in income gap differs between Black individuals and those of a different given race category. To more directly answer the question of whether the observed differences in income gap across the race categories are statistically significant, we need to use the anova
command.
anova(nlsy.lm, nlsy.lm.interact)
## Analysis of Variance Table
##
## Model 1: income ~ gender + race + jobs.num
## Model 2: income ~ gender + race + jobs.num + gender:race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6738 19369215650204
## 2 6736 19131704011553 2 237511638651 41.812 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is statistically significant, so we can reject the null hypothesis that the income gap is the same across all the race categories. In other words, the data suggests that the income gap between men and women does vary with race.
The regression-free approach
In the previous examples we used a regression because we wanted to adjust for the jobs.num
variable. However, before delving into a regression, you can get a lot of the same insights with some simple plots and tables.
Here’s a set of commands that calculates the difference in average income between men and women for each race category.
nlsy %>%
group_by(race) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
## # A tibble: 3 x 2
## race income.gap
## <fct> <dbl>
## 1 Black 8683.
## 2 Other 35543.
## 3 Hispanic 18190.
This is a great, easy-to-interpret table that you can include in your analysis. (Remember to round! — not shown here.) By changing out race
for other categorical variables, you can try to identify other factors that may affect the income gap.
Here’s a one-line plotting command that summarizes the previous table in an east-to-interpret plot.
gap.data <- nlsy %>%
group_by(race) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
ggplot(data = gap.data, aes(x = race, y = income.gap, fill = race)) +
geom_bar(stat = "identity") +
xlab("Race") +
ylab("Income gap($)") +
ggtitle("Income gap between men and women, by race") +
guides(fill = FALSE)
We’ll now look at a couple of ways of modifying/improving this graphic.
Re-order the variables on the x axis so that the bars are sorted by in descending order of height. i.e., plot the bars in descending order of wage gap.
By running t-tests you can obtain confidence intervals for the income gaps. You can plot these using the geom_errorbar
layer in ggplot.
Stratified regressions with ggplot and dplyr
Gapminder life expectancy data
gapminder <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt",
delim = "\t") # Load data
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## pop = col_double(),
## continent = col_character(),
## lifeExp = col_double(),
## gdpPercap = col_double()
## )
gapminder # A look at the data
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## 7 Afghanistan 1982 12881816 Asia 39.9 978.
## 8 Afghanistan 1987 13867957 Asia 40.8 852.
## 9 Afghanistan 1992 16317921 Asia 41.7 649.
## 10 Afghanistan 1997 22227415 Asia 41.8 635.
## # … with 1,694 more rows
Example: Fitting a linear model for each country
We’re now going to go through an example where we get the life expectancy in 1952 and the rate of change in life expectancy over time for each country. The rate of change will be obtained by regressing lifeExp
on year
.
What can we learn from this output?
Let’s summarize our findings by creating a bar chart for the origin and slope, with the bars colored by continent. We’ll do this in ggplot.
Plotting origins coloured by continent
This code is analogous to that presented at the end of Lecture 6
# Reorder country factor by origin
# Construct bar chart
progress.df %>%
mutate(country = reorder(country, origin)) %>%
ggplot(aes(x = country, y = origin, fill = continent)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
Plotting slopes coloured by continent
# Reorder country factor by slope
# Construct bar chart
progress.df %>%
mutate(country = reorder(country, slope)) %>%
ggplot(aes(x = country, y = slope, fill = continent)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
These are very interesting plots. What can you tell from looking at them?
Looking at per capita GDP by year
Let’s start by looking at some plots of how GDP per capita varied by year
# Use qplot from ggplot2 to generate plots
qplot(year, gdpPercap, facets = ~ country, data = gapminder, colour = continent) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
What if we want to rearrange the plots by continent? This can be done by changing the order of the country
level.
# First step: reorder the countries by continent
# Produce a data frame that has the countries ordered alphabetically within continent
# Arrange sorts the data according to the variable(s) provided
country.df <- gapminder %>%
group_by(country) %>%
summarize(continent = continent[1]) %>%
arrange(continent)
gapminder.ordered <- gapminder %>%
mutate(country = factor(country, levels = country.df$country))
# Let's make sure that things are now ordered correctly...
levels(gapminder.ordered$country)
## [1] "Algeria" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Djibouti"
## [15] "Egypt" "Equatorial Guinea"
## [17] "Eritrea" "Ethiopia"
## [19] "Gabon" "Gambia"
## [21] "Ghana" "Guinea"
## [23] "Guinea-Bissau" "Kenya"
## [25] "Lesotho" "Liberia"
## [27] "Libya" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mauritius"
## [33] "Morocco" "Mozambique"
## [35] "Namibia" "Niger"
## [37] "Nigeria" "Reunion"
## [39] "Rwanda" "Sao Tome and Principe"
## [41] "Senegal" "Sierra Leone"
## [43] "Somalia" "South Africa"
## [45] "Sudan" "Swaziland"
## [47] "Tanzania" "Togo"
## [49] "Tunisia" "Uganda"
## [51] "Zambia" "Zimbabwe"
## [53] "Argentina" "Bolivia"
## [55] "Brazil" "Canada"
## [57] "Chile" "Colombia"
## [59] "Costa Rica" "Cuba"
## [61] "Dominican Republic" "Ecuador"
## [63] "El Salvador" "Guatemala"
## [65] "Haiti" "Honduras"
## [67] "Jamaica" "Mexico"
## [69] "Nicaragua" "Panama"
## [71] "Paraguay" "Peru"
## [73] "Puerto Rico" "Trinidad and Tobago"
## [75] "United States" "Uruguay"
## [77] "Venezuela" "Afghanistan"
## [79] "Bahrain" "Bangladesh"
## [81] "Cambodia" "China"
## [83] "Hong Kong, China" "India"
## [85] "Indonesia" "Iran"
## [87] "Iraq" "Israel"
## [89] "Japan" "Jordan"
## [91] "Korea, Dem. Rep." "Korea, Rep."
## [93] "Kuwait" "Lebanon"
## [95] "Malaysia" "Mongolia"
## [97] "Myanmar" "Nepal"
## [99] "Oman" "Pakistan"
## [101] "Philippines" "Saudi Arabia"
## [103] "Singapore" "Sri Lanka"
## [105] "Syria" "Taiwan"
## [107] "Thailand" "Vietnam"
## [109] "West Bank and Gaza" "Yemen, Rep."
## [111] "Albania" "Austria"
## [113] "Belgium" "Bosnia and Herzegovina"
## [115] "Bulgaria" "Croatia"
## [117] "Czech Republic" "Denmark"
## [119] "Finland" "France"
## [121] "Germany" "Greece"
## [123] "Hungary" "Iceland"
## [125] "Ireland" "Italy"
## [127] "Montenegro" "Netherlands"
## [129] "Norway" "Poland"
## [131] "Portugal" "Romania"
## [133] "Serbia" "Slovak Republic"
## [135] "Slovenia" "Spain"
## [137] "Sweden" "Switzerland"
## [139] "Turkey" "United Kingdom"
## [141] "Australia" "New Zealand"
# Use qplot from ggplot2 to generate plots
qplot(year, gdpPercap, facets = ~ country, data = gapminder.ordered, colour = continent) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Setting the y-axis free.
One of the issues with the previous plot is that, while you’re able to see overall trends for some countries, the y-axis scaling swamps the range of data for many others. E.g., the plots for many African countries look like essentially flat lines close to 0. If you want readers to be able to view the data better for each individual country, you can use the scales = "free_y"
argument in facet_wrap. Here’s some code that does this.
ggplot(data = gapminder.ordered,
aes(x = year, y = gdpPercap, colour = continent)) +
geom_point() +
facet_wrap(~ country, scales = "free_y") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'