This homework is designed to give you practice with calculating error bars (confidence intervals) with ddply and using ggplot2 graphics to produce insightful plots of the results.
library(plyr)
library(dplyr)
library(ggplot2)
You will continue using the adult data set that you first encountered on Homework 3. This data set is loaded below.
adult.data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                       header = FALSE, fill = FALSE, strip.white = TRUE,
                       col.names = c("age", "type_employer", "fnlwgt", "education",
                                     "education_num", "marital", "occupation", "relationship", "race", "sex",
                                     "capital_gain", "capital_loss", "hr_per_week", "country", "income"))
# Indicator variable: 1 if income is above 50K, 0 otherwise
adult.data <- mutate(adult.data,
                     high.income = as.numeric(income == ">50K"))
Using ddply and 1-sample t-testing, construct a table that shows the average capital_gain across education, along with the lower and upper endpoints of a 95% confidence interval. Your table should look something like:

  education     mean      lower    upper
1      10th 404.5745  91.893307 717.2557
2      11th 215.0979 144.306937 285.8888
3      12th 284.0878 126.824531 441.3510
...
# Edit me
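As a hint for the table described above, here is a minimal sketch of the ddply plus t.test pattern on a stand-in data set. It uses the built-in mtcars data (mpg grouped by cyl) rather than adult.data, and the name ci.table is hypothetical; treat it as one possible approach, not the expected solution.

```r
library(plyr)

# Stand-in data: mean mpg by number of cylinders in the built-in mtcars data.
# ci.table is a hypothetical name used only for this illustration.
ci.table <- ddply(mtcars, ~ cyl, function(df) {
  tt <- t.test(df$mpg)                    # 1-sample t-test; 95% CI is the default
  data.frame(mean  = unname(tt$estimate), # sample mean of the group
             lower = tt$conf.int[1],      # lower CI endpoint
             upper = tt$conf.int[2])      # upper CI endpoint
})
ci.table
```

Returning a one-row data frame from the function keeps each group's mean and both interval endpoints together, so ddply can stack them into one table.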
You may find the factor(..., levels = ...) command helpful here. For the post-high school grades, you can use the ordering: Assoc-voc, Assoc-acdm, Some-college, Bachelors, Masters, Prof-school, Doctorate.
# Edit me
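To illustrate how factor(..., levels = ...) fixes the order of a categorical variable, here is a small sketch on a made-up vector. The vector grades and the four levels shown are hypothetical and cover only a subset of the education values.

```r
# Hypothetical vector of education labels, for illustration only
grades <- c("Masters", "10th", "Bachelors", "9th", "10th")

# Levels are stored in the order given, not alphabetically
grades.ordered <- factor(grades, levels = c("9th", "10th", "Bachelors", "Masters"))

levels(grades.ordered)
table(grades.ordered)   # counts are reported in the level order set above
```

Plots and tables built from the factor follow the level order, which is why setting the levels explicitly controls the x-axis ordering.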
Use geom_errorbar to overlay error bars as specified by the confidence interval endpoints you computed. You should tilt your x-axis text to limit overlap of x-axis labels. Set an appropriate y-axis label.
# Edit me
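The following sketch shows one way to overlay error bars and tilt the x-axis labels. The data frame toy and its values are made up, and drawing the means as bars is an assumption, since the plot type is not stated in the text above.

```r
library(ggplot2)

# Hypothetical summary table with the same column layout as the homework table
toy <- data.frame(group = c("a", "b", "c"),
                  mean  = c(2.0, 3.5, 1.2),
                  lower = c(1.5, 2.8, 0.4),
                  upper = c(2.5, 4.2, 2.0))

p <- ggplot(toy, aes(x = group, y = mean)) +
  geom_bar(stat = "identity") +                                # bar height = group mean
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) + # CI endpoints
  ylab("Group mean (toy units)") +                             # descriptive y-axis label
  theme(axis.text.x = element_text(angle = 60, hjust = 1))     # tilt x-axis text
p
```

The ymin and ymax aesthetics take the interval endpoints directly, so no extra arithmetic is needed once the table has lower and upper columns.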
Your answer goes here!
Using ddply and 2-sample t-testing, construct a table that shows the difference in the proportion of men and women earning above 50K across different employer types. E.g., if 20% of men and 15% of women in a group earn above 50K, the difference in proportion is 0.2 - 0.15 = 0.05. Your table should use the 2-sample t-test to also calculate the lower and upper endpoints of a 95% confidence interval. (While a t-test isn’t appropriate for binary data when the number of observations is small, we’ll ignore this issue for now.) Your table should look something like:

  type_employer  prop.diff     lower     upper
1             ? 0.07743971 0.0504165 0.1044629
2   Federal-gov 0.31059432 0.2532462 0.3679424
3     Local-gov 0.18361338 0.1461258 0.2211009
...
# Edit me
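As a hint, here is a sketch of the 2-sample version of the ddply pattern on made-up data. The helper make.group, the data frame toy, and its counts are hypothetical; the structure (a binary outcome, a two-level grouping variable, and a variable to split on) mirrors the homework setup.

```r
library(plyr)

# Hypothetical helper: n rows for one employer/sex cell, n.high of them with outcome 1
make.group <- function(employer, sex, n.high, n) {
  data.frame(employer = employer, sex = sex,
             high = rep(c(1, 0), times = c(n.high, n - n.high)))
}

# Made-up data: two employer types, 100 men and 100 women in each
toy <- rbind(make.group("A", "Male",   30, 100),
             make.group("A", "Female", 20, 100),
             make.group("B", "Male",   45, 100),
             make.group("B", "Female", 15, 100))

diff.table <- ddply(toy, ~ employer, function(df) {
  men   <- df$high[df$sex == "Male"]
  women <- df$high[df$sex == "Female"]
  tt <- t.test(men, women)               # CI is for mean(men) - mean(women)
  data.frame(prop.diff = mean(men) - mean(women),
             lower = tt$conf.int[1],
             upper = tt$conf.int[2])
})
diff.table
```

Passing the two vectors explicitly, instead of using the formula interface, makes the direction of the difference (men minus women) unambiguous.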
Your answer goes here!
You may find the is.nan function useful here.
# Edit me
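A short sketch of is.nan in use, assuming the goal is to drop rows whose computed value is NaN (for example, from a group with no observations). The data frame tab and its values are made up for illustration.

```r
# Hypothetical results table in which one group produced NaN
tab <- data.frame(type      = c("x", "y", "z"),
                  prop.diff = c(0.1, NaN, 0.3))

# Keep only the rows where prop.diff is a valid number
tab.clean <- subset(tab, !is.nan(prop.diff))
tab.clean
```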
Use geom_errorbar to overlay error bars as specified by the confidence interval endpoints you computed. You should tilt your x-axis text to limit overlap of x-axis labels. Set an appropriate y-axis label.
# Edit me
Use the reorder command from Lecture 7. Display the plot with the re-ordered x-axis variable.
# Edit me
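For reference, a minimal sketch of how reorder changes the level order of a factor according to a second variable. The data frame toy is made up; in a plot, the same call can be placed inside aes(), e.g. x = reorder(group, value).

```r
# Hypothetical data: three groups with one value each
toy <- data.frame(group = factor(c("a", "b", "c")),
                  value = c(2, 5, 1))

# Reorder the factor levels by increasing value
toy$group <- reorder(toy$group, toy$value)
levels(toy$group)

# For decreasing order, negate the sorting variable
levels(reorder(toy$group, -toy$value))
```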
Your answer goes here!
Your answer goes here!
  education     mean      lower    upper is.signif
1      10th 404.5745  91.893307 717.2557         1
2      11th 215.0979 144.306937 285.8888         1
3      12th 284.0878 126.824531 441.3510         1
4   1st-4th 125.8750   5.656611 246.0934         1
5   5th-6th 176.0210  74.643760 277.3983         1
6   7th-8th 233.9396 154.388060 313.4912         1
7       9th 342.0895 -44.104225 728.2832         0
...
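Judging from the rows shown, is.signif appears to be 1 when the confidence interval excludes 0 and 0 otherwise; that reading is an assumption, since the definition is not given in the text above. Under that assumption, the column could be computed as in this sketch, which reuses two rows from the table.

```r
# Two rows copied from the table above (10th and 9th grade)
ci <- data.frame(education = c("10th", "9th"),
                 lower     = c(91.893307, -44.104225),
                 upper     = c(717.2557, 728.2832))

# Assumed rule: significant if the 95% CI does not contain 0
ci$is.signif <- as.numeric(ci$lower > 0 | ci$upper < 0)
ci
```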
# Edit me
# Edit me