Download the homework2.Rmd file from Canvas or the course website.
Open homework2.Rmd in RStudio.
Replace the “Your Name Here” text in the author: field with your own name.
Supply your solutions to the homework by editing homework2.Rmd.
When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework2_YourNameHere.Rmd and submit it on Canvas. (YourNameHere should be changed to your own name.)
We’ll start by downloading a publicly available dataset that contains some census data information. This dataset is called income.
# Import data file
income <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/income_data.txt", header=FALSE)
# Give variables names
colnames(income) <- c("age", "workclass", "fnlwgt", "education", "education.years", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country", "income.bracket")
Use the table() function to produce a contingency table of observation counts across marital status and sex.
# Edit me
The prop.table() function calculates a table of proportions from a table of counts. Read the documentation for this function to see how it works. Use prop.table() and your table from problem (a) to form a (column) proportions table. The Female column of the table should show the proportion of women in each marital status category. The Male column will show the same, but for men.
# Edit me
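As a small illustration of how table() and prop.table() fit together, the sketch below uses a made-up data frame (toy, with hypothetical columns status and sex) rather than the income data. The margin argument of prop.table() controls the direction of normalization: margin = 2 divides each entry by its column total.

```r
# Toy data (hypothetical), standing in for two categorical columns
toy <- data.frame(
  status = c("Married", "Single", "Single", "Married", "Single", "Married"),
  sex    = c("Female", "Female", "Male", "Male", "Male", "Female")
)

# Contingency table of counts: rows are status, columns are sex
counts <- table(toy$status, toy$sex)
counts

# Column proportions: every column of the result sums to 1
col.props <- prop.table(counts, margin = 2)
col.props
```

With margin = 1 the rows would sum to 1 instead, and with no margin the whole table would sum to 1.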
Replace this text with your answer. (do not delete the html tags)
Use the tapply() function to produce a table showing the average education (in years) across marital status and sex categories.
# Edit me
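For reference, here is a minimal sketch of tapply() with two grouping variables. It uses the built-in warpbreaks dataset instead of the income data, so the column names (breaks, wool, tension) belong to that dataset and not to this assignment.

```r
# Average number of breaks for every combination of wool type and tension.
# Passing a list of two factors as INDEX produces a two-way table.
avg.breaks <- tapply(warpbreaks$breaks,
                     INDEX = list(warpbreaks$wool, warpbreaks$tension),
                     FUN = mean)
avg.breaks
```

The first factor in the list indexes the rows of the result and the second indexes the columns.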
tapply() example (calculating Claims per Holder)
The MASS package contains a dataset called Insurance. Read the help file on this data set to understand its contents.
Use the tapply() function to produce a table showing the total number of Holders across District and Age. Save this table in a variable, and also display your answer.
# Edit me
Use the tapply() function to produce a table showing the total number of Claims across District and Age. Save this table in a variable, and also display your answer.
# Edit me
Use your answers from parts (a) and (b) to produce a table that shows the rate of Claims per Holder across District and Age.
# Edit me
Tip: If an insurance company has 120,000 policy holders and receives 14,000 claims, the rate of claims per holder is 14000/120000 = 0.117
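The arithmetic in the tip carries over to whole tables, because R divides two tables of the same shape element by element. The sketch below shows this on a made-up data frame (toy, with hypothetical columns region, group, holders and claims), not on the Insurance data.

```r
# Toy data (hypothetical): one row per region/group combination
toy <- data.frame(
  region  = c("North", "North", "South", "South"),
  group   = c("Young", "Old", "Young", "Old"),
  holders = c(120000, 80000, 50000, 100000),
  claims  = c(14000, 4000, 10000, 5000)
)

# Totals by region (rows) and group (columns); same INDEX, so same layout
holders.tab <- tapply(toy$holders, list(toy$region, toy$group), sum)
claims.tab  <- tapply(toy$claims,  list(toy$region, toy$group), sum)

# Element-wise division gives the rate of claims per holder in each cell
rate.tab <- claims.tab / holders.tab
round(rate.tab, 3)
```

The North/Young cell reproduces the 14000/120000 calculation from the tip.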
Replace this text with your answer. (do not delete the html tags)
This exercise will give you practice with two of the most common data cleaning tasks. For this problem we’ll use the survey_untidy.csv data set posted on the course website. Begin by importing this data into R. The url for the data set is shown below.
url: http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_untidy.csv
In Lecture 4 we looked at an example of cleaning up the TVhours column. The TVhours column of survey_untidy.csv has been corrupted in a similar way to what you saw in class.
Using the techniques you saw in class, make a new version of the untidy survey data where the TVhours column has been cleaned up. (Hint: you may need to handle some of the observations on a case-by-case basis)
# Edit me
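The exact corruption in TVhours and the technique from Lecture 4 are not reproduced here. As a generic sketch only, the code below assumes a numeric column that was read in as text with stray characters, and uses a made-up vector (tv.raw) in place of the real data. Treating "none" as 0 is also an assumption made purely for illustration.

```r
# Toy column (hypothetical): numbers stored as text, some with extra characters
tv.raw <- c("3", "2.5", "4hrs", " 1 ", "none")

# Remove everything except digits and the decimal point, then convert
tv.stripped <- gsub("[^0-9.]", "", tv.raw)
tv.clean <- suppressWarnings(as.numeric(tv.stripped))

# Entries with no digits become NA and need a case-by-case decision
tv.clean[tv.raw == "none"] <- 0
tv.clean
```

Checking which entries are NA after the conversion is a quick way to find the observations that need individual handling.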
This exercise picks up from Problem 3, and walks you through two different approaches to cleaning up the Program column.
Use the table or levels command on the Program column to figure out what went wrong with this column. Describe the problem in the space below.
# Write your code here
Description of the problem:
Replace this text with your answer. (do not delete the html tags)
mapvalues approach
Starting with the cleaned up data you produced in Problem 3, use the mapvalues and mutate functions to fix the Program column by mapping all of the lowercase and mixed case program names to upper case.
library(plyr)
library(dplyr)
# Edit me
toupper approach
The toupper function takes an array of character strings and converts all letters to uppercase.
Use toupper() and mutate to perform the same data cleaning task as in part (b).
# Edit me
Tip: The toupper() and tolower() functions are very useful in data cleaning tasks. You may want to start by running these functions even if you’ll have to do some more spot-cleaning later on.
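To see the effect of toupper() on inconsistent labels, here is a base R sketch on a made-up data frame (toy, with a hypothetical program column). The values are invented for illustration and are not taken from survey_untidy.csv.

```r
# Toy data (hypothetical): the same labels typed with inconsistent case
toy <- data.frame(program = c("PPM", "ppm", "Ppm", "MISM", "mism", "Other"),
                  stringsAsFactors = FALSE)

table(toy$program)   # six distinct labels before cleaning

# Convert every label to uppercase (base R assignment)
toy$program <- toupper(toy$program)

table(toy$program)   # three distinct labels after cleaning

# With dplyr loaded, the equivalent call would be:
#   toy <- mutate(toy, program = toupper(program))
```

Note that toupper() also changes labels that were already consistent, such as "Other" becoming "OTHER", which is the kind of side effect that may call for spot-cleaning afterwards.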
Write a function that calculates the mean of a numeric vector x, ignoring the s smallest and l largest values (this is a trimmed mean).
E.g., if x = c(1, 7, 3, 2, 5, 0.5, 9, 10), s = 1, and l = 2, your function would return the mean of c(1, 7, 3, 2, 5) (this is x with the 1 smallest value (0.5) and the 2 largest values (9, 10) removed).
Your function should use the length() function to check if x has at least s + l + 1 values. If x is shorter than s + l + 1, your function should use the message() function to tell the user that the vector can’t be trimmed as requested. If x is at least length s + l + 1, your function should return the trimmed mean.
# Here's a function skeleton to get you started
# Fill me in with an informative comment
# describing what the function does
trimmedMean <- function(x, s = 0, l = 0) {
# Write your code here
}
Hint: For this exercise it will be useful to recall the sort() function that you first saw in Lecture 1.
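One possible way to use sort() here, shown as loose steps on the worked example from the problem statement and not as a finished function: sort the vector, then keep only the positions between the s smallest and the l largest values. The variable names n, x.sorted and kept are hypothetical.

```r
x <- c(1, 7, 3, 2, 5, 0.5, 9, 10)
s <- 1
l <- 2
n <- length(x)

if (n < s + l + 1) {
  # Not enough values left after trimming
  message("x is too short to trim ", s, " smallest and ", l, " largest values")
} else {
  x.sorted <- sort(x)                 # ascending order
  kept <- x.sorted[(s + 1):(n - l)]   # drop s from the bottom, l from the top
  print(kept)
  print(mean(kept))
}
```

The kept values are the same numbers as c(1, 7, 3, 2, 5) from the example, just in sorted order, so the mean is unchanged.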
Note: The s = 0 and l = 0 specified in the function definition are the default settings. That is, this syntax ensures that if s and l are not provided by the user, they are both set to 0. Thus the default behaviour is that the trimmedMean function doesn’t trim anything, and hence is the same as the mean function.
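Default arguments work the same way in any R function. The sketch below uses a hypothetical function, shiftBy, that is unrelated to the homework and exists only to show a default being used and then overridden.

```r
# Hypothetical function: adds `offset` to x; offset defaults to 0
shiftBy <- function(x, offset = 0) {
  x + offset
}

shiftBy(1:3)               # offset not supplied, so the default 0 is used
shiftBy(1:3, offset = 10)  # default overridden by the caller
```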
set.seed(201802) # Sets seed to make sure everyone's random vectors are generated the same
list.random <- list(x = rnorm(50),
y = rexp(65),
z = rt(100, df = 1.5))
# Here's a Figure showing histograms of the data
par(mfrow = c(1,3))
hist(list.random$x, breaks = 15, col = 'grey')
hist(list.random$y, breaks = 10, col = 'forestgreen')
hist(list.random$z, breaks = 20, col = 'steelblue')
Using a for loop and your function from part (a), create a vector whose elements are the trimmed means of the vectors in list.random, taking s = 5 and l = 5.
# Edit me
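A common pattern for collecting one number per list element in a for loop is to allocate the result vector first and then fill it by position. The sketch below applies median() to a made-up list (toy.list); both the list and the choice of median() are placeholders for illustration.

```r
# Toy list (hypothetical) of numeric vectors with different lengths
toy.list <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Pre-allocate the result vector and carry over the element names
results <- numeric(length(toy.list))
names(results) <- names(toy.list)

# Fill in one entry per list element; [[i]] extracts the i-th vector
for (i in seq_along(toy.list)) {
  results[i] <- median(toy.list[[i]])
}
results
```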
# Edit me
Explanation:
Replace this text with your answer. (do not delete the html tags)
Repeat part (b), using the lapply and sapply functions instead of a for loop. Your lapply command should return a list of trimmed means, and your sapply command should return a vector of trimmed means.
# Edit me
Hint: lapply and sapply can take arguments that you wish to pass to the trimmedMean function. E.g., if you were applying the function sort, which has an argument decreasing, you could use the syntax lapply(..., FUN = sort, decreasing = TRUE).
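The hint can be tried out on a made-up list (toy.list, hypothetical) as a quick check of how extra arguments are forwarded. Here decreasing = TRUE is passed on to sort() and na.rm = TRUE is passed on to mean().

```r
# Toy list (hypothetical) with some missing values
toy.list <- list(a = c(3, 1, NA, 2), b = c(5, 4, NA))

# lapply returns a list; decreasing = TRUE is forwarded to sort()
sorted.list <- lapply(toy.list, FUN = sort, decreasing = TRUE)
sorted.list

# sapply simplifies the result to a named vector; na.rm = TRUE goes to mean()
means.vec <- sapply(toy.list, FUN = mean, na.rm = TRUE)
means.vec
```

Any argument that appears after FUN and is not an argument of lapply or sapply itself is handed to the applied function on every call.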