Prof. Alexandra Chouldechova
94842
apply()
, lapply()
, sapply()
, tapply()
with()
to specify scopelibrary(plyr)
library(dplyr)
One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns
For a simple example, let's look at TVhours
column in a messy version of the survey data from Lecture 2
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE)
survey.messy$TVhours
[1] 2 15 incl movies 16 0
[5] 2 hours 5ish 0 4
[9] 0 0 14 7-Jun
[13] 10 15 4 4
[17] 10 7 33 (Netflix) 8
[21] 8 falkjklj 0 0
[25] 0 none 6 1
[29] 0 0 4
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
str(survey.messy)
'data.frame': 31 obs. of 6 variables:
$ Program : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
$ PriorExp : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
$ Rexperience : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
$ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
$ TVhours : Factor w/ 18 levels "0","1","10","14",..: 8 6 7 1 9 12 1 11 1 1 ...
$ Editor : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
Several of the entries have non-numeric values in them (they contain strings)
As a result, TVhours
is being imported as factor
survey.messy$TVhours
[1] 2 15 incl movies 16 0
[5] 2 hours 5ish 0 4
[9] 0 0 14 7-Jun
[13] 10 15 4 4
[17] 10 7 33 (Netflix) 8
[21] 8 falkjklj 0 0
[25] 0 none 6 1
[29] 0 0 4
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
tv.hours.messy <- survey.messy$TVhours
tv.hours.messy
[1] 2 15 incl movies 16 0
[5] 2 hours 5ish 0 4
[9] 0 0 14 7-Jun
[13] 10 15 4 4
[17] 10 7 33 (Netflix) 8
[21] 8 falkjklj 0 0
[25] 0 none 6 1
[29] 0 0 4
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
as.numeric(tv.hours.messy)
[1] 8 6 7 1 9 12 1 11 1 1 4 15 3 5 11 11 3 14 10 16 16 17 1
[24] 1 1 18 13 2 1 1 11
tv.hours.messy
as.numeric(tv.hours.messy)
[1] 2 15 incl movies 16 0
[5] 2 hours 5ish 0 4
[9] 0 0 14 7-Jun
[13] 10 15 4 4
[17] 10 7 33 (Netflix) 8
[21] 8 falkjklj 0 0
[25] 0 none 6 1
[29] 0 0 4
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
[1] 8 6 7 1 9 12 1 11 1 1 4 15 3 5 11 11 3 14 10 16 16 17 1
[24] 1 1 18 13 2 1 1 11
This just converted all the values into the integer-coded levels of the factor
Not what we wanted!
num.vec <- c(3.1, 2.5)
as.factor(num.vec)
[1] 3.1 2.5
Levels: 2.5 3.1
as.numeric(as.factor(num.vec))
[1] 2 1
as.numeric(as.character(as.factor(num.vec)))
[1] 3.1 2.5
If we take a number that's being coded as a factor and first turn it into a
character
string, then converting the string to a numeric gets back the number
as.character(tv.hours.messy)
[1] "2" "15 incl movies" "16" "0"
[5] "2 hours" "5ish" "0" "4"
[9] "0" "0" "14" "7-Jun"
[13] "10" "15" "4" "4"
[17] "10" "7" "33 (Netflix)" "8"
[21] "8" "falkjklj" "0" "0"
[25] "0" "none" "6" "1"
[29] "0" "0" "4"
as.numeric(as.character(tv.hours.messy))
[1] 2 NA 16 0 NA NA 0 4 0 0 14 NA 10 15 4 4 10 7 NA 8 8 NA 0
[24] 0 0 NA 6 1 0 0 4
typeof(as.numeric(as.character(tv.hours.messy))) # Success!! (Almost...)
[1] "double"
All the corrupted cells now appear as NA
, which is R's missing indicator
We can do a little better by cleaning up the vector once we get it to character
form
tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings
[1] "2" "15 incl movies" "16" "0"
[5] "2 hours" "5ish" "0" "4"
[9] "0" "0" "14" "7-Jun"
[13] "10" "15" "4" "4"
[17] "10" "7" "33 (Netflix)" "8"
[21] "8" "falkjklj" "0" "0"
[25] "0" "none" "6" "1"
[29] "0" "0" "4"
tv.hours.strings
[1] "2" "15 incl movies" "16" "0"
[5] "2 hours" "5ish" "0" "4"
[9] "0" "0" "14" "7-Jun"
[13] "10" "15" "4" "4"
[17] "10" "7" "33 (Netflix)" "8"
[21] "8" "falkjklj" "0" "0"
[25] "0" "none" "6" "1"
[29] "0" "0" "4"
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)
[1] "2" "15" "16" "0" "2" "5" "0" "4" "0" "0" "14" "7" "10" "15"
[15] "4" "4" "10" "7" "33" "8" "8" "" "0" "0" "0" "" "6" "1"
[29] "0" "0" "4"
tv.hours.messy[1:30]
[1] 2 15 incl movies 16 0
[5] 2 hours 5ish 0 4
[9] 0 0 14 7-Jun
[13] 10 15 4 4
[17] 10 7 33 (Netflix) 8
[21] 8 falkjklj 0 0
[25] 0 none 6 1
[29] 0 0
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean
[1] 2 15 16 0 2 5 0 4 0 0 14 7 10 15 4 4 10 7 33 8 8 NA 0
[24] 0 0 NA 6 1 0 0 4
NA
values should really be 0
.survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)
'data.frame': 31 obs. of 6 variables:
$ Program : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
$ PriorExp : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
$ Rexperience : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
$ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
$ TVhours : num 2 15 16 0 2 5 0 4 0 0 ...
$ Editor : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
stringsAsFactors = FALSE
when importing our data.survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
'data.frame': 31 obs. of 6 variables:
$ Program : chr "PPM" "PPM" "PPM" "Other" ...
$ PriorExp : chr "Some experience" "Never programmed before" "Some experience" "Some experience" ...
$ Rexperience : chr "Never used" "Never used" "Basic competence" "Basic competence" ...
$ OperatingSystem: chr "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
$ TVhours : chr "2" "15 incl movies" "16" "0" ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...
character
instead of a factor
TVhours
column and cast it to numeric all in one commandsurvey <- mutate(survey.messy,
TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
'data.frame': 31 obs. of 6 variables:
$ Program : chr "PPM" "PPM" "PPM" "Other" ...
$ PriorExp : chr "Some experience" "Never programmed before" "Some experience" "Some experience" ...
$ Rexperience : chr "Never used" "Never used" "Basic competence" "Basic competence" ...
$ OperatingSystem: chr "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
$ TVhours : num 2 15 16 0 2 5 0 4 0 0 ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...
table(survey[["Program"]])
MISM Other PPM
4 8 19
table(as.factor(survey[["Program"]]))
MISM Other PPM
4 8 19
# Figure out which columns are coded as characters
chr.indexes <- sapply(survey, FUN = is.character)
chr.indexes
Program PriorExp Rexperience OperatingSystem
TRUE TRUE TRUE TRUE
TVhours Editor
FALSE TRUE
# Re-code all of the character columns to factors
survey[chr.indexes] <- lapply(survey[chr.indexes], FUN = as.factor)
str(survey)
'data.frame': 31 obs. of 6 variables:
$ Program : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
$ PriorExp : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
$ Rexperience : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
$ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
$ TVhours : num 2 15 16 0 2 5 0 4 0 0 ...
$ Editor : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
On Homework 2 you'll learn how to wrangle with another common problem
When data is entered manually, misspellings and case changes are very common
E.g., a column showing treatment information may look like,
life.support <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(life.support)
dialysis Dialysis dyalysis nnone none None
3 1 1 1 2 1
ventilation Ventilation
1 1
summary(life.support)
dialysis Dialysis dyalysis nnone none None
3 1 1 1 2 1
ventilation Ventilation
1 1
This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There's a faster way to fix just the capitalization issue (this is an exercise on Homework 2)
These are all efficient ways of applying a function to margins of an array or elements of a list
Before we talk about the details of apply()
and its relatives, we should first understand loops
loops are ways of iterating over data
The apply()
functions can be thought of as good alternatives to loops
for(i in 1:4) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
phrase <- paste(phrase, word)
print(phrase)
}
[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"
A for loop executes a chunk of code for every value of an index variable in an index set
for(index.variable in index.set) {
code to be repeated at every value of index.variable
}
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
print(c(i, typeof(i)))
}
[1] "Michael" "character"
[1] "185" "double"
[1] "TRUE" "logical"
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
[,1] [,2] [,3] [,4] [,5]
[1,] -1.012349 1.6522842 1.627709 -1.0352775 -1.21955105
[2,] -1.426333 -0.7209012 1.163312 -0.4282115 -0.05492043
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
[1] 8.998670 -6.301823 -6.981004 -3.153687 12.620116
colSums(fake.data) # A better approach (see also colMeans())
[1] 8.998670 -6.301823 -6.981004 -3.153687 12.620116
day <- 1
num.days <- 365
while(day <= num.days) {
day <- day + 1
}
We won't really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career
Command | Description |
---|---|
apply(X, MARGIN, FUN) |
Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X |
lapply(X, FUN) |
Obtain a list by applying FUN to the elements of a list X |
sapply(X, FUN) |
Simplified version of lapply . Returns a vector/array instead of list. |
tapply(X, INDEX, FUN) |
Obtain a table by applying FUN to each combination of the factors given in INDEX |
These functions are (good!) alternatives to loops
They are typically more efficient than loops (often run considerably faster on large data sets)
Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
You can always type example(function)
to get code examples (E.g., example(apply)
)
colMeans(fake.data)
[1] 0.08998670 -0.06301823 -0.06981004 -0.03153687 0.12620116
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
[1] 0.08998670 -0.06301823 -0.06981004 -0.03153687 0.12620116
# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
[1] 0.53 0.45 0.46 0.49 0.52
lapply(survey, is.factor) # Returns a list
$Program
[1] TRUE
$PriorExp
[1] TRUE
$Rexperience
[1] TRUE
$OperatingSystem
[1] TRUE
$TVhours
[1] FALSE
$Editor
[1] TRUE
sapply(survey, FUN = is.factor) # Returns a vector with named elements
Program PriorExp Rexperience OperatingSystem
TRUE TRUE TRUE TRUE
TVhours Editor
FALSE TRUE
apply(cars, 2, FUN=mean) # Data frames are arrays
speed dist
15.40 42.98
lapply(cars, FUN=mean) # Data frames are also lists
$speed
[1] 15.4
$dist
[1] 42.98
sapply(cars, FUN=mean) # sapply() is just simplified lapply()
speed dist
15.40 42.98
tapply()
as a generalized form of the table()
functionlibrary(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)
4WD Front Rear
USA 5 34 9
non-USA 5 33 7
# Calculate average MPG.City, broken down by Origin and Drivetrain
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean)
DriveTrain
Origin 4WD Front Rear
USA 17.6 22.14706 18.33333
non-USA 23.4 24.93939 19.14286
Origin
and Type
tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)
Type
Origin Compact Large Midsize Small Sporty Van
USA 117.4286 179.4545 153.5000 89.42857 166.5000 158.40
non-USA 141.5556 NA 189.4167 91.78571 151.6667 138.25
NA
doing there?any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
[1] FALSE
tapply()
to produce the same output as the table()
functionlibrary(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)
4WD Front Rear
USA 5 34 9
non-USA 5 33 7
# This one may take a moment to figure out...
tapply(rep(1, nrow(Cars93)), INDEX = Cars93[c("Origin", "DriveTrain")], FUN=sum)
DriveTrain
Origin 4WD Front Rear
USA 5 34 9
non-USA 5 33 7
Thus far we've repeatedly typed out the data frame name when referencing its columns
This is because the data variables don't exist in our working environment
Using with(data, expr)
lets us specify that the code in expr
should be evaluated in an environment that contains the elements of data
as variables
with(Cars93, table(Origin, Type))
Type
Origin Compact Large Midsize Small Sporty Van
USA 7 11 10 7 8 5
non-USA 9 0 12 14 6 4
any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
[1] FALSE
with(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!
[1] FALSE
with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN=mean))
Compact Large Midsize Small Sporty Van
USA 117.4286 179.4545 153.5000 89.42857 166.5000 158.40
non-USA 141.5556 NA 189.4167 91.78571 151.6667 138.25
Homework 2 will be posted later today
Lab 4 link is available on the course website