- Wrap up of Lecture 2 content
- More on data frames
- Basic tidyverse (dplyr) commands
- Lists
- Writing functions in R
- If-else statements
- R coding style
Fall 2020
Let’s go back to where we left off in the lecture 2 slides.
We’ll mainly be using dplyr, but we’ll load all of tidyverse anyway
library(tidyverse)
Rather than loading the full MASS library, we’ll use the :: syntax to pull a specific object/function from the library
Loading all of MASS with library(MASS) after tidyverse is loaded has the unintended consequence of masking the dplyr select() function with the MASS select() function. This is BAD, and leads to errors.
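As a minimal sketch of the :: approach described above (assuming the dplyr and MASS packages are installed; the names cars and price.cols are made up for illustration), a package-qualified call always reaches the intended function, whatever else has been attached:

```r
library(dplyr)

# MASS::Cars93 fetches the data set without attaching MASS,
# so dplyr's select() is never masked
cars <- MASS::Cars93

# Naming the package explicitly also works if MASS has been attached
price.cols <- dplyr::select(cars, Manufacturer, Model, Price)
head(price.cols, 3)
```

Writing dplyr::select() is more typing, but it removes any dependence on the order in which libraries were loaded.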
Cars93 <- MASS::Cars93
head(Cars93, 3)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3         Audi      90 Compact      25.9  29.1      32.3       20
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
The mutate() function from dplyr
mutate() returns a new data frame with columns modified or added as specified by the function call
Cars93.metric <- mutate(Cars93, KMPL.city = 0.425 * MPG.city, KMPL.highway = 0.425 * MPG.highway)
tail(names(Cars93.metric))
## [1] "Luggage.room" "Weight"       "Origin"       "Make"
## [5] "KMPL.city"    "KMPL.highway"
# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
## [1] "Weight"       "Origin"       "Make"         "KMPL.city"
## [5] "KMPL.highway" "KMPL.city.2"
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
## [1] TRUE
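One convenience of mutate() over the $ approach is that a column created earlier in the same call can be used right away. The sketch below illustrates this on a small made-up data frame (trips, miles and gallons are hypothetical and not part of the lecture data):

```r
library(dplyr)

# Hypothetical toy data: distance driven and fuel used on two trips
trips <- data.frame(miles = c(120, 45), gallons = c(4, 1.5))

# mpg is created first, then reused to compute kmpl in the same call
trips.metric <- mutate(trips,
                       mpg = miles / gallons,
                       kmpl = 0.425 * mpg)
trips.metric

# The original data frame is left unchanged
names(trips)
```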
recode()
manufacturer <- Cars93$Manufacturer
head(manufacturer, 8)
## [1] Acura Acura Audi  Audi  BMW   Buick Buick Buick
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo
We’ll use the recode() function from the dplyr library, which gets loaded when you load tidyverse.
# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- recode(manufacturer, "Chevrolet" = "GM", "Pontiac" = "GM", "Buick" = "GM")
head(manufacturer.combined, 8)
## [1] Acura Acura Audi  Audi  BMW   GM    GM    GM
## 30 Levels: Acura Audi BMW GM Cadillac Chrylser Chrysler Dodge ... Volvo
recode_factor()
A lot of data comes with integer encodings of levels
You may want to convert the integers to more meaningful values for the purpose of your analysis
Let’s pretend that in the class survey ‘Program’ was coded as an integer with 1 = MISM, 2 = Other, 3 = PPM
# Load data
survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv", header=TRUE, sep=",")
# Recode Program to have integer codings
# (factor() first, so this also works in R >= 4.0, where strings are read in as character)
survey <- mutate(survey, Program=as.numeric(factor(Program)))
head(survey)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1       3         Some experience       Never used         Windows    10.5
## 2       2    Extensive experience Basic competence        Mac OS X     3.0
## 3       1 Never programmed before Basic competence         Windows     0.0
## 4       3 Never programmed before       Never used         Windows    10.0
## 5       3 Never programmed before       Never used         Windows     4.0
## 6       3         Some experience Basic competence        Mac OS X     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
## 4          Excel
## 5 Microsoft Word
## 6 Microsoft Word
We’ll use recode_factor(), a variant of recode() that returns a factor, with levels ordered according to the mapping order.
Note the backticks around the numbers, which are necessary for parsing
survey <- mutate(survey, Program = recode_factor(Program, `3` = "PPM", `1` = "MISM", `2` = "Other"))
head(survey)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
## 4     PPM Never programmed before       Never used         Windows    10.0
## 5     PPM Never programmed before       Never used         Windows     4.0
## 6     PPM         Some experience Basic competence        Mac OS X     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
## 4          Excel
## 5 Microsoft Word
## 6 Microsoft Word
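To see the level ordering more directly, here is a small sketch on a made-up vector of integer codes (codes and program are illustrative names, not part of the survey data). The levels should come out in the order the mappings are written, not in numeric or alphabetical order:

```r
library(dplyr)

# Hypothetical integer codes, using the same coding as the survey example
codes <- c(3, 2, 1, 3)

program <- recode_factor(codes, `3` = "PPM", `1` = "MISM", `2` = "Other")
program

# Levels follow the mapping order: PPM, then MISM, then Other
levels(program)
```

The level order matters later on, because it determines the order in which categories appear in tables and plots.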
The table() function
Let’s revisit the Cars93 dataset
The table() function builds contingency tables (i.e., count tables) showing counts at each combination of factor levels
table(Cars93$AirBags)
##
## Driver & Passenger        Driver only               None
##                 16                 43                 34
table(Cars93$Origin)
##
##     USA non-USA
##      48      45
table(Cars93$AirBags, Cars93$Origin)
##
##                      USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18
Looks like US and non-US cars had about the same distribution of AirBag types
Later in the class we’ll learn how to run hypothesis tests on this kind of data
When table() is supplied a data frame, it produces contingency tables for all combinations of factors
head(Cars93[c("AirBags", "Origin")], 3)
##              AirBags  Origin
## 1               None non-USA
## 2 Driver & Passenger non-USA
## 3        Driver only non-USA
table(Cars93[c("AirBags", "Origin")])
##                     Origin
## AirBags              USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18
count()
If we’re going to be plotting or further analysing our results, it is helpful to have them in a data frame instead of a tabular layout. That’s where the count() function comes in.
Cars93 %>% count(AirBags)
## # A tibble: 3 x 2
##   AirBags                n
##   <fct>              <int>
## 1 Driver & Passenger    16
## 2 Driver only           43
## 3 None                  34
Cars93 %>% count(AirBags, Origin)
## # A tibble: 6 x 3
##   AirBags            Origin      n
##   <fct>              <fct>   <int>
## 1 Driver & Passenger USA         9
## 2 Driver & Passenger non-USA     7
## 3 Driver only        USA        23
## 4 Driver only        non-USA    20
## 5 None               USA        16
## 6 None               non-USA    18
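It can help to think of count() as shorthand for grouping and then counting rows. The sketch below compares the two, assuming dplyr 1.0 or later (for the .groups argument) and using made-up names (cars, by.count, by.group):

```r
library(dplyr)

cars <- MASS::Cars93

# count() in one step
by.count <- cars %>% count(AirBags, Origin)

# The same counts, spelled out with group_by() and summarize()
by.group <- cars %>%
  group_by(AirBags, Origin) %>%
  summarize(n = n(), .groups = "drop")

# Both approaches give one row per combination with the same counts
all.equal(by.count$n, by.group$n)
```

Either result is an ordinary data frame, so it can be passed straight to further dplyr commands or plotting functions.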
A list is a data structure that can be used to store different kinds of data
Recall: a vector is a data structure for storing similar kinds of data
To better understand the difference, consider the following example.
my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1
## [1] "Michael" "165"     "TRUE"
typeof(my.vector.1) # All the elements are now character strings!
## [1] "character"
my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
typeof(my.vector.2)
## [1] "double"
Vectors expect elements to be all of the same type (e.g., Boolean, numeric, character)
When data of different types are put into a vector, R converts everything to a common type
To store data of different types in the same object, we use lists
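The conversion to a common type follows a fixed order: logical values become integers, integers become doubles, and anything becomes a character string if a string is present. A short sketch (coerced.types and mixed are illustrative names):

```r
# Each call combines two different types and reports the resulting type
coerced.types <- c(
  typeof(c(TRUE, 2L)),    # logical + integer
  typeof(c(2L, 3.5)),     # integer + double
  typeof(c(3.5, "a"))     # double + character
)
coerced.types

# Logical values turn into 0 (FALSE) and 1 (TRUE) when mixed with numbers
mixed <- c(FALSE, TRUE, 27)
mixed
```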
Simple way to construct lists: use the list() function
(We’ll learn about functions like map and map_chr soon.)
my.list <- list("Michael", 165, TRUE)
my.list
## [[1]]
## [1] "Michael"
##
## [[2]]
## [1] 165
##
## [[3]]
## [1] TRUE
map_chr(my.list, typeof)
## [1] "character" "double" "logical"
patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
## $name
## [1] "Michael"
##
## $weight
## [1] 165
##
## $is.male
## [1] TRUE
patient.1$name # Get "name" element (returns a string)
## [1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
## [1] "Michael"
patient.1["name"] # Get "name" slice (returns a list)
## $name
## [1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
## [1] "character" "list"
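The difference between [[ ]] and [ ] matters as soon as you try to compute with the result. A minimal sketch, reusing the patient.1 list from above (weight.kg, slice.result and name.and.weight are illustrative names):

```r
patient.1 <- list(name = "Michael", weight = 165, is.male = TRUE)

# [[ ]] extracts the element itself, so arithmetic works
weight.kg <- patient.1[["weight"]] / 2.2

# [ ] returns a list, and dividing a list by a number is an error;
# tryCatch() is used here only so the example keeps running
slice.result <- tryCatch(patient.1["weight"] / 2.2,
                         error = function(e) "error")
slice.result

# [ ] is the right tool for keeping several elements at once
name.and.weight <- patient.1[c("name", "weight")]
names(name.and.weight)
```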
We have used a lot of built-in functions: mean(), subset(), plot(), read.table()…
An important part of programming and data analysis is to write custom functions
Functions help make code modular
Functions make debugging easier
Remember: this entire class is about applying functions to data
A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.
addOne <- function(x) { x + 1 }
x is the argument or input
The function output is the input x incremented by 1
addOne(12)
## [1] 13
calculatePercentage <- function(x, y, d) {
  decimal <- x / y          # Calculate decimal value
  round(100 * decimal, d)   # Convert to % and round to d digits
}
calculatePercentage(27, 80, 1)
## [1] 33.8
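As a variation on the function above (not the version used in the lecture), an argument can be given a default value so that callers may leave it out. The sketch below assumes a default of d = 1, which is a choice made here purely for illustration:

```r
# Variant with a default: d = 1 is used unless the caller supplies another value
calculatePercentage <- function(x, y, d = 1) {
  decimal <- x / y
  round(100 * decimal, d)
}

calculatePercentage(1, 3)          # d falls back to its default of 1
calculatePercentage(1, 3, d = 0)   # the default can be overridden
calculatePercentage(y = 3, x = 1)  # named arguments can be given in any order
```

Naming arguments in the call makes longer function calls easier to read and guards against passing values in the wrong order.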
createPatientRecord <- function(full.name, weight, height) {
  # weight is given in pounds, height in inches
  name.list <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.list[1]
  last.name <- name.list[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m, bmi=bmi)
}
createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
## $first.name
## [1] "Michael"
##
## $last.name
## [1] "Smith"
##
## $weight
## [1] 84.09091
##
## $height
## [1] 1.8542
##
## $bmi
## [1] 24.45884
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
##     mean   median       sd
## 5.296375 5.361622 2.081283
Oftentimes we want our code to have different effects depending on the features of the input
To code this up, we use if-else statements
calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}
course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"
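The reason map_chr() is needed above is that if() expects a single TRUE or FALSE, so it cannot be applied to a whole vector of grades at once (recent versions of R raise an error in that case, while older versions only warned). One alternative, sketched here with dplyr's case_when() and a hypothetical function name calculateLetterGradeVec, is to write the rule in a form that works on entire vectors:

```r
library(dplyr)

# Vectorized version of the grading rule: conditions are checked in order,
# and the first one that is TRUE determines the grade
calculateLetterGradeVec <- function(x) {
  case_when(
    x >= 90 ~ "A",
    x >= 80 ~ "B",
    x >= 70 ~ "C",
    TRUE    ~ "F"
  )
}

course.grades <- c(92, 78, 87, 91, 62)
letter.grades <- calculateLetterGradeVec(course.grades)
letter.grades
```

Both approaches give the same grades here. The if-else version is often easier to read for a single value, while the vectorized version avoids the explicit map_chr() call.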
return()
In the previous examples we specified the output simply by writing the output variable as the last line of the function
More explicitly, we can use the return() function
addOne <- function(x) {
  return(x + 1)
}
addOne(12)
## [1] 13
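One situation where return() is genuinely useful is leaving a function early, before its last line is reached. The function below is a made-up illustration (safeDivide is not part of the lecture code):

```r
# Hypothetical helper: returns NA instead of dividing by zero
safeDivide <- function(x, y) {
  if (y == 0) {
    return(NA_real_)   # exit immediately; the line below is never run
  }
  x / y
}

safeDivide(6, 3)
safeDivide(1, 0)
```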
Note that you don’t need to use the return() function, but you can use it if necessary or if it makes writing a particular function easier.
Let’s return to the last few slides of Lecture 2
Homework 1 due 1:30PM ET on Wednesday
Lab 3 is posted
If you have questions, feel free to post on the Piazza Discussion Forum or attend office hours