Prof. Alexandra Chouldechova
94842
library(MASS)
library(plyr)
library(dplyr)
library(tibble)
library(ggplot2)
We're going to start by operating on the birthwt dataset from the MASS library
Let's get it loaded and see what we're working with
str(birthwt)
'data.frame': 189 obs. of 10 variables:
$ low : int 0 0 0 0 0 0 0 0 0 0 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : int 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ ht : int 0 0 0 0 0 0 0 0 0 0 ...
$ ui : int 1 0 0 1 1 0 0 0 0 0 ...
$ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
birthwt <- as_tibble(birthwt)
birthwt
# A tibble: 189 × 10
low age lwt race smoke ptl ht ui ftv bwt
* <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 19 182 2 0 0 0 1 0 2523
2 0 33 155 3 0 0 0 0 3 2551
3 0 20 105 1 1 0 0 0 1 2557
4 0 21 108 1 1 0 0 1 2 2594
5 0 18 107 1 1 0 0 1 0 2600
6 0 21 124 3 0 0 0 0 0 2622
7 0 22 118 1 0 0 0 0 1 2637
8 0 17 103 3 0 0 0 0 1 2637
9 0 29 123 1 1 0 0 0 1 2663
10 0 26 113 1 1 0 0 0 0 2665
# ... with 179 more rows
The dataset doesn't come with very descriptive variable names
Let's get better column names (use help(birthwt) to understand the variables and come up with better names)
colnames(birthwt)
[1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui"
[9] "ftv" "bwt"
# The default names are not very descriptive
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
"mother.weight", "race", "mother.smokes",
"previous.prem.labor", "hypertension",
"uterine.irr", "physician.visits", "birthwt.grams")
# Better names!
All the factors are currently represented as integers
Let's use the mutate() and mapvalues() functions to convert variables to factors and give the factors more meaningful levels
birthwt <- mutate(birthwt,
race = as.factor(mapvalues(race, c(1, 2, 3),
c("white","black", "other"))),
mother.smokes = as.factor(mapvalues(mother.smokes,
c(0,1), c("no", "yes"))),
hypertension = as.factor(mapvalues(hypertension,
c(0,1), c("no", "yes"))),
uterine.irr = as.factor(mapvalues(uterine.irr,
c(0,1), c("no", "yes"))),
birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
c(0,1), c("no", "yes")))
)
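To see what the recoding step does on its own, here is a minimal sketch on a made-up vector (race.codes is hypothetical, not part of the dataset). It also shows a base R alternative, factor() with explicit levels and labels, which keeps the level order you specify, whereas as.factor() on the recoded strings sorts the levels alphabetically.

```r
library(plyr)  # provides mapvalues()

# Hypothetical toy vector using the same integer coding as the race column
race.codes <- c(2, 3, 1, 1, 3)

# Approach used above: recode the values, then convert to a factor
# (levels end up in alphabetical order)
race.a <- as.factor(mapvalues(race.codes, c(1, 2, 3),
                              c("white", "black", "other")))

# Base R alternative: factor() with explicit levels and labels
# (levels keep the order given in `labels`)
race.b <- factor(race.codes, levels = c(1, 2, 3),
                 labels = c("white", "black", "other"))

levels(race.a)  # "black" "other" "white"
levels(race.b)  # "white" "black" "other"

# Both approaches assign the same label to every observation
all(as.character(race.a) == as.character(race.b))  # TRUE
```

The level order matters mostly for how tables and plots are arranged, which is why summary() above lists race as black, other, white.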
summary(birthwt)
birthwt.below.2500 mother.age mother.weight race
no :130 Min. :14.00 Min. : 80.0 black:26
yes: 59 1st Qu.:19.00 1st Qu.:110.0 other:67
Median :23.00 Median :121.0 white:96
Mean :23.24 Mean :129.8
3rd Qu.:26.00 3rd Qu.:140.0
Max. :45.00 Max. :250.0
mother.smokes previous.prem.labor hypertension uterine.irr
no :115 Min. :0.0000 no :177 no :161
yes: 74 1st Qu.:0.0000 yes: 12 yes: 28
Median :0.0000
Mean :0.1958
3rd Qu.:0.0000
Max. :3.0000
physician.visits birthwt.grams
Min. :0.0000 Min. : 709
1st Qu.:0.0000 1st Qu.:2414
Median :0.0000 Median :2977
Mean :0.7937 Mean :2945
3rd Qu.:1.0000 3rd Qu.:3487
Max. :6.0000 Max. :4990
Let's use the tapply() function to see what the average birthweight looks like when broken down by race and smoking status
with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
no yes
black 2854.500 2504.000
other 2815.782 2757.167
white 3428.750 2826.846
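To make the mechanics of a two-way tapply() call easier to check by hand, here is a small sketch on made-up data (weight, group and flag are hypothetical names). Passing a list of two factors as INDEX gives back a matrix with one row per level of the first factor and one column per level of the second.

```r
# Hypothetical toy data: six values and two grouping factors
weight <- c(10, 20, 30, 40, 50, 60)
group  <- factor(c("a", "a", "b", "b", "a", "b"))
flag   <- factor(c("no", "yes", "no", "yes", "no", "yes"))

# Mean of weight within each (group, flag) combination
tbl <- tapply(weight, INDEX = list(group, flag), FUN = mean)
tbl

# Individual cells can be pulled out by level name
tbl["a", "no"]   # mean of 10 and 50, i.e. 30
tbl["b", "yes"]  # mean of 40 and 60, i.e. 50
```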
Let's use the chunk header {r, results='asis'}, along with the kable() function from the knitr library
library(knitr)
# Construct table (rounded to 0 decimal places)
tbl.round <- with(birthwt, round(tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean)))
# Print nicely
kable(tbl.round, format = "markdown")
| | no | yes |
|---|---|---|
| black | 2854 | 2504 |
| other | 2816 | 2757 |
| white | 3429 | 2827 |
kable() outputs the table in a way that Markdown can read and nicely display
Note: changing the CSS changes the table appearance
We've seen several examples of split-apply-combine in action
The basic principle to keep in mind is as follows:
Split a data set into pieces (e.g., according to some factor)
Apply a function to each piece (e.g., mean)
Combine all the pieces into a single output (e.g., a table or data frame)
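The three steps above can be spelled out one at a time in base R. This is only a sketch: it uses the original MASS::birthwt data frame with its default column names (bwt, race), so it runs in a fresh session regardless of any renaming done elsewhere.

```r
library(MASS)  # provides the birthwt data frame

# Split: one vector of birthweights per race code (1, 2, 3)
pieces <- split(MASS::birthwt$bwt, MASS::birthwt$race)

# Apply: compute the mean of each piece
piece.means <- lapply(pieces, mean)

# Combine: collapse the list into a single named vector
result <- unlist(piece.means)
result
```

Functions like tapply() do all three steps in one call, which is why the explicit version is rarely written out in practice.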
plyr introduces a family of functions of the form XYply, where X specifies the input type and Y specifies the output type
X/Y option | Meaning |
---|---|
a | array (e.g., vector or matrix) |
d | data.frame |
l | list |
_ | no output (valid only for Y, useful when plotting) |
Usage:
XYply(.data, .variables, .fun)
Effect: Take input of type X, and apply .fun to .data split up according to .variables, combining the answer into output of type Y
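To illustrate how the second letter changes the output type, here is a small sketch that runs the same grouped mean through ddply(), dlply() and daply(). The data frame df and the helper piece.mean are made up for the example.

```r
library(plyr)

# Hypothetical toy data frame: grouping variable g, numeric variable x
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))

# Helper applied to each piece of the split data frame
piece.mean <- function(piece) mean(piece$x)

# d -> d: data frame in, data frame out
out.d <- ddply(df, ~ g, summarize, mean.x = mean(x))

# d -> l: data frame in, list out (one element per group)
out.l <- dlply(df, ~ g, piece.mean)

# d -> a: data frame in, array out (here a named vector)
out.a <- daply(df, ~ g, piece.mean)

out.d
out.l
out.a
```

All three calls compute the same two numbers (2 for group a, 5 for group b); only the container they come back in differs.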
# ddply
ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))
race mean.bwt
1 black 2719.692
2 other 2805.284
3 white 3102.719
# Compare to tapply:
with(birthwt,
tapply(birthwt.grams, race, mean))
black other white
2719.692 2805.284 3102.719
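Since dplyr is also loaded, the same grouped summary can be sketched with group_by() and summarize(). This is offered only as a point of comparison, not as the approach these notes rely on. It uses the original MASS::birthwt column names (bwt, race) so that it runs on its own, and the dplyr:: prefixes sidestep the name clash with plyr's summarize().

```r
library(MASS)   # provides the birthwt data frame
library(dplyr)

# Group by race code, then compute the mean birthweight in each group
by.race <- dplyr::summarize(
  dplyr::group_by(MASS::birthwt, race),
  mean.bwt = mean(bwt)
)
by.race
```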
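# Store the per-race summary so it can be plotted
mean.bwt <- ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))
mean.bwt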
race mean.bwt
1 black 2719.692
2 other 2805.284
3 white 3102.719
ggplot(data = mean.bwt,
aes(x = race,
y = mean.bwt)) +
geom_bar(stat = "identity")
# Mean birthweight and proportion of low-birthweight babies, by smoking status
ddply(birthwt, ~ mother.smokes, summarize,
mean.bwt = mean(birthwt.grams),
low.bwt.prop = mean(birthwt.below.2500 == "yes"))
mother.smokes mean.bwt low.bwt.prop
1 no 3055.696 0.2521739
2 yes 2771.919 0.4054054
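The low.bwt.prop column above relies on a common R idiom: comparing a factor to "yes" gives a logical vector, and taking the mean of a logical vector gives the proportion of TRUE values. A minimal sketch on a made-up factor (low is hypothetical):

```r
# Hypothetical factor with two "yes" values out of five
low <- factor(c("no", "yes", "yes", "no", "no"))

# Comparison yields a logical vector: FALSE TRUE TRUE FALSE FALSE
is.low <- low == "yes"

sum(is.low)   # number of "yes" values: 2
mean(is.low)  # proportion of "yes" values: 0.4
```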
# Same idea, now split by both race and smoking status
ddply(birthwt, ~ race + mother.smokes, summarize,
mean.age = mean(mother.age),
mean.bwt = mean(birthwt.grams),
low.bwt.prop = mean(birthwt.below.2500 == "yes"))
race mother.smokes mean.age mean.bwt low.bwt.prop
1 black no 19.93750 2854.500 0.31250000
2 black yes 24.10000 2504.000 0.60000000
3 other no 22.36364 2815.782 0.36363636
4 other yes 22.50000 2757.167 0.41666667
5 white no 26.02273 3428.750 0.09090909
6 white yes 22.82692 2826.846 0.36538462
with(birthwt, cor(birthwt.grams, mother.age)) # Calculate correlation
[1] 0.09031781
# Correlation between birthweight and mother's age within each smoking group
ddply(birthwt, ~ mother.smokes, summarize,
cor.bwt.age = cor(birthwt.grams, mother.age))
mother.smokes cor.bwt.age
1 no 0.2014558
2 yes -0.1441649
We now know a lot about how to tabulate data
It's often easier to look at plots instead of tables
We'll now talk about some of the standard plotting options
Easier to do this in a live demo…
Please refer to .Rmd version of lecture notes for the graphics material