Lecture 5: plyr and introduction to graphics

Agenda

split-apply-combine (plyr)
Introduction to R graphics (ggplot2)

Packages

library(MASS)
library(plyr)
library(dplyr)
library(tibble)
library(ggplot2)

Getting started: birthwt dataset

We're going to start by operating on the birthwt dataset from the MASS package
Let's get it loaded and see what we're working with

str(birthwt)

'data.frame':   189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

tibbles

tibbles are nicer data frames
You may find it more convenient to work with tibbles instead of data frames
In particular, they have nicer and more informative default print settings

birthwt <- as_tibble(birthwt)
birthwt

# A tibble: 189 × 10
     low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt
*  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1      0    19   182     2     0     0     0     1     0  2523
2      0    33   155     3     0     0     0     0     3  2551
3      0    20   105     1     1     0     0     0     1  2557
4      0    21   108     1     1     0     0     1     2  2594
5      0    18   107     1     1     0     0     1     0  2600
6      0    21   124     3     0     0     0     0     0  2622
7      0    22   118     1     0     0     0     0     1  2637
8      0    17   103     3     0     0     0     0     1  2637
9      0    29   123     1     1     0     0     0     1  2663
10     0    26   113     1     1     0     0     0     0  2665
# ... with 179 more rows

Renaming the variables

The dataset doesn't come with very descriptive variable names
Let's get better column names (use help(birthwt) to understand the variables and come up with better names)

colnames(birthwt)

 [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
 [9] "ftv"   "bwt"

# The default names are not very descriptive

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", 
                       "mother.weight", "race", "mother.smokes", 
                       "previous.prem.labor", "hypertension", 
                       "uterine.irr", "physician.visits", "birthwt.grams")

# Better names!
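
Assigning to colnames() by position works, but it silently mislabels columns if their order ever changes. One possible alternative is to rename by name. The sketch below starts from a fresh copy of the data; the lookup vector name.map is a hypothetical helper introduced only for this illustration.

```r
# Sketch: rename columns by looking up each old name, not by position.
# `name.map` (old name -> new name) is an illustrative helper, not part
# of the original lecture code.
library(MASS)  # provides the original birthwt data frame

bw <- MASS::birthwt  # fresh copy with the original column names

name.map <- c(low = "birthwt.below.2500", age = "mother.age",
              lwt = "mother.weight", race = "race",
              smoke = "mother.smokes", ptl = "previous.prem.labor",
              ht = "hypertension", ui = "uterine.irr",
              ftv = "physician.visits", bwt = "birthwt.grams")

# Index the map by the current column names, then drop the old names
colnames(bw) <- unname(name.map[colnames(bw)])
colnames(bw)
```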

Renaming the factors

All the factors are currently represented as integers
Let's use the mutate() and mapvalues() functions to convert variables to factors and give the factors more meaningful levels

birthwt <- mutate(birthwt, 
          race = as.factor(mapvalues(race, c(1, 2, 3), 
                                     c("white","black", "other"))),
          mother.smokes = as.factor(mapvalues(mother.smokes, 
                                              c(0,1), c("no", "yes"))),
          hypertension = as.factor(mapvalues(hypertension, 
                                             c(0,1), c("no", "yes"))),
          uterine.irr = as.factor(mapvalues(uterine.irr, 
                                            c(0,1), c("no", "yes"))),
          birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
                                                   c(0,1), c("no", "yes")))
                  )
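
To see what mapvalues() does on its own, here is a small sketch on a made-up vector of codes (the values in race.codes are toy data, not taken from the dataset). It also shows base R's factor() with explicit levels and labels as one possible alternative when you want to control the order of the levels.

```r
library(plyr)  # provides mapvalues()

# Toy integer codes, invented for illustration
race.codes <- c(2, 3, 1, 1, 3)

# Replace each code in `from` with the matching label in `to`
race.labels <- mapvalues(race.codes, from = c(1, 2, 3),
                         to = c("white", "black", "other"))
race.labels

# as.factor() sorts the levels alphabetically
race.factor <- as.factor(race.labels)
levels(race.factor)

# Base R alternative: levels keep the order given in `labels`
race.factor2 <- factor(race.codes, levels = c(1, 2, 3),
                       labels = c("white", "black", "other"))
levels(race.factor2)
```

The alphabetical ordering from as.factor() is why summary() lists race as black, other, white.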

Summary of the data

Now that things are coded correctly, we can look at an overall summary

summary(birthwt)

 birthwt.below.2500   mother.age    mother.weight      race   
 no :130            Min.   :14.00   Min.   : 80.0   black:26  
 yes: 59            1st Qu.:19.00   1st Qu.:110.0   other:67  
                    Median :23.00   Median :121.0   white:96  
                    Mean   :23.24   Mean   :129.8             
                    3rd Qu.:26.00   3rd Qu.:140.0             
                    Max.   :45.00   Max.   :250.0             
 mother.smokes previous.prem.labor hypertension uterine.irr
 no :115       Min.   :0.0000      no :177      no :161    
 yes: 74       1st Qu.:0.0000      yes: 12      yes: 28    
               Median :0.0000                              
               Mean   :0.1958                              
               3rd Qu.:0.0000                              
               Max.   :3.0000                              
 physician.visits birthwt.grams 
 Min.   :0.0000   Min.   : 709  
 1st Qu.:0.0000   1st Qu.:2414  
 Median :0.0000   Median :2977  
 Mean   :0.7937   Mean   :2945  
 3rd Qu.:1.0000   3rd Qu.:3487  
 Max.   :6.0000   Max.   :4990

A simple table

Let's use the tapply() function to see what the average birthweight looks like when broken down by race and smoking status

with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))

            no      yes
black 2854.500 2504.000
other 2815.782 2757.167
white 3428.750 2826.846

Questions you should be asking yourself:
- Does smoking status appear to have an effect on birth weight?
- Does the effect of smoking status appear to be consistent across racial groups?
- What is the association between race and birth weight?

What if we wanted nicer looking output?

Let's use the chunk header {r, results='asis'}, along with the kable() function from the knitr package

library(knitr)
# Construct table (rounded to 0 decimal places)
tbl.round <- with(birthwt, round(tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean)))
# Print nicely
kable(tbl.round, format = "markdown")

	no	yes
black	2854	2504
other	2816	2757
white	3429	2827

kable() outputs the table in a way that Markdown can read and nicely display
Note: changing the CSS changes the table appearance
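
As a variation, the rounding can be left to kable() through its digits argument instead of calling round() first. This is a minimal sketch that recomputes the table from the original MASS::birthwt columns (race, smoke, bwt), so the row and column labels are the raw integer codes rather than the recoded factor levels.

```r
library(MASS)   # provides birthwt
library(knitr)  # provides kable()

# Mean birth weight by race (rows) and smoking status (columns),
# using the original integer-coded columns
tbl <- with(MASS::birthwt,
            tapply(bwt, INDEX = list(race, smoke), FUN = mean))

# Let kable() round to 0 decimal places when formatting
tbl.md <- kable(tbl, format = "markdown", digits = 0)
tbl.md
```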

plyr: more general split-apply-combine operations

We've seen several examples of split-apply-combine in action
The basic principle to keep in mind is as follows:

Split a data set into pieces (e.g., according to some factor)
Apply a function to each piece (e.g., mean)
Combine all the pieces into a single output (e.g., a table or data frame)

split-apply-combine: a simple example
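
The three steps can be sketched by hand in base R. The data frame toy below is made up for illustration; the point is only to make each step visible before plyr bundles them into a single call.

```r
# Toy data, invented for illustration
toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 3, 10, 20, 30))

# Split: one vector of values per group
pieces <- split(toy$value, toy$group)

# Apply: compute the mean of each piece
piece.means <- lapply(pieces, mean)

# Combine: collect the results into a single data frame
result <- data.frame(group = names(piece.means),
                     mean.value = unlist(piece.means),
                     row.names = NULL)
result
```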

What does plyr do?

plyr introduces a family of functions of the form XYply, where X specifies the input type and Y specifies the output type

X/Y option	Meaning
`a`	array (e.g., vector or matrix)
`d`	data.frame
`l`	list
`_`	no output (valid only for `Y`, useful when plotting)

Usage: XYply(.data, .variables, .fun)

Effect: Take input of type X, and apply .fun to .data split up according to .variables, combining the answer into output of type Y
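
To make the X/Y naming concrete, here is a small sketch that runs the same computation through ddply, dlply and daply. The input is always a data frame (X = d); only the output type changes. The data frame toy is made up for illustration.

```r
library(plyr)

# Toy data, invented for illustration
toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 3, 10, 20, 30))

# d -> d: data frame in, data frame out
out.df <- ddply(toy, ~ group, summarize, mean.value = mean(value))

# d -> l: data frame in, list out (one element per group)
out.list <- dlply(toy, ~ group, function(piece) mean(piece$value))

# d -> a: data frame in, array out (one entry per group)
out.array <- daply(toy, ~ group, function(piece) mean(piece$value))

out.df
out.list
out.array
```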

Example: Average birthweight by mother's race

# ddply
ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))

   race mean.bwt
1 black 2719.692
2 other 2805.284
3 white 3102.719

# Compare to tapply:
with(birthwt,
     tapply(birthwt.grams, race, mean))

   black    other    white 
2719.692 2805.284 3102.719

data frames are much nicer for plotting

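# Save the ddply output so it can be passed to ggplot
mean.bwt <- ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))
mean.bwt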

   race mean.bwt
1 black 2719.692
2 other 2805.284
3 white 3102.719

ggplot(data = mean.bwt, 
       aes(x = race, 
           y = mean.bwt)) + 
  geom_bar(stat = "identity")
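
A possible variation: geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"), and labs() gives the axes readable titles. The sketch below rebuilds the summary table from the values printed above so that it runs on its own; the name mean.bwt.tbl and the axis titles are illustrative choices.

```r
library(ggplot2)

# Summary table re-entered from the printed output above
mean.bwt.tbl <- data.frame(race = c("black", "other", "white"),
                           mean.bwt = c(2719.692, 2805.284, 3102.719))

# geom_col() draws bars whose heights are the y values themselves
p <- ggplot(data = mean.bwt.tbl, aes(x = race, y = mean.bwt)) +
  geom_col() +
  labs(x = "Mother's race", y = "Mean birth weight (grams)")
p
```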

[Figure: bar chart of mean.bwt by race]

Multiple calculated fields

Let's look at the average birthweight and proportion of babies with low birthweight broken down by smoking status

ddply(birthwt, ~ mother.smokes, summarize,
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))

  mother.smokes mean.bwt low.bwt.prop
1            no 3055.696    0.2521739
2           yes 2771.919    0.4054054

Multiple splitting factors

We can add additional splitting variables by including additional terms in the formula

ddply(birthwt, ~ race + mother.smokes, summarize,
      mean.age = mean(mother.age),
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))

   race mother.smokes mean.age mean.bwt low.bwt.prop
1 black            no 19.93750 2854.500   0.31250000
2 black           yes 24.10000 2504.000   0.60000000
3 other            no 22.36364 2815.782   0.36363636
4 other           yes 22.50000 2757.167   0.41666667
5 white            no 26.02273 3428.750   0.09090909
6 white           yes 22.82692 2826.846   0.36538462
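
The formula is one of several ways to name the splitting variables. As a sketch on made-up data (the data frame toy and its columns g1, g2 and value are invented for illustration), ddply also accepts the variables quoted with .() or given as a character vector, and all three forms should produce the same groups.

```r
library(plyr)

# Toy data, invented for illustration
toy <- data.frame(g1 = c("x", "x", "x", "y", "y", "y"),
                  g2 = c("p", "p", "q", "p", "q", "q"),
                  value = 1:6)

# Three ways to specify the same two splitting variables
by.formula <- ddply(toy, ~ g1 + g2, summarize, total = sum(value))
by.quoted  <- ddply(toy, .(g1, g2), summarize, total = sum(value))
by.names   <- ddply(toy, c("g1", "g2"), summarize, total = sum(value))

by.formula
```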

Example: Association between mother's age and birth weight?

Is the mother's age correlated with birth weight?

with(birthwt, cor(birthwt.grams, mother.age))  # Calculate correlation

[1] 0.09031781

Does the correlation vary with smoking status?
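- tapply isn't a natural fit here: as used above, it applies a function to groups of a single vector, while cor() needs two columns from each group. But ddply still works!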

ddply(birthwt, ~ mother.smokes, summarize,
      cor.bwt.age = cor(birthwt.grams, mother.age))

  mother.smokes cor.bwt.age
1            no   0.2014558
2           yes  -0.1441649
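
For comparison, the same per-group correlation can be sketched in base R by splitting the whole data frame, so each piece keeps both columns. This version uses the original MASS::birthwt column names (smoke, bwt, age) so that it runs on its own; the groups are therefore labelled 0 and 1 rather than no and yes.

```r
library(MASS)  # provides birthwt

bw <- MASS::birthwt

# Split: one data frame per smoking status (0 = no, 1 = yes)
pieces <- split(bw, bw$smoke)

# Apply + combine: correlation of birth weight and age within each piece
cor.by.smoke <- sapply(pieces, function(piece) cor(piece$bwt, piece$age))
cor.by.smoke
```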

Graphics in R

We now know a lot about how to tabulate data
It's often easier to look at plots instead of tables
We'll now talk about some of the standard plotting options
This is easier to do in a live demo…
Please refer to the .Rmd version of the lecture notes for the graphics material