This lecture continues our introduction to the ggplot2
package developed by Hadley Wickham.
Let’s begin by loading the packages we’ll use this class
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(MASS) # Useful for data sets
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
ggplot2 has a slightly steeper learning curve than the base graphics functions, but it also generally produces far better and more easily customizable graphics.
There are two basic calls in ggplot:
qplot(x, y, ..., data)
: a “quick-plot” routine, which essentially replaces the base plot()
ggplot(data, aes(x, y, ...), ...)
: defines a graphics object from which plots can be generated, along with aesthetic mappings that specify how variables are mapped to visual properties.The ggplot2
library comes with a dataset called diamonds
. Let’s look at it
dim(diamonds)
## [1] 53940 10
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
It is a data frame of 53,940 diamonds, recording their attributes such as carat, cut, color, clarity, and price.
We will make a scatterplot showing the price as a function of the carat (size). (The data set is large so the plot may take a few moments to generate.)
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price))
diamond.plot + geom_point()
Let’s take a step back and try to understand the ggplot syntax.
The first thing we did was to define a graphics object, diamond.plot
. This definition told R that we’re using the diamonds
data, and that we want to display carat
on the x-axis, and price
on the y-axis.
We then called diamond.plot + geom_point()
to get a scatterplot.
The arguments passed to aes()
are called mappings. Mappings specify what variables are used for what purpose. When you use geom_point()
in the second line, it pulls x
, y
, colour
, size
, etc., from the mappings specified in the ggplot()
command.
You can also specify some arguments to geom_point
directly if you want to specify them for each plot separately instead of pre-specifying a default.
Here we shrink the points to a smaller size, and use the alpha
argument to make the points transparent.
diamond.plot + geom_point(size = 0.7, alpha = 0.3)
If we wanted to let point color depend on the color indicator of the diamond, we could do so in the following way.
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color))
diamond.plot + geom_point()
To make the scatterplot look more typical, we can switch to logarithmic coordinate axis spacing.
diamond.plot + geom_point() +
coord_trans(x = "log10", y = "log10")
We can create plots showing the relationship between variables across different values of a factor. For instance, here’s a scatterplot showing how diamond price varies with carat size, conditioned on color. It’s created using the facet_wrap(~ factor1 + factor2 + ... + factorn)
command.
diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color))
diamond.plot + geom_point() + facet_wrap(~ cut) +
coord_trans(x = "log10", y = "log10")
You can also use facet_grid()
to produce this type of output.
diamond.plot + geom_point() + facet_grid(. ~ cut) +
coord_trans(x = "log10", y = "log10")
diamond.plot + geom_point() + facet_grid(cut ~ .) +
coord_trans(x = "log10", y = "log10")
ggplot
can create a lot of different kinds of plots, just like lattice. Here are some examples.
Function | Description |
---|---|
geom_point(...) |
Points, i.e., scatterplot |
geom_bar(...) |
Bar chart |
geom_line(...) |
Line chart |
geom_boxplot(...) |
Boxplot |
geom_violin(...) |
Violin plot |
geom_density(...) |
Density plot with one variable |
geom_density2d(...) |
Density plot with two variables |
geom_histogram(...) |
Histogram |
We’ll use the birthwt
data for this example, so let’s start by loading an cleaning it. All of this code is borrowed from Lecture 5.
birthwt <- as_tibble(MASS::birthwt)
birthwt <- birthwt %>%
rename(birthwt.below.2500 = low,
mother.age = age,
mother.weight = lwt,
mother.smokes = smoke,
previous.prem.labor = ptl,
hypertension = ht,
uterine.irr = ui,
physician.visits = ftv,
birthwt.grams = bwt)
# Assign better labels to categorical variables
birthwt <- birthwt %>%
mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
~ recode_factor(.x, `0` = "no", `1` = "yes"))
Previously we calculated the following table:
bwt.summary <- birthwt %>%
group_by(race, mother.smokes) %>%
summarize(mean.birthwt = round(mean(birthwt.grams), 0))
bwt.summary
## # A tibble: 6 x 3
## # Groups: race [3]
## race mother.smokes mean.birthwt
## <fct> <fct> <dbl>
## 1 white no 3429
## 2 white yes 2827
## 3 black no 2854
## 4 black yes 2504
## 5 other no 2816
## 6 other yes 2757
We can plot this table in a nice bar chart as follows:
# Define basic aesthetic parameters
p.bwt <- ggplot(data = bwt.summary,
aes(y = mean.birthwt, x = race, fill = mother.smokes))
# Pick colors for the bars
bwt.colors <- c("#009E73", "#999999")
# Display barchart
p.bwt + geom_bar(stat = "identity", position = "dodge") +
ylab("Average birthweight") +
xlab("Mother's race") +
guides(fill = guide_legend(title = "Mother's smoking status")) +
scale_fill_manual(values=bwt.colors)
base.plot <- ggplot(birthwt, aes(x = mother.age)) +
xlab("Mother's age")
base.plot + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
base.plot + geom_histogram(aes(fill = race))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
base.plot + geom_density()
base.plot + geom_density(aes(fill = race), alpha = 0.5)
base.plot <- ggplot(birthwt, aes(x = as.factor(physician.visits), y = birthwt.grams)) +
xlab("Number of first trimester physician visits") +
ylab("Baby's birthweight (grams)")
# Box plot
base.plot + geom_boxplot()
# Violin plot
base.plot + geom_violin()
One of the main advantages of ggplot
is that it makes it really easy to overlay model fits, confidence bars, etc. We’ll get more practice with this next week, when we talk more about how to do statistical inference in R.
For the time being, let’s just display a scatterplot smoother.
diamond.plot + stat_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This shows curves modelling the relationship between diamond size in carats and price for the different diamond colours.
We could use the geom_point()
command to also display the underlying points, but that would make the curves difficult to see. Let’s try it anyway.
diamond.plot + geom_point() + stat_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can make this look better by decreasing the point opacity so that the trend curves aren’t so obscured.
diamond.plot + geom_point(size = 1.5, alpha = 0.25) + stat_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
In addition to the more flexible smoother in the example above, we can also overlay more structured summaries such as regression curves.
Here’s the plot that we saw at the end of Lecture 5, which makes good use of regression line overlays.
ggplot(birthwt, aes(x=mother.age, y=birthwt.grams, shape=mother.smokes, color=mother.smokes)) +
geom_point() + # Adds points (scatterplot)
geom_smooth(method = "lm") + # Adds regression lines
ylab("Birth Weight (grams)") + # Changes y-axis label
xlab("Mother's Age (years)") + # Changes x-axis label
ggtitle("Birth Weight by Mother's Age") # Changes plot title
Recall that when we calculated correlations between birth weight and mother’s age, we got the following results.
birthwt %>%
group_by(mother.smokes) %>%
summarize(cor_bwt_age = cor(birthwt.grams, mother.age))
## # A tibble: 2 x 2
## mother.smokes cor_bwt_age
## <fct> <dbl>
## 1 no 0.201
## 2 yes -0.144
This tells us that the association between mother’s age and the baby’s birthweight seems to depend on the mother’s smoking status. Among mothers that smoke, this association is negative (older mothers tend to give birth to lower weight babies), while among mothers that don’t smoke, the association is positive.
We can read off the same conclusions more simply from the plot. The regression lines capture the association between birthweight and mother’s age. The fact that the observed points are very spread out around the regression lines visually conveys the fact that the correlation between birthweight and mother’s age is not very large.
The plot shows us something that the correlation calculation cannot: there’s a non-smoking mother who was 45 years old when she gave birth, and the combination of her age and the baby’s weight (around 5000g) are well outside the range of all other data points. Let’s see what happens when we remove this observation and re-plot.
# Get new data set that no longer contains the outlier
birthwt.sub <- subset(birthwt, subset = mother.age < 40)
ggplot(birthwt.sub, aes(x=mother.age, y=birthwt.grams, shape=mother.smokes, color=mother.smokes)) +
geom_point() + # Adds points (scatterplot)
geom_smooth(method = "lm") + # Adds regression lines
ylab("Birth Weight (grams)") + # Changes y-axis label
xlab("Mother's Age (years)") + # Changes x-axis label
ggtitle("Birth Weight by Mother's Age") # Changes plot title
The general trends appear to be qualitatively the same, but the association in the non-smoking group appears to be far less strong than we were led to believe when we still had the outlying point in our data.
For this example we’re going to use some data based on the 2011 Consumer Expenditure (CE) Survey. This data is stored in a tab-delimited file called mtbi.txt
.
expenses <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/mtbi.txt",
sep="\t", header=TRUE)
dim(expenses)
## [1] 579 4
head(expenses, 20)
## ucc count avg.cost descr
## 480 790240 14418 335.99369 AVG FOOD/NONALC BEV EXPENSES
## 479 790210 11865 470.72061 REGULAR GROC SHOPPING INCL GOODS
## 291 470111 10779 221.60729 GASOLINE
## 482 790410 9309 192.62423 DINING OUT AT REST., ETC EXCL ALC
## 93 270310 8739 72.29626 CABLE/SATELLITE/COM ANTENNA SERV
## 86 270102 8575 95.50787 CELLULAR PHONE SERVICE
## 460 690114 7903 38.72745 COMPUTER INFORMATION SERVICES
## 28 220211 7740 229.23689 PROPERTY TAXES OWND
## 571 910050 7725 108.06803 RENTAL EQUIVALENCE OF OWNED HOME
## 575 910104 7680 1281.71706 CPI ADJ RENT EQUIV OF OWNED HOME
## 501 800721 7677 19309.76019 MARKET VALUE OF OWNED HOME
## 78 260112 7528 136.53892 ELECTRICITY OWND
## 85 270101 7180 53.18663 RESIDENTIAL TELEPHONES/PAY PHONES
## 434 650310 7008 39.88927 PERS. CARE SERV.
## 90 270212 6735 44.15650 WATER AND SEWERAGE MAINT OWND
## 30 220311 4916 612.51098 MORTGAGE INTEREST OWND
## 520 830201 4835 -325.53485 REDUCTION MORTGAGE PRINC OWND
## 348 540000 4485 78.87893 PRESCRIPTION DRUGS
## 82 260212 4308 81.45265 UTILITY--NATURAL GAS OWND
## 528 850100 4212 -331.67379 REDUCTION PRINC VEH LOAN
The data is ordered from highest to lowest count (the count is then number of times that an item in the given ucc category was purchased). From head()
we can see that the 20 most common expenses are things like food/drink, gas, cell phone bills, mortgage payments, etc. This is what we’d expect.
Now, let’s create a dot plot showing the counts for each of the top 20 expenses.
expense.plot <- ggplot(data = expenses[1:20, ], mapping = aes(y = descr, x = count))
expense.plot + geom_point()
This doesn’t look very good… What happened?
Well, R
doesn’t know that the count
variable is important for the purpose of ordering. The default behaviour is to show the plot with factor levels ordered alphabetically. This makes sense in many cases, but not in ours.
To re-order the levels of a factor, we use the reorder()
function. Here’s how we re-order the levels of descr
based on the value of count
.
expenses <- mutate(expenses, descr = reorder(descr, count))
Now let’s try that plotting function again…
expense.plot <- ggplot(data = expenses[1:20, ], mapping = aes(y = descr, x = count))
expense.plot + geom_point()
This kind of data is typically represented with bar charts instead of dot plots. Let’s create a bar chart to capture the same information.
barchart.fig = ggplot(data = expenses[1:20, ], mapping = aes(x = descr, y = count))
barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3])
There’s an issue with this plot: the labels along the x-axis have all blended together and are incomprehensible.
To adjust the text, we can use the theme
command.
barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3]) +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
We should probably also change the axis labels and title to something more meaningful
barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3]) +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
labs(x = "", y = "Times reported", title = "Most Commonly Reported Purchases")
We’ll need the maps
library for this example.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
One of the data sets that comes with R is the USArrests
data
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Here’s how we can get a headmap of Murder rates (per 100,000 population) on a map of the US.
# Create data frame for map data (US states)
states <- map_data("state")
# Here's what the states data frame looks like
str(states)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
# Make a copy of the data frame to manipulate
arrests <- USArrests
# Convert everything to lower case
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))
# Merge the map data with the arrests data based on region
choro <- merge(states, arrests, sort = FALSE, by = "region")
choro <- choro[order(choro$order), ]
# Plot a map, filling in the states based on murder rate
qplot(long, lat, data = choro, group = group, fill = murder,
geom = "polygon")
Essentially what’s happening here is that the map data (here called states
) includes the latitute and longitude coordinates for the boundaries of each state. Specifying geom = "polygon"
in the qplot
function results in the states being traced out based on the lat/long coordinates. The arrests
data provides crime rates for all the states, which allows us to color each polygon (state) based on the murder rate.
It may be counter-intuitive that light blue indicates areas with high murder rates while dark blue indicates areas with low murder rates. We can fix this by specifying a different gradient using the scale_colour_gradient()
command. Here’s an example.
qplot(long, lat, data = choro, group = group, fill = murder,
geom = "polygon") +
scale_fill_gradient(low = "#56B1F7", high = "#132B43")
All we’ve done here is swapped the defaults for low
and high
.