Data Visualization with ggplot2

Packages
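
The examples in these notes appear to assume that the tidyverse is attached; this is an assumption, since the original setup chunk is not shown. ggplot2 supplies the plotting functions and the mpg and diamonds datasets, while dplyr and tibble supply filter() and tribble(), which are used further down. A minimal setup sketch:

```r
# Attach the tidyverse; this loads ggplot2, dplyr, tibble and related packages.
# install.packages("tidyverse")  # run once if the package is missing
library(tidyverse)

# Quick check that the example data used in these notes is available.
dim(mpg)
```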

Aesthetics

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

Facet Wrap

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
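
To make the "formula is a data structure" point concrete, the sketch below stores the formula in a variable before passing it to facet_wrap(). The names facet_formula, p_formula and p_vars are illustrative. vars() is shown as an alternative way to name the faceting variable, assuming a reasonably recent ggplot2 (3.0.0 or later).

```r
library(ggplot2)

# A one-sided formula is an ordinary R object that stores the variable name.
facet_formula <- ~ class
class(facet_formula)      # the object's class
all.vars(facet_formula)   # the variable named on the right-hand side

# The stored formula can be passed to facet_wrap() like any other value.
p_formula <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(facet_formula, nrow = 2)

# vars() is another way to name the faceting variable.
p_vars <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(vars(class), nrow = 2)
```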

Facet Grid

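To facet a plot on the combination of two variables, use facet_grid(). Its first argument is also a formula: two variable names separated by ~, with the row variable on the left and the column variable on the right: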

# mpg data, faceted by drv (rows) and cyl (columns)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color=class)) +
  facet_grid(drv ~ cyl)

Exercises

  1. What happens if you facet on a continuous variable?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 2)

2. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))
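
Both facet exercises can be checked numerically rather than by eye. The sketch below counts the distinct values of cty (facet_wrap() draws one panel per distinct value, so a continuous variable tends to produce many sparse panels) and tabulates drv against cyl. Combinations with a count of zero correspond to the empty facets, and to the positions without points in the drv/cyl scatterplot. The names n_cty_panels, combo_counts and empty_cells are illustrative.

```r
library(ggplot2)

# Exercise 1: one panel per distinct value of the faceting variable.
n_cty_panels <- length(unique(mpg$cty))
n_cty_panels

# Exercise 2: rows are drv, columns are cyl; zeros mark empty facets.
combo_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
combo_counts

# List the drv/cyl combinations that never occur in mpg.
empty_cells <- as.data.frame(combo_counts)
empty_cells[empty_cells$Freq == 0, c("drv", "cyl")]
```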

Geometric Objects

  • A geom is the geometrical object that a plot uses to represent data.
  • For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on.
  • To change the geom in your plot, change the geom function that you add to ggplot().
  • The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://www.rstudio.com/resources/cheatsheets/.
# left
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# right
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = TRUE
  )

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

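You can pass a set of mappings to ggplot() itself. ggplot2 treats them as global mappings that apply to every geom, which avoids repeating the same mappings in each layer: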

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

You can pass additional mappings to a geom function. This makes it possible to display different aesthetics in different layers.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(
    data = filter(mpg, class == "subcompact"),
    se = FALSE
  )

Exercises

  1. Which geoms draw a line chart, a boxplot, a histogram, and an area chart?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_line()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_boxplot(aes(group=cyl))

ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins=10)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_area()

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions:
ggplot(
  data = mpg,
  mapping = aes(x = displ, y = hwy, color = drv)
) +
  geom_point() +
  geom_smooth(se = FALSE)

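  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
  • ANSWER: show.legend = FALSE removes the legend for that layer. If you remove the argument, ggplot2 falls back to its default and draws a legend whenever an aesthetic such as color is mapped to a variable.
  4. What does the se argument to geom_smooth() do?
  • ANSWER: se controls whether the confidence band around the smooth line is drawn. It is TRUE by default; se = FALSE hides the band.
  5. Will these two graphs look different? Why/why not?
  • ANSWER: They should look the same. Both use the same data and mappings; the first sets them globally in ggplot(), the second repeats them locally in each geom.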
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = mpg,
    mapping = aes(x = displ, y = hwy)
  ) +
  geom_smooth(
    data = mpg,
    mapping = aes(x = displ, y = hwy)
  )

  6. Recreate the R code necessary to generate the graphs in the book.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))
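
The global-versus-local mapping question above can also be checked programmatically. The sketch below builds both versions of the plot and compares the data that each layer would draw, using layer_data() from ggplot2. The names global_map, local_map, same_points and same_smooth are illustrative, and the comparison assumes the default smoother is deterministic for this data.

```r
library(ggplot2)

# Mappings set once, globally, in ggplot().
global_map <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated locally in each geom.
local_map <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# layer_data() returns the data a layer actually draws.
same_points <- isTRUE(all.equal(layer_data(global_map, 1), layer_data(local_map, 1)))
same_smooth <- isTRUE(all.equal(layer_data(global_map, 2), layer_data(local_map, 2)))
c(points = same_points, smooth = same_smooth)
```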

Statistical Transformations

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

  • Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
  • Smoothers fit a model to your data and then plot predictions from the model.
  • Boxplots compute a robust summary of the distribution and display a specially formatted box.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
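
One way to see what the stat computed for this bar chart is to inspect the layer data after the plot is built. The sketch below uses layer_data() and compares the computed count column with a manual tabulation; bar_plot, bar_data and manual_counts are illustrative names.

```r
library(ggplot2)

bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The stat adds computed columns such as count to the layer data.
bar_data <- layer_data(bar_plot, 1)
bar_data[, c("x", "count")]

# The same counts computed by hand, in factor-level order.
manual_counts <- as.vector(table(diamonds$cut))
manual_counts
```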

You can use geoms and stats interchangeably. This works because every geom has a default stat, and every stat has a default geom.

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
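
The pairing of a geom with its default stat can be inspected on the layer objects themselves. This is only a sketch: it reads the stat and geom fields of the layer, which are implementation details of ggplot2 rather than a documented interface, and default_stat and default_geom are illustrative names.

```r
library(ggplot2)

# A layer created by a geom function records the stat it will use ...
default_stat <- class(geom_bar()$stat)[1]
# ... and a layer created by a stat function records its geom.
default_geom <- class(stat_count()$geom)[1]

c(geom_bar_uses = default_stat, stat_count_uses = default_geom)
```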

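You can display proportions rather than counts by mapping y to the computed variable prop. In ggplot2 3.3.0 and later, after_stat(prop) replaces the older ..prop.. notation:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, y = after_stat(prop), group = 1)
  )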

stat_summary() summarizes the y values for each unique x value:

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
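
For comparison, the same three summary values can be computed by hand. The sketch below uses dplyr (an assumption; any grouping tool would do) to produce one row per cut with the minimum, median and maximum depth, which is what the stat_summary() call above is asked to plot. depth_summary is an illustrative name.

```r
library(ggplot2)  # for the diamonds data
library(dplyr)

# One row per cut: the values drawn as the bottom, point and top of each range.
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),
    y    = median(depth),
    ymax = max(depth)
  )
depth_summary
```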

Exercises

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
  • ANSWER: The default geom is geom_pointrange(). The same plot written with the geom function:
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median,
    stat = "summary"
  )

  2. What does geom_col() do? How is it different to geom_bar()?
  • ANSWER: The default stat of geom_col() is stat_identity(), which leaves the data as is. The geom_col() function expects that the data contains x values and y values which represent the bar height. No transformation is done to the data, unlike geom_bar()
demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)

ggplot(data=demo) +
  geom_col(mapping = aes(x=a, y=b))

  3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
geom                    stat
geom_bar()              stat_count()
geom_bin2d()            stat_bin_2d()
geom_boxplot()          stat_boxplot()
geom_contour_filled()   stat_contour_filled()
geom_contour()          stat_contour()
geom_count()            stat_sum()
geom_density_2d()       stat_density_2d()
geom_density()          stat_density()
geom_dotplot()          stat_bindot()
geom_function()         stat_function()
geom_sf()               stat_sf()
geom_smooth()           stat_smooth()
geom_violin()           stat_ydensity()
geom_hex()              stat_bin_hex()
geom_qq_line()          stat_qq_line()
geom_qq()               stat_qq()
geom_quantile()         stat_quantile()
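  4. What variables does stat_smooth() compute? What parameters control its behavior?
  • ANSWER: stat_smooth() is the stat that geom_smooth() uses by default. It computes the predicted value (y), the lower and upper bounds of the pointwise confidence interval (ymin, ymax), and the standard error (se). Its behavior is controlled mainly by method, formula, se, level, span, and n.
  5. In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?
  • ANSWER: By default geom_bar() groups the data by the x variable, and the stat computes proportions within each group. Every cut is then 100% of its own group, so all bars have height 1. Setting group = 1 puts all rows into a single group, so each bar shows its share of the total.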
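
The effect of group = 1 can be seen directly in the computed prop values. The sketch below builds the proportion bar chart with and without the group setting and extracts prop with layer_data(). It assumes ggplot2 3.3.0 or later for after_stat(); no_group, one_group and the prop_* names are illustrative.

```r
library(ggplot2)

# Without group = 1, each cut forms its own group.
no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all rows belong to one group.
one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

prop_no_group  <- layer_data(no_group)$prop
prop_one_group <- layer_data(one_group)$prop

prop_no_group    # proportion of each cut within its own group
prop_one_group   # proportion of each cut within the whole dataset
```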

Position Adjustments

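You can color a bar chart with the color aesthetic, which sets the outline, or with fill, which colors the inside of the bars. Mapping fill to a second categorical variable, such as clarity, stacks the bars automatically: each colored rectangle represents one combination of cut and clarity.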

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

Stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill":

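  1. position = "identity" places each object exactly where it falls in the context of the graph. This is not very useful for bars because they overlap; to see the overlapping bars, make them slightly transparent by setting alpha. The identity adjustment is more useful for 2D geoms such as points, where it is the default.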
ggplot(
  data = diamonds,
  mapping = aes(x = cut, fill = clarity)
) +
  geom_bar(alpha = 1/5, position = "identity")

2. position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=clarity), position="fill")

3. position = "dodge" places overlapping objects directly beside one another. Easier to compare individual values.

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position="dodge")

4. position = "jitter" is useful for scatterplots. It adds a small amount of random noise to each point so that points sharing the same coordinates no longer hide one another.

ggplot(data = mpg) +
  geom_point(mapping = aes(x=displ, y=hwy, color=class), position="jitter")
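
To see what a position adjustment changes, the bar heights can be compared after the plot is built. The sketch below contrasts the default "stack" with "fill" using layer_data(): stacked bars top out at the total count for each cut, while filled bars all top out at 1. The names base, stacked, filled, stack_tops and fill_tops are illustrative.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

stacked <- layer_data(base + geom_bar(position = "stack"))
filled  <- layer_data(base + geom_bar(position = "fill"))

# Height of the top of each bar (largest ymax per x position).
stack_tops <- tapply(stacked$ymax, as.integer(stacked$x), max)
fill_tops  <- tapply(filled$ymax,  as.integer(filled$x),  max)

stack_tops
fill_tops
```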

Exercises

  1. What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

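  • ANSWER: cty and hwy are rounded to whole numbers, so many points share the same coordinates and are drawn on top of one another in a grid. Adding jitter reveals the hidden points: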

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color=class)) +
  geom_point(position="jitter")
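
The amount of overplotting can be quantified, and the jitter can be tuned. The sketch below counts how many distinct (cty, hwy) positions exist compared with the number of rows, then uses position_jitter() to set the noise width, height and a seed for reproducibility. The width, height and seed values are arbitrary choices for illustration; geom_count() is shown as an alternative that sizes points by the number of overlaps.

```r
library(ggplot2)

# Compare the number of rows with the number of distinct plotting positions.
n_rows <- nrow(mpg)
n_unique_positions <- nrow(unique(mpg[, c("cty", "hwy")]))
c(rows = n_rows, unique_positions = n_unique_positions)

# position_jitter() exposes the amount of noise and a seed for reproducibility.
jittered <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))

# geom_count() sizes each point by the number of observations at that position.
counted <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
```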

Coordinate Systems

The default coordinate system is Cartesian, but there are others:

  1. coord_flip(): Switches x-y axes.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

2. coord_quickmap(): Sets correct aspect ratio for maps.

# map_data() requires the maps package to be installed
nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap() +
  labs(title="Map of New Zealand", x="x coord", y="y coord")

  3. coord_polar(): Uses polar coordinates. The labs() function adds axis titles, a plot title, and a caption to the plot; here x = NULL and y = NULL remove the axis titles.
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

  1. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()

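  • ANSWER: The points sit above the reference line, so cars get better mileage on the highway than in the city. With no arguments, geom_abline() draws that reference line with slope 1 and intercept 0. coord_fixed() forces one unit on the x axis to cover the same length as one unit on the y axis, so the line is drawn at a 45-degree angle. This makes the comparison easier because people judge slopes most easily relative to 45 degrees.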
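
The relationship can also be checked numerically. The sketch below computes the share of cars whose highway mpg exceeds their city mpg, and rewrites the plot with the defaults of geom_abline() and coord_fixed() spelled out. share_above_line and explicit_plot are illustrative names.

```r
library(ggplot2)

# Share of cars whose highway mpg exceeds their city mpg.
share_above_line <- mean(mpg$hwy > mpg$cty)
share_above_line

# The same plot with the default slope, intercept and aspect ratio made explicit.
explicit_plot <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed(ratio = 1)
```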

Layered Grammar of Graphics

The template below combines all the plotting elements covered so far. Its seven parameters compose the grammar of graphics, a formal system for building plots.

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>,
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

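As one possible filled-in instance of the template above, the sketch below supplies a value for each of the seven parameters using the diamonds data. The particular choices (dodge position, flipped coordinates, faceting by color) are arbitrary and only meant to show where each parameter goes; template_plot is an illustrative name.

```r
library(ggplot2)

template_plot <- ggplot(data = diamonds) +       # <DATA>
  geom_bar(                                      # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),      # <MAPPINGS>
    stat = "count",                              # <STAT>
    position = "dodge"                           # <POSITION>
  ) +
  coord_flip() +                                 # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                            # <FACET_FUNCTION>

template_plot
```
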
APPENDIX

Exporting notebooks via Knit

  • In RStudio, you must use the Knit button, not “Knit on Save”. The latter throws an infuriating error message that doesn’t tell you what has gone wrong.
  • Knitting to PDF requires pdflatex. The tinytex package should be able to provide it, but the installation seems to be finicky.
  • Worst-case scenario: just Knit to HTML and then export the page to PDF through the browser.
  • What eventually worked was installing TinyTeX through a specific mirror: tinytex::install_tinytex(repository = "http://mirrors.tuna.tsinghua.edu.cn/CTAN/", version = "latest")
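
As an alternative to the Knit button, rendering can be started from the console. This is only a sketch: knit_to_html() and has_tinytex() are hypothetical helpers defined here, the file name in the example call is a placeholder, and the rmarkdown and tinytex packages are assumed to be installed.

```r
# Hypothetical helper: render an R Markdown file to HTML from the console.
knit_to_html <- function(rmd_path) {
  rmarkdown::render(rmd_path, output_format = "html_document")
}

# Hypothetical helper: report whether the LaTeX distribution in use is TinyTeX,
# which is worth checking before knitting to PDF.
has_tinytex <- function() {
  requireNamespace("tinytex", quietly = TRUE) && tinytex::is_tinytex()
}

# Example call (the file name is a placeholder):
# knit_to_html("notebook.Rmd")
```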