During the past two weeks we have become familiar with simple datasets that contain case and death counts from the COVID-19 pandemic in the USA (https://github.com/nytimes/covid-19-data) and in Italy (https://github.com/pcm-dpc/COVID-19). The dataset from Italy also contains information on hospitalizations and tests that is not quite as easily available for the USA. In the last sentence, the emphasis is on easily: you might very well be able to get hold of information of this type by relying on curation carried out by different research groups, or by focusing on some states only. (For example, for California, as of July 8, you can take a look at the data available at https://data.ca.gov/group/covid-19; more information on what this data represents and how it is collected can be found at https://covid19.ca.gov/data-and-tools/.) This week you are invited to look at data from the USA and Italy at the same time and identify common patterns and/or differences.
The primary medium we are going to use to facilitate comparisons between what happened in Italy and in the USA (or in separate areas of these two countries) is graphical displays. The pre-work for this week provided you with a number of resources to understand from experts what makes a good graph. To practice those principles, it is useful to have a good tool in hand. R has very good graphical tools. This is not a surprise, as R is the child of S, a statistical language created at Bell Labs by John Chambers (whom you will meet this summer), Rick Becker, and Allan Wilks, in the same environment that nurtured people like Bill Cleveland (Cleveland (1985), The Elements of Graphing Data; Cleveland (1993), Visualizing Data) and that put at the forefront a way of doing data analysis inspired by Tukey (Tukey (1977), Exploratory Data Analysis), who gave great importance to graphical displays.
These days, the package ggplot2 is often the go-to tool for graphics in R. There are a number of resources on how to use ggplot2. (One place to start might be Chapter 3 of R for Data Science, co-written by the author of ggplot2; Chapter 3 in S. Holmes & W. Huber, “Modern Statistics for Modern Biology,” gives an introduction to R graphics and ggplot2, reaching some fairly advanced places.) To be totally honest with you, I do not think that ggplot2 is perfect, but it is powerful, and once you have a good command of it, having learned how to play with its functions, you can make really high quality graphs. The trick, as with any respectable tool, is to learn how to use it well, so that it enables you to do what you want, rather than being limited by its constraints. (Laurel Stell’s webpage hosts materials from two short courses on graphics in R: Intro to ggplot2 and Advanced graphics in R. You will find it useful that the .Rmd files used to create these presentations are available. These are good examples of what it means to master a tool.) You might be interested in knowing that Ben is an R guru and he can be very helpful with ggplot2. As for Python, my bet is that you can do whatever you want in Python, but my experience looking at graphics generated with Python is that it might take more work on your part to get to high quality displays. So this week every one of you is invited to play a bit with ggplot2 (but you are very welcome to prove me wrong by turning in some stunning Python graphics).
The name ggplot2 comes from “grammar of graphics,” a term introduced by Wilkinson (another member of the Bell Labs gang; see Wilkinson (2005), The Grammar of Graphics) to indicate “rules for constructing graphs mathematically and then representing them as graphics aesthetically.” “Such a grammar allows us to move beyond named graphics” and ggplot2 is “an open source implementation of the grammar of graphics for R” (Wickham, The Layered Grammar of Graphics). ggplot2 is part of the Tidyverse (https://www.tidyverse.org), “an opinionated collection of R packages designed for data science.” The packages are very useful and very popular (H. Wickham received the COPSS award in 2019), but they have rather difficult naming conventions and, in my opinion, not enough examples in the help entries. ggplot2, with its layered structure, where different elements of the display are combined with “+”, allows you to obtain sophisticated graphics with remarkably little effort. This is both a good and a bad thing. It is good to be able to get to the finish line quickly. But “professional looking” graphics are not necessarily good graphics. You still have to think carefully about what variables you are plotting and which display you are using: a “good look” might make us think we have arrived, when we are still at the beginning. Another downside is the price one has to pay for having a program that “understands so quickly what you want”: if what you want is not what the program is written to understand, it might take a little arm-wrestling to redirect it successfully.
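To make the layered structure concrete, here is a minimal sketch, assuming ggplot2 is installed. The data frame `toy` and all the numbers in it are made up for illustration; they are not real COVID-19 counts. Each “+” adds one element to the plot: a line layer, a point layer, a log scale, and axis labels.

```r
library(ggplot2)

# Made-up toy numbers, for illustration only (NOT real COVID-19 counts).
toy <- data.frame(
  day     = rep(1:5, times = 2),
  country = rep(c("Italy", "USA"), each = 5),
  cases   = c(10, 20, 35, 60, 100, 5, 15, 40, 90, 180)
)

# Start from the data and the aesthetic mapping, then add one element at a time.
p <- ggplot(toy, aes(x = day, y = cases, colour = country)) +
  geom_line() +                      # layer 1: one line per country
  geom_point() +                     # layer 2: mark the individual observations
  scale_y_log10() +                  # log scale, so growth rates can be compared
  labs(x = "Day", y = "Cumulative cases (log scale)", colour = NULL)

print(p)
```

Removing or swapping any single line after the first “+” changes one aspect of the display and leaves the rest alone, which is a convenient way to experiment.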
I personally love the magic that happens when you add + facet_grid(...) to the current plot: ggplot2 is great for small multiples. In contrast, I do not like the default color scale: unless you have very few items, it becomes very difficult to separate them out. (This website gives a useful display of color options in R.) The default theme, which equips every graph with a gray background, might be appropriate for screen viewing, but it is not effective for printing or slide displays.
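As a rough sketch of these three points, assuming ggplot2 is installed, the example below builds small multiples with facet_grid(), replaces the default color scale with hand-picked colors, and swaps the gray default theme for theme_bw(). The data frame `toy`, its column names, and its values are made up for illustration only; the two hex colors are just one possible choice.

```r
library(ggplot2)

# Made-up toy data, for illustration only (NOT real COVID-19 counts).
toy <- expand.grid(
  day     = 1:5,
  country = c("Italy", "USA"),
  measure = c("cases", "deaths")
)
toy$count <- toy$day *
  ifelse(toy$measure == "cases", 100, 5) *
  ifelse(toy$country == "USA", 2, 1)

p <- ggplot(toy, aes(x = day, y = count, colour = country)) +
  geom_line() +
  # Small multiples: one row per measure, each with its own y-axis range.
  facet_grid(measure ~ ., scales = "free_y") +
  # Replace the default color scale with explicitly chosen colors.
  scale_colour_manual(values = c(Italy = "#1b9e77", USA = "#d95f02")) +
  # White background instead of the default gray one.
  theme_bw() +
  labs(x = "Day", y = "Count", colour = NULL)

print(p)
```

Keeping both countries in the same panel, with one panel per measure, puts the quantities to be compared next to each other; other facet layouts may suit other comparisons better.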
Create two graphical displays that “show comparisons, contrast, differences” in any aspect of the data on the COVID-19 pandemic that you choose to analyze. Be inspired by Matisse’s quote: “I do not paint things, I paint only the differences between things.” Avoid the “non-information overload”: make sure that (1) there are no distracting elements in your displays and that (2) you are providing the reader with a maximum of information, enabling comparisons and reasoning on the data.
The pre-work for the last two weeks provided you with some example graphs in ggplot. You can use those as a starting point if you feel uninspired: there is much to improve and comparisons and context to add. Think of yourself as a public health investigator and unleash your creativity!
This exercise will work best if you all use .Rmd and R for your plots. This way we can easily share code and learn tricks from each other (how did you get the legend to look like that? how did you assign colors that way? how did you generate that display?). Thanks for giving it a try.
This is the second time you are invited to make some displays of this data, and it is not going to be the last. As with everything else, learning graphical excellence is a process: be patient and keep pushing your boundaries, wherever they are, learning and trying something new. Maybe take one of the statements that most surprised you in the reading and try to work with it: see how your graphs would change if you decided to follow it.