Tidyverse Fundamentals: {ggplot2}

Inspect the data

Let’s look at mpg, a dataset that ships with {ggplot2}:

mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# … with 224 more rows
# to read the documentation of the dataset (column descriptions, source):
# ?mpg

Bare plot syntax

Then we create a simple plot:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

mapping is a special argument: combined with the aes() function, it maps variables in the data to visual properties (aesthetics) of the plot. We can also map other aesthetics inside aes():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))

Some other aesthetics we can use:

  • size (not recommended for discrete variables).
  • alpha, i.e. transparency.
  • shape

[Figure: the available point shapes]

You can see the docs for a complete reference.

We can also pass these very same arguments outside of the mapping argument, to define behaviours that are independent of the data:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour="blue", alpha=0.8, shape=1)
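A common slip is to put a constant such as "blue" inside aes(). The sketch below contrasts the two placements and uses layer_data() to inspect the colours that are actually drawn; the object names p_mapped and p_set are only illustrative.

```r
library(ggplot2)

# Mapped: "blue" is treated as a one-level categorical variable, so the
# colour scale assigns its own default colour and a legend is drawn.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

# Set: the value is used as-is for every point and no legend is created.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# Inspect the colours computed for each layer.
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```

Only the second plot draws blue points: anything inside aes() is interpreted as data, anything outside as a fixed setting.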

We can use {magrittr}’s pipe operator (%>%) or the native pipe (|>, available since R 4.1) to pass the data as the first argument:

mpg %>%
  ggplot() +
  geom_point(mapping = aes(x = displ, y = hwy))

mpg |> 
  ggplot() +
  geom_point(mapping = aes(x = displ, y = hwy))

The mapping argument

We can also specify the mapping argument inside the ggplot call:

mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point()

The aesthetics specified in this way are inherited by default by every subsequent geom_* (geometry). We can always override them in the geom_* function.
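As a minimal sketch of inheritance and overriding: below, colour = drv is set globally, so the points inherit it, while the smoothing layer removes it again with aes(colour = NULL) and therefore fits a single line. The name p_override and the choice of method = "lm" (used here only to keep the example quiet and fast) are assumptions made for illustration.

```r
library(ggplot2)

p_override <- ggplot(mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  # inherits x, y and colour from ggplot()
  geom_point() +
  # inherits x and y, but drops the inherited colour mapping
  geom_smooth(mapping = aes(colour = NULL), method = "lm", formula = y ~ x)

# Layer 1 uses one colour per drive train; layer 2 has a single group.
length(unique(layer_data(p_override, 1)$colour))
length(unique(layer_data(p_override, 2)$group))
```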

Scales

{ggplot2} also provides scale_* functions to change the scales of a plot:

mpg %>% 
  ggplot(aes(x=displ, y=hwy)) +
  geom_point(aes(color=class, shape=drv), size=2) +
  scale_colour_viridis_d() +
  scale_y_log10() + 
  labs(
    title = "Engine displacement vs Highway consumption",
    subtitle = "Y is in log scale",
    y = "",
  ) +
  theme(
    plot.title.position = "plot",
    legend.position = "left"
  )

To interpret the scale_* layers, read the function name as components separated by underscores:

  • The second component names the aesthetic that the scale controls. In this case:
    • scale_colour affects the colour aesthetic.
    • scale_y affects the y axis.
  • The third component names the kind of scale or transformation that is applied:
    • viridis_d uses the discrete viridis colour map.
    • log10 applies a log10 transformation to the axis.
      • Note that the axis breaks are still labelled in the original units, with the gridlines spaced according to the log scale. This would not have happened if we had transformed the y variable directly (e.g. y = log10(hwy)): the axis would then show the logarithms themselves, which makes the plot harder to interpret.
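The difference between transforming the scale and transforming the variable can be checked with a small sketch. Both plots below place the points at the same positions; only the axis labelling differs. The names p_scale and p_manual are illustrative.

```r
library(ggplot2)

# Transform the scale: the axis keeps labels in miles per gallon.
p_scale <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()

# Transform the variable: the axis shows log10(hwy) values instead.
p_manual <- ggplot(mpg, aes(x = displ, y = log10(hwy))) +
  geom_point()

# The computed y positions of the points are the same in both plots.
all.equal(layer_data(p_scale)$y, layer_data(p_manual)$y)
```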

Coordinate planes

Here we use an example of a slightly more advanced data transformation. The mutate() function (from the {dplyr} package) can apply the same transformation to multiple columns. Instead of writing out every column by hand, we can use the across() function, which has a .cols argument and a .fns argument. To the first, we pass where(is_character): this selects all the columns of our dataset that are of type character. The second denotes the transformation to apply to them, in this case a conversion into factors with as_factor() (from the {forcats} package).
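To see what this transformation does on its own, here is a minimal sketch that uses only {dplyr} and base R. It swaps in base is.character() and as.factor() for the tidyverse helpers, which is an assumption made to keep the example self-contained; note that as.factor() sorts the levels alphabetically, whereas forcats::as_factor() keeps them in order of first appearance.

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

mpg_factors <- mpg %>%
  # convert every character column to a factor, leave the others untouched
  mutate(across(.cols = where(is.character), .fns = as.factor))

# How many columns are factors now, and is a numeric column unchanged?
sum(sapply(mpg_factors, is.factor))
class(mpg_factors$displ)
```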

With a simple + coord_*, we can change the coordinates of our plots:

mpg %>% 
  # change the column type from char to factor
  mutate(across(.cols = where(is_character), .fns = as_factor)) %>% 
  ggplot(aes(x=manufacturer)) +
  geom_bar(aes(fill=drv)) +
  scale_fill_viridis_d() +
  coord_polar()

mpg %>% 
  # change the column type from char to factor
  mutate(across(.cols = where(is_character), .fns = as_factor)) %>% 
  ggplot(aes(x=manufacturer)) +
  geom_bar(aes(fill=drv)) +
  scale_fill_viridis_d() +
  coord_flip()

While coord_polar() is more of a niche tool, coord_flip() is an elegant way to swap the axes of a plot. We could instead swap the x and y arguments in the aes() mapping, but this does not always work: in older versions of {ggplot2}, geom_bar() only accepted an x aesthetic and computed y “under the hood”. Since version 3.3.0, geom_bar() and several other geoms also accept the discrete variable on y and infer the orientation from the mapping.
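As a sketch of the two routes to a horizontal bar chart, assuming {ggplot2} 3.3.0 or later for the second one (the names p_flip and p_y are illustrative):

```r
library(ggplot2)

# Route 1: build a vertical bar chart, then flip the coordinates.
p_flip <- ggplot(mpg, aes(x = manufacturer)) +
  geom_bar(aes(fill = drv)) +
  coord_flip()

# Route 2: map the discrete variable to y and let geom_bar() infer
# the orientation (requires ggplot2 >= 3.3.0).
p_y <- ggplot(mpg, aes(y = manufacturer)) +
  geom_bar(aes(fill = drv))

# Both layers count every row of the data exactly once.
sum(layer_data(p_flip)$count)
sum(layer_data(p_y)$count)
```

coord_flip() remains useful for geoms that do not infer their orientation, or when an existing plot has to be rotated without touching its mappings.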

Facets

The facet_wrap() “layer” can be used to split the plot into different facets.

The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
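If the formula notation feels opaque, facet_wrap() also accepts variables wrapped in vars(). The sketch below builds the same plot both ways and checks that the points are assigned to the same panels; p_formula and p_vars are illustrative names.

```r
library(ggplot2)

# Facets specified with a one-sided formula.
p_formula <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# The same facets specified with vars().
p_vars <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(vars(class), nrow = 2)

# Each point is assigned to the same panel in both plots.
identical(layer_data(p_formula)$PANEL, layer_data(p_vars)$PANEL)
```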

You can also create a grid, to facet the plot on the combination of two variables:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Geoms

A geom is the geometric object that a plot uses to represent the data. Here we use a small extension package, {patchwork}, to display two plot objects side by side:

library(patchwork)

p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  labs(title = "Scatterplot") + 
  theme(plot.title.position = "plot")

p2 <- ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy)) +
  labs(title = "Smooth Regression Line") +
  theme(plot.title.position = "plot")

p1 | p2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As said above, geoms can take a mapping argument. However, not every aesthetic is understood by every geom: for example, geom_line() has no shape aesthetic, so passing shape to it produces a warning about an unknown aesthetic.
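One way to check which aesthetics a geom understands, besides reading its help page, is to query the underlying ggproto object. This is a small sketch that relies on the exported GeomPoint and GeomLine objects and their aesthetics() method:

```r
library(ggplot2)

# Aesthetics understood by each geom.
point_aes <- GeomPoint$aesthetics()
line_aes <- GeomLine$aesthetics()

# shape is meaningful for points but not for lines.
"shape" %in% point_aes
"shape" %in% line_aes
```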

Combining mappings can often make a visualisation much easier to grasp. Compare grouping by drv alone with mapping drv to colour:

p2a <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv)) +
  labs(
    title = "Smooth Regression Line",
    subtitle = "By type of drive train (no color)"
  ) +
  theme(plot.title.position = "plot")
    
p2b <- ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  ) +
  labs(
    title = "Smooth Regression Line",
    subtitle = "By type of drive train (with color)"
    ) +
  theme(plot.title.position = "plot")

p2a | p2b
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can also layer them:

mpg %>% 
  ggplot(mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth() +
  labs(
    title = "Smooth Regression Line + Scatterplot",
    ) +
  theme(plot.title.position = "plot")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this case, setting a “global” mapping can help remove code duplication. We can still set custom aesthetics for each layer:

mpg %>% 
  ggplot(mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color=class)) +
  geom_smooth() +
  labs(
    title = "Smooth Regression Line + Scatterplot",
    ) +
  theme(plot.title.position = "plot")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can also change the data in each layer:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = mpg %>% filter(class == "subcompact"), se = FALSE) +
  labs(
    title = "Smooth regression line + scatterplot",
    subtitle = "Smooth line fitted only on `class == 'subcompact'`"
  ) +
  theme(plot.title.position = "plot")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
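The data argument of a layer can also be a function, which is then applied to the data of the plot. The sketch below uses this to highlight one class with a second point layer; the plot name p_highlight and the colours are illustrative choices, and base subset() is used so that only {ggplot2} has to be loaded.

```r
library(ggplot2)

p_highlight <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # all cars in the background
  geom_point(colour = "grey70") +
  # only the subcompact cars, derived from the plot data by a function
  geom_point(
    data = function(d) subset(d, class == "subcompact"),
    colour = "red"
  )

# The second layer contains only the filtered rows.
nrow(layer_data(p_highlight, 2))
```

Passing a function avoids repeating the name of the dataset, which helps when the plot is built at the end of a longer pipeline.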

Extensions to {ggplot2}

{ggplot2} is so influential and versatile that many packages have been written to extend its functionality, or to wrap it to build more advanced visualisations. You can see a list here.

Visualise Models

  • The {parameters} and {see} packages are part of the {easystats} framework, to make statistical plotting and modelling easier.
  • The {ggsci} package provides colour palettes inspired by scientific journals.

library(parameters)
library(see)

mpg %>%
  # turn columns into factors
  mutate(across(where(is.character), as_factor)) %>% 
  # fit a regression model
  lm(hwy ~ manufacturer + year + cyl + fl + class, data = .) %>% 
  parameters() %>% 
  plot() +
  ggplot2::labs(title = "A Dot-and-Whisker Plot") + 
  ggsci::scale_color_npg()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Extensions can also be quite sophisticated, such as {ggstatsplot}:

ggstatsplot::ggbetweenstats(
  data  = iris,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

Plain {ggplot2} layers can also be composed into a similarly rich visualisation in just a few lines:

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_boxplot(alpha = 0.2) +
  geom_jitter()