Intro to the Tidyverse

What’s `TidyTuesday`?

Join the R4DS Online Learning Community in the weekly #TidyTuesday event! Every week we post a raw dataset, a chart or article related to that dataset, and ask you to explore the data. While the dataset will be “tamed”, it will not always be tidy! As such, you might need to apply various R for Data Science techniques to wrangle the data into a true tidy format. The goal of TidyTuesday is to apply your R skills, get feedback, explore others’ work, and connect with the greater #RStats community! As such, we encourage everyone of all skill levels to participate!

Set up the notebook options

Load the Data

base_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-15"

image_alt <- readr::read_csv(glue::glue("{base_url}/image_alt.csv"))

Rows: 90 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (2): percent, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

color_contrast <- readr::read_csv(glue::glue("{base_url}/color_contrast.csv"))

Rows: 90 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (2): percent, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ally_scores <- readr::read_csv(glue::glue("{base_url}/ally_scores.csv"))

Rows: 90 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bytes_total <- readr::read_csv(glue::glue("{base_url}/bytes_total.csv"))

Rows: 457 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

speed_index <- readr::read_csv(glue::glue("{base_url}/speed_index.csv"))

Rows: 238 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First trick: instead of copy-pasting the URL everywhere, we used the {glue} package to combine the strings. This is reminiscent of f-strings in Python.
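As a small illustration of the interpolation, here is a minimal sketch that builds several URLs from one template. The `files` vector is made up for this example; `glue()` is vectorised, so it returns one string per file name.

```r
library(glue)

base_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-15"

# Hypothetical vector of file stems, used only for this illustration
files <- c("image_alt", "speed_index")

# Expressions inside {} are evaluated and spliced into the string,
# much like f"{base_url}/{name}.csv" in Python
urls <- glue("{base_url}/{files}.csv")

# The base R equivalent uses paste0()
urls_base <- paste0(base_url, "/", files, ".csv")
```

Both approaches give the same strings; the glue template simply keeps the shape of the final URL visible.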

Let’s start by inspecting the speed_index:

speed_index |>
  select(timestamp) |>
  head()

# A tibble: 6 × 1
      timestamp
          <dbl>
1 1664582400000
2 1664582400000
3 1661990400000
4 1661990400000
5 1659312000000
6 1659312000000

We used the |> pipe operator. This is the native pipe, available since R 4.1. You can use {magrittr}’s %>% as well! Then we used the select() function to extract a single column.
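To make the pipe less magical, here is a short sketch on a toy tibble (the `toy` object and its values are made up for illustration). It shows that both pipes simply pass the left-hand side as the first argument of the call on the right.

```r
library(dplyr)

# Toy data standing in for speed_index (made-up for this example)
toy <- tibble(
  timestamp = c(1664582400000, 1661990400000),
  p50 = c(3.88, 3.91)
)

# x |> f(y) is rewritten by the parser to f(x, y)
with_native <- toy |> select(timestamp)

# magrittr's pipe (re-exported by dplyr) gives the same result here
with_magrittr <- toy %>% select(timestamp)

# The plain function call that both pipes stand for
nested <- select(toy, timestamp)

identical(with_native, nested)
identical(with_magrittr, nested)
```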

The timestamp is recorded as milliseconds since the Unix epoch (January 1, 1970, 00:00 UTC).

speed_index |>
  mutate(
    date = as.POSIXct(timestamp / 1000, origin = lubridate::origin, tz = "UTC")
    ) |>
  select(date) |>
  head()

# A tibble: 6 × 1
  date               
  <dttm>             
1 2022-10-01 00:00:00
2 2022-10-01 00:00:00
3 2022-09-01 00:00:00
4 2022-09-01 00:00:00
5 2022-08-01 00:00:00
6 2022-08-01 00:00:00

mutate() can create a new column or modify an existing one in place. Here the data already had a date column (read as text), so our converted timestamp overwrote it.
In other words, we transformed the timestamp into a date even though the dataset already provides one!
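The conversion can be checked by hand on a single value. This is a minimal base R sketch using the first timestamp printed earlier; the variable names are only illustrative.

```r
# First timestamp from the printed output, in milliseconds since the epoch
ts_ms <- 1664582400000

# Convert milliseconds to seconds, then to a date-time in UTC
ts_s <- ts_ms / 1000
converted <- as.POSIXct(ts_s, origin = "1970-01-01", tz = "UTC")

# Sanity check: whole days elapsed since 1970-01-01
days_since_epoch <- ts_s / (60 * 60 * 24)

format(converted, "%Y-%m-%d")
# "2022-10-01"
```

Dividing by 1000 matters: passing the raw millisecond value would be interpreted as seconds and land tens of thousands of years in the future.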

Let’s write a function to apply to our dataset:

prep_data <- . %>%
  select(-timestamp) %>%
  mutate(date = lubridate::ymd(date))

This is a {magrittr} functional sequence: starting a pipeline with . defines a function instead of running it immediately. For it to work, we must use {magrittr}’s pipe:

Try replacing the %>% with |> and see what happens.
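For comparison, here is a sketch of the same preparation step written as an ordinary named function, which works with either pipe. The name `prep_data_explicit` and the `toy` input are made up for this example.

```r
library(dplyr)

# The functional sequence from the text, built with the magrittr pipe
prep_data <- . %>%
  select(-timestamp) %>%
  mutate(date = lubridate::ymd(date))

# The same steps as an ordinary function (hypothetical name);
# no placeholder is needed, so the native pipe is fine inside it
prep_data_explicit <- function(data) {
  data |>
    select(-timestamp) |>
    mutate(date = lubridate::ymd(date))
}

# Toy input mimicking the raw columns: character date, numeric timestamp
toy <- tibble(
  date = c("2022-10-01", "2022-09-01"),
  timestamp = c(1664582400000, 1661990400000),
  p50 = c(3.88, 3.91)
)

identical(prep_data(toy), prep_data_explicit(toy))
```

The functional sequence is shorter to type; the explicit function is easier to document and extend with extra arguments.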

ally_scores <- readr::read_csv(glue::glue("{base_url}/ally_scores.csv")) |> prep_data()

Rows: 90 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bytes_total <- readr::read_csv(glue::glue("{base_url}/bytes_total.csv")) |> prep_data()

Rows: 457 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

speed_index <- readr::read_csv(glue::glue("{base_url}/speed_index.csv")) |> prep_data()

Rows: 238 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (6): p10, p25, p50, p75, p90, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

image_alt <- readr::read_csv(glue::glue("{base_url}/image_alt.csv")) |> prep_data()

Rows: 90 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (2): percent, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

color_contrast <- readr::read_csv(glue::glue("{base_url}/color_contrast.csv")) |> prep_data()

Rows: 90 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): measure, client, date
dbl (2): percent, timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

speed_index %>% head()

# A tibble: 6 × 8
  measure    client  date         p10   p25   p50   p75   p90
  <chr>      <chr>   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
1 speedIndex desktop 2022-10-01  1.59  2.42  3.88  6.45  10.5
2 speedIndex mobile  2022-10-01  2.92  4.03  5.87  8.85  13.3
3 speedIndex desktop 2022-09-01  1.61  2.45  3.91  6.5   10.6
4 speedIndex mobile  2022-09-01  2.92  4.04  5.88  8.86  13.3
5 speedIndex desktop 2022-08-01  1.62  2.48  3.96  6.56  10.6
6 speedIndex mobile  2022-08-01  2.98  4.16  6.12  9.37  14.2

Data Exploration with Percentiles

speed_index %>%
  ggplot(aes(date, p50, color=client)) +
  geom_line() +
  geom_ribbon(aes(ymin=p25, ymax=p75), alpha=0.2) +
  labs(
    title="Speed by Client",
  ) +
  theme(plot.title.position = 'plot')

ribbon_plot <- function(data, title) {
  data %>% 
    ggplot(aes(date, p50, color= client, fill=client)) +
    geom_line() +
    geom_ribbon(aes(ymin=p25, ymax=p75), alpha=0.2) +
    labs(
      title=title,
      subtitle="25th and 75th percentile by Client",
      y="",
    ) +
    theme(plot.title.position = 'plot')
} 
  
ally_scores %>% ribbon_plot(title="Accessibility Scores")

bytes_total %>% ribbon_plot(title="Total bytes")

Percentage Measures

image_alt %>%
  ggplot(aes(date, percent, color=client)) +
  geom_line()

color_contrast %>%
  ggplot(aes(date, percent, color=client)) +
  geom_line()

Other visualisations

image_alt %>%
  ggplot(aes(percent, fill=client)) +
  geom_density(color="white", alpha=0.3)

image_alt %>%
  ggplot(aes(percent, fill=client)) +
  geom_histogram(color="white", alpha=0.3)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
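The message above is ggplot2 suggesting an explicit bin width. Here is a minimal sketch of how that might look, on made-up data; the value `0.5` is an arbitrary assumption and should be tuned to the real range of `percent`.

```r
library(ggplot2)

# Toy percentages standing in for image_alt (made-up values)
toy <- data.frame(
  percent = c(52.1, 52.8, 53.4, 54.0, 54.9, 55.3, 56.2, 57.0),
  client = rep(c("desktop", "mobile"), each = 4)
)

# binwidth is expressed in the units of the x variable:
# here each bin spans 0.5 percentage points
p <- ggplot(toy, aes(percent, fill = client)) +
  geom_histogram(binwidth = 0.5, color = "white", alpha = 0.3)
```

Setting `binwidth` (or `bins`) explicitly also silences the message.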

More sophisticated manipulations

combined_percentiles <- bind_rows(speed_index, bytes_total, ally_scores)

combined_percentiles %>% head()

# A tibble: 6 × 8
  measure    client  date         p10   p25   p50   p75   p90
  <chr>      <chr>   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
1 speedIndex desktop 2022-10-01  1.59  2.42  3.88  6.45  10.5
2 speedIndex mobile  2022-10-01  2.92  4.03  5.87  8.85  13.3
3 speedIndex desktop 2022-09-01  1.61  2.45  3.91  6.5   10.6
4 speedIndex mobile  2022-09-01  2.92  4.04  5.88  8.86  13.3
5 speedIndex desktop 2022-08-01  1.62  2.48  3.96  6.56  10.6
6 speedIndex mobile  2022-08-01  2.98  4.16  6.12  9.37  14.2

combined_percentiles %>% count(measure)

# A tibble: 3 × 2
  measure        n
  <chr>      <int>
1 a11yScores    90
2 bytesTotal   457
3 speedIndex   238

combined_percentiles %>% 
  ggplot(aes(date, p50, fill=client, color=client)) +
  geom_line() +
  geom_ribbon(aes(ymin=p25, ymax=p75), alpha=0.2) +
  facet_wrap(~ measure, scales="free") +
  labs(
      title="Speed, Bytes and Accessibility",
      subtitle="25th and 75th percentile by Client",
      y="",
    ) +
  theme(plot.title.position = 'plot')

bind_rows(image_alt, color_contrast) %>% 
  ggplot(aes(percent)) +
  geom_density(aes(fill=client), alpha=0.3, color="white") +
  facet_wrap(~ measure, scales="free") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title="**Accessibility: Alt Text and Color Contrast**",
    subtitle="*Percentage by client*",
    y="",
    ) +
  ggExtra::removeGrid() +
  theme(
    plot.title.position = 'plot',
    plot.title = ggtext::element_markdown(),
    plot.subtitle = ggtext::element_markdown(),
  )
