Tidyverse Fundamentals: {dplyr}

Setup

A tibble

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:

  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).

There are three other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:

  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

First basic {dplyr} verbs

Filter

filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

For simplicity, we can use the %>% (pipe) operator:

flights %>% filter(month == 11 | month == 12)
# A tibble: 55,403 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    11     1        5       2359       6     352     345       7 B6     
 2  2013    11     1       35       2250     105     123    2356      87 B6     
 3  2013    11     1      455        500      -5     641     651     -10 US     
 4  2013    11     1      539        545      -6     856     827      29 UA     
 5  2013    11     1      542        545      -3     831     855     -24 AA     
 6  2013    11     1      549        600     -11     912     923     -11 UA     
 7  2013    11     1      550        600     -10     705     659       6 US     
 8  2013    11     1      554        600      -6     659     701      -2 US     
 9  2013    11     1      554        600      -6     826     827      -1 DL     
10  2013    11     1      554        600      -6     749     751      -2 DL     
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% filter(month %in% c(11, 12))
# A tibble: 55,403 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    11     1        5       2359       6     352     345       7 B6     
 2  2013    11     1       35       2250     105     123    2356      87 B6     
 3  2013    11     1      455        500      -5     641     651     -10 US     
 4  2013    11     1      539        545      -6     856     827      29 UA     
 5  2013    11     1      542        545      -3     831     855     -24 AA     
 6  2013    11     1      549        600     -11     912     923     -11 UA     
 7  2013    11     1      550        600     -10     705     659       6 US     
 8  2013    11     1      554        600      -6     659     701      -2 US     
 9  2013    11     1      554        600      -6     826     827      -1 DL     
10  2013    11     1      554        600      -6     749     751      -2 DL     
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% filter(!is.na(dep_time))
# A tibble: 328,521 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 328,511 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

To remove all missing values, we can use the following:

flights %>% tidyr::drop_na()
# A tibble: 327,346 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 327,336 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Arrange

flights %>% arrange(year, month, day)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     9      641        900    1301    1242    1530    1272 HA     
 2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
 3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
 4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
 5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
 6  2013     4    10     1100       1900     960    1342    2211     931 DL     
 7  2013     3    17     2321        810     911     135    1020     915 DL     
 8  2013     6    27      959       1900     899    1236    2226     850 DL     
 9  2013     7    22     2257        759     898     121    1026     895 DL     
10  2013    12     5      756       1700     896    1058    2020     878 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Note that missing values are always sorted at the end.

Select

flights %>% select(year, month, day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows

We can specify a range of columns like this:

flights %>% select(year:day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows
flights %>% select(1:3)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows

Or exclude a range in a similar way:

flights %>% select(-(year:day))
# A tibble: 336,776 × 16
   dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
     <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
 1     517     515       2     830     819      11 UA        1545 N14228  EWR   
 2     533     529       4     850     830      20 UA        1714 N24211  LGA   
 3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
 4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
 5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
 6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
 7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
 8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
 9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
# … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time,
#   ⁵​sched_arr_time, ⁶​arr_delay
flights %>% select(-(year:day), -(5:7))
# A tibble: 336,776 × 13
   dep_time sched_…¹ arr_d…² carrier flight tailnum origin dest  air_t…³ dista…⁴
      <int>    <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>   <dbl>   <dbl>
 1      517      819      11 UA        1545 N14228  EWR    IAH       227    1400
 2      533      830      20 UA        1714 N24211  LGA    IAH       227    1416
 3      542      850      33 AA        1141 N619AA  JFK    MIA       160    1089
 4      544     1022     -18 B6         725 N804JB  JFK    BQN       183    1576
 5      554      837     -25 DL         461 N668DN  LGA    ATL       116     762
 6      554      728      12 UA        1696 N39463  EWR    ORD       150     719
 7      555      854      19 B6         507 N516JB  EWR    FLL       158    1065
 8      557      723     -14 EV        5708 N829AS  LGA    IAD        53     229
 9      557      846      -8 B6          79 N593JB  JFK    MCO       140     944
10      558      745       8 AA         301 N3ALAA  LGA    ORD       138     733
# … with 336,766 more rows, 3 more variables: hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_arr_time,
#   ²​arr_delay, ³​air_time, ⁴​distance

We can use a plethora of verbs inside select:

starts_with("abc"): matches names that begin with “abc”. ends_with("xyz"): matches names that end with “xyz”. contains("ijk"): matches names that contain “ijk”. matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings. num_range("x", 1:3): matches x1, x2 and x3.

flights %>% select(starts_with("dep"))
# A tibble: 336,776 × 2
   dep_time dep_delay
      <int>     <dbl>
 1      517         2
 2      533         4
 3      542         2
 4      544        -1
 5      554        -6
 6      554        -4
 7      555        -5
 8      557        -3
 9      557        -3
10      558        -2
# … with 336,766 more rows
flights %>% select(ends_with("delay"))
# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
flights %>% select(contains("delay"))
# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
flights %>% select(contains(c("delay", "dep", "arr")))
# A tibble: 336,776 × 7
   dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time carrier
       <dbl>     <dbl>    <int>          <int>    <int>          <int> <chr>  
 1         2        11      517            515      830            819 UA     
 2         4        20      533            529      850            830 UA     
 3         2        33      542            540      923            850 AA     
 4        -1       -18      544            545     1004           1022 B6     
 5        -6       -25      554            600      812            837 DL     
 6        -4        12      554            558      740            728 UA     
 7        -5        19      555            600      913            854 B6     
 8        -3       -14      557            600      709            723 EV     
 9        -3        -8      557            600      838            846 B6     
10        -2         8      558            600      753            745 AA     
# … with 336,766 more rows

But we can also use the where() verb:

flights %>% select(where(is.numeric))
# A tibble: 336,776 × 14
    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl>  <int>
 1  2013     1     1      517         515       2     830     819      11   1545
 2  2013     1     1      533         529       4     850     830      20   1714
 3  2013     1     1      542         540       2     923     850      33   1141
 4  2013     1     1      544         545      -1    1004    1022     -18    725
 5  2013     1     1      554         600      -6     812     837     -25    461
 6  2013     1     1      554         558      -4     740     728      12   1696
 7  2013     1     1      555         600      -5     913     854      19    507
 8  2013     1     1      557         600      -3     709     723     -14   5708
 9  2013     1     1      557         600      -3     838     846      -8     79
10  2013     1     1      558         600      -2     753     745       8    301
# … with 336,766 more rows, 4 more variables: air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We can use this to rearrange the columns in the data:

flights %>% select(where(is.numeric), where(is.character))
# A tibble: 336,776 × 18
    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl>  <int>
 1  2013     1     1      517         515       2     830     819      11   1545
 2  2013     1     1      533         529       4     850     830      20   1714
 3  2013     1     1      542         540       2     923     850      33   1141
 4  2013     1     1      544         545      -1    1004    1022     -18    725
 5  2013     1     1      554         600      -6     812     837     -25    461
 6  2013     1     1      554         558      -4     740     728      12   1696
 7  2013     1     1      555         600      -5     913     854      19    507
 8  2013     1     1      557         600      -3     709     723     -14   5708
 9  2013     1     1      557         600      -3     838     846      -8     79
10  2013     1     1      558         600      -2     753     745       8    301
# … with 336,766 more rows, 8 more variables: air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, carrier <chr>, tailnum <chr>, origin <chr>,
#   dest <chr>, and abbreviated variable names ¹​sched_dep_time, ²​dep_delay,
#   ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Or we can use the everything() function to denote all other columns.

flights %>% select(where(is.numeric), everything())
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl>  <int>
 1  2013     1     1      517         515       2     830     819      11   1545
 2  2013     1     1      533         529       4     850     830      20   1714
 3  2013     1     1      542         540       2     923     850      33   1141
 4  2013     1     1      544         545      -1    1004    1022     -18    725
 5  2013     1     1      554         600      -6     812     837     -25    461
 6  2013     1     1      554         558      -4     740     728      12   1696
 7  2013     1     1      555         600      -5     913     854      19    507
 8  2013     1     1      557         600      -3     709     723     -14   5708
 9  2013     1     1      557         600      -3     838     846      -8     79
10  2013     1     1      558         600      -2     753     745       8    301
# … with 336,766 more rows, 9 more variables: air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, carrier <chr>, tailnum <chr>, origin <chr>,
#   dest <chr>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

The where clause allows us to be more expressive:

flights %>% select(where(~ is.numeric(.x)))
# A tibble: 336,776 × 14
    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl>  <int>
 1  2013     1     1      517         515       2     830     819      11   1545
 2  2013     1     1      533         529       4     850     830      20   1714
 3  2013     1     1      542         540       2     923     850      33   1141
 4  2013     1     1      544         545      -1    1004    1022     -18    725
 5  2013     1     1      554         600      -6     812     837     -25    461
 6  2013     1     1      554         558      -4     740     728      12   1696
 7  2013     1     1      555         600      -5     913     854      19    507
 8  2013     1     1      557         600      -3     709     723     -14   5708
 9  2013     1     1      557         600      -3     838     846      -8     79
10  2013     1     1      558         600      -2     753     745       8    301
# … with 336,766 more rows, 4 more variables: air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Reordering and renaming columns

For reordering columns, however, there is a dedicated function:

flights %>% relocate(where(is.character), .before = 1)
# A tibble: 336,776 × 19
   carrier tailnum origin dest   year month   day dep_time sched_dep_t…¹ dep_d…²
   <chr>   <chr>   <chr>  <chr> <int> <int> <int>    <int>         <int>   <dbl>
 1 UA      N14228  EWR    IAH    2013     1     1      517           515       2
 2 UA      N24211  LGA    IAH    2013     1     1      533           529       4
 3 AA      N619AA  JFK    MIA    2013     1     1      542           540       2
 4 B6      N804JB  JFK    BQN    2013     1     1      544           545      -1
 5 DL      N668DN  LGA    ATL    2013     1     1      554           600      -6
 6 UA      N39463  EWR    ORD    2013     1     1      554           558      -4
 7 B6      N516JB  EWR    FLL    2013     1     1      555           600      -5
 8 EV      N829AS  LGA    IAD    2013     1     1      557           600      -3
 9 B6      N593JB  JFK    MCO    2013     1     1      557           600      -3
10 AA      N3ALAA  LGA    ORD    2013     1     1      558           600      -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​sched_dep_time, ²​dep_delay
flights %>% relocate(where(is.character), .before = year)
# A tibble: 336,776 × 19
   carrier tailnum origin dest   year month   day dep_time sched_dep_t…¹ dep_d…²
   <chr>   <chr>   <chr>  <chr> <int> <int> <int>    <int>         <int>   <dbl>
 1 UA      N14228  EWR    IAH    2013     1     1      517           515       2
 2 UA      N24211  LGA    IAH    2013     1     1      533           529       4
 3 AA      N619AA  JFK    MIA    2013     1     1      542           540       2
 4 B6      N804JB  JFK    BQN    2013     1     1      544           545      -1
 5 DL      N668DN  LGA    ATL    2013     1     1      554           600      -6
 6 UA      N39463  EWR    ORD    2013     1     1      554           558      -4
 7 B6      N516JB  EWR    FLL    2013     1     1      555           600      -5
 8 EV      N829AS  LGA    IAD    2013     1     1      557           600      -3
 9 B6      N593JB  JFK    MCO    2013     1     1      557           600      -3
10 AA      N3ALAA  LGA    ORD    2013     1     1      558           600      -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​sched_dep_time, ²​dep_delay
flights %>% relocate(where(is.character), .after = 3)
# A tibble: 336,776 × 19
    year month   day carrier tailnum origin dest  dep_time sched_dep_t…¹ dep_d…²
   <int> <int> <int> <chr>   <chr>   <chr>  <chr>    <int>         <int>   <dbl>
 1  2013     1     1 UA      N14228  EWR    IAH        517           515       2
 2  2013     1     1 UA      N24211  LGA    IAH        533           529       4
 3  2013     1     1 AA      N619AA  JFK    MIA        542           540       2
 4  2013     1     1 B6      N804JB  JFK    BQN        544           545      -1
 5  2013     1     1 DL      N668DN  LGA    ATL        554           600      -6
 6  2013     1     1 UA      N39463  EWR    ORD        554           558      -4
 7  2013     1     1 B6      N516JB  EWR    FLL        555           600      -5
 8  2013     1     1 EV      N829AS  LGA    IAD        557           600      -3
 9  2013     1     1 B6      N593JB  JFK    MCO        557           600      -3
10  2013     1     1 AA      N3ALAA  LGA    ORD        558           600      -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​sched_dep_time, ²​dep_delay
flights %>% relocate(where(is.character), .after = day)
# A tibble: 336,776 × 19
    year month   day carrier tailnum origin dest  dep_time sched_dep_t…¹ dep_d…²
   <int> <int> <int> <chr>   <chr>   <chr>  <chr>    <int>         <int>   <dbl>
 1  2013     1     1 UA      N14228  EWR    IAH        517           515       2
 2  2013     1     1 UA      N24211  LGA    IAH        533           529       4
 3  2013     1     1 AA      N619AA  JFK    MIA        542           540       2
 4  2013     1     1 B6      N804JB  JFK    BQN        544           545      -1
 5  2013     1     1 DL      N668DN  LGA    ATL        554           600      -6
 6  2013     1     1 UA      N39463  EWR    ORD        554           558      -4
 7  2013     1     1 B6      N516JB  EWR    FLL        555           600      -5
 8  2013     1     1 EV      N829AS  LGA    IAD        557           600      -3
 9  2013     1     1 B6      N593JB  JFK    MCO        557           600      -3
10  2013     1     1 AA      N3ALAA  LGA    ORD        558           600      -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​sched_dep_time, ²​dep_delay

We can also use the rename() function to change the function names:

flights %>% rename(tail_num = tailnum)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tail_num <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We could technically use select() for renamig as well:

flights %>% select(departure_delay = dep_delay)
# A tibble: 336,776 × 1
   departure_delay
             <dbl>
 1               2
 2               4
 3               2
 4              -1
 5              -6
 6              -4
 7              -5
 8              -3
 9              -3
10              -2
# … with 336,766 more rows

But keep in mind that select() only returns the columns passed as arguments, while rename() returns all.

Mutate

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time  gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
 1  2013     1     1         2        11     1400      227    -9  370.
 2  2013     1     1         4        20     1416      227   -16  374.
 3  2013     1     1         2        33     1089      160   -31  408.
 4  2013     1     1        -1       -18     1576      183    17  517.
 5  2013     1     1        -6       -25      762      116    19  394.
 6  2013     1     1        -4        12      719      150   -16  288.
 7  2013     1     1        -5        19     1065      158   -24  404.
 8  2013     1     1        -3       -14      229       53    11  259.
 9  2013     1     1        -3        -8      944      140     5  405.
10  2013     1     1        -2         8      733      138   -10  319.
# … with 336,766 more rows

We can use the transmute verb to keep the transformed columns only:

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
# A tibble: 336,776 × 3
    gain hours gain_per_hour
   <dbl> <dbl>         <dbl>
 1    -9 3.78          -2.38
 2   -16 3.78          -4.23
 3   -31 2.67         -11.6 
 4    17 3.05           5.57
 5    19 1.93           9.83
 6   -16 2.5           -6.4 
 7   -24 2.63          -9.11
 8    11 0.883         12.5 
 9     5 2.33           2.14
10   -10 2.3           -4.35
# … with 336,766 more rows

Group-bys and summaries

We can aggregate the data to get an overview with the summarise verb:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6

But it becomes more useful when combined with group_by:

flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# … with 355 more rows

Counts

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
   group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  ) %>%
  ggplot(mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

Some planes that have an average delay of 5 hours (300 minutes)!

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
   group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  ggplot(mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10) + 
  coord_flip()

Why does this happen?

When doing these kind of visualisations, it is always better to filter the smallest numbers:

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
   group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n > 25) %>% # <-- NOTE THIS
  ggplot(mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10) +
  coord_flip()

Appendix: Column-wise operations

starwars %>% 
  summarise(
    species = n_distinct(species),
    vehicles = n_distinct(vehicles),
    starships = n_distinct(starships)
  )
# A tibble: 1 × 3
  species vehicles starships
    <int>    <int>     <int>
1      38       11        17

We could use a much more concise syntax:

starwars %>%
  summarise(across(where(is.character), n_distinct))
# A tibble: 1 × 8
   name hair_color skin_color eye_color   sex gender homeworld species
  <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
1    87         13         31        15     5      3        49      38

And we can chain multiple summaries:

starwars %>% summarise(
  across(where(is.numeric), mean), # but maybe we want to change this
  across(where(is.character), n_distinct),
  across(where(is.list), n_distinct),
  )
# A tibble: 1 × 14
  height  mass birth_year  name hair_color skin_c…¹ eye_c…²   sex gender homew…³
   <dbl> <dbl>      <dbl> <int>      <int>    <int>   <int> <int>  <int>   <int>
1     NA    NA         NA    87         13       31      15     5      3      49
# … with 4 more variables: species <int>, films <int>, vehicles <int>,
#   starships <int>, and abbreviated variable names ¹​skin_color, ²​eye_color,
#   ³​homeworld

In this way, however, we will still get the undesirable default behaviour of mean() (which returns NA if there are NA values in the sample). To pass an argument to a function inside across(), we can use the ~ formula notation.

starwars %>% summarise(
  across(where(is.numeric), ~ min(.x, na.rm = TRUE)),
  across(where(is.character), n_distinct),
  across(where(is.list), n_distinct),
  )
# A tibble: 1 × 14
  height  mass birth_year  name hair_color skin_c…¹ eye_c…²   sex gender homew…³
   <int> <dbl>      <dbl> <int>      <int>    <int>   <int> <int>  <int>   <int>
1     66    15          8    87         13       31      15     5      3      49
# … with 4 more variables: species <int>, films <int>, vehicles <int>,
#   starships <int>, and abbreviated variable names ¹​skin_color, ²​eye_color,
#   ³​homeworld

Here, .x denotes a placeholder for the data. These pipelines are usually more useful after a groupby operation:

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
# A tibble: 10 × 4
   homeworld height  mass birth_year
   <chr>      <dbl> <dbl>      <dbl>
 1 Alderaan    176.  64         43  
 2 Corellia    175   78.5       25  
 3 Coruscant   174.  50         91  
 4 Kamino      208.  83.1       31.5
 5 Kashyyyk    231  124        200  
 6 Mirial      168   53.1       49  
 7 Naboo       175.  64.2       55  
 8 Ryloth      179   55         48  
 9 Tatooine    170.  85.4       54.6
10 <NA>        139.  82        334. 

We can also pass a list of functions:

starwars %>% summarise(
  across(
    where(is.numeric),
    list(
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
      )
    ),
  )
# A tibble: 1 × 6
  height_min height_max mass_min mass_max birth_year_min birth_year_max
       <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
1         66        264       15     1358              8            896

And add custom naming conventions with {glue}:

starwars %>% summarise(
  across(
    where(is.numeric),
    list(
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
      ),
    .names = "{.fn}_{.col}" # or "{.col}_{.fn}"
    ),
  )
# A tibble: 1 × 6
  min_height max_height min_mass max_mass min_birth_year max_birth_year
       <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
1         66        264       15     1358              8            896

Or relocate the columns afterwards:

starwars %>% summarise(
  across(
    where(is.numeric),
    .fns = list(
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
    ),
    .names = "{.fn}_{.col}"
  ),
) %>% 
  relocate(starts_with("min"), .before = 1)
# A tibble: 1 × 6
  min_height min_mass min_birth_year max_height max_mass max_birth_year
       <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
1         66       15              8        264     1358            896

Of course, the list of functions can be stored as a variable outside the pipeline.

functions_list <- list(
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
    )

starwars %>% summarise(
  across(
    where(is.numeric),
    .fns = functions_list,
    .names = "{.fn}_{.col}"
  ),
) %>% 
  relocate(starts_with("min"), .before = 1)
# A tibble: 1 × 6
  min_height min_mass min_birth_year max_height max_mass max_birth_year
       <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
1         66       15              8        264     1358            896

As a side note, count() is a wrapper around . %>% groupby %>% summarise(.x = count(.x))

flights %>% 
  count(year, month, day, origin)
# A tibble: 1,095 × 5
    year month   day origin     n
   <int> <int> <int> <chr>  <int>
 1  2013     1     1 EWR      305
 2  2013     1     1 JFK      297
 3  2013     1     1 LGA      240
 4  2013     1     2 EWR      350
 5  2013     1     2 JFK      321
 6  2013     1     2 LGA      272
 7  2013     1     3 EWR      336
 8  2013     1     3 JFK      318
 9  2013     1     3 LGA      260
10  2013     1     4 EWR      339
# … with 1,085 more rows

Appendix: A {tidyr} detour

Some of the more common data operations require dropping, imputing or replacing missing values. Some other times you might want to separate or unite two columns (e.g. extract the prefix from a phone number), or expand a dataframe to include all possible combinations of values. The functions to perform these operations are built in {tidyr}, which is shipped by default in the {tidyverse}.

Extracting from columns

The functions extract() separate() and unite() are in charge of this behaviour. By default, they split using any non-alphanumeric value:

tibble(x = c(NA, "x.y", "x.z", "y.z")) %>% 
  extract(x, "A")
# A tibble: 4 × 1
  A    
  <chr>
1 <NA> 
2 x    
3 x    
4 y    
tibble(x = c(NA, "x.y", "x.z", "y.z")) %>% 
  extract(x, c("A", "B"), "([a-z]+).([a-z]+)")
# A tibble: 4 × 2
  A     B    
  <chr> <chr>
1 <NA>  <NA> 
2 x     y    
3 x     z    
4 y     z    

These are useful, but require knowing a bit of regex patterns.

The separate() function is a nice wrapper around extract():

tibble(x = c(NA, "x.y", "x.z", "y.z")) %>% 
  separate(x, c("A", "B"))
# A tibble: 4 × 2
  A     B    
  <chr> <chr>
1 <NA>  <NA> 
2 x     y    
3 x     z    
4 y     z    

And has a lot of advanced use cases. separate_rows() creates a new row for each observation:

tbl <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
) 

tbl
# A tibble: 3 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     1    
2     2 d,e,f 2,3,4
3     3 g,h   5,6  
tbl %>% 
  separate_rows(y, z, convert = TRUE) #convert will perform type conversion
# A tibble: 6 × 3
      x y         z
  <int> <chr> <int>
1     1 a         1
2     2 d         2
3     2 e         3
4     2 f         4
5     3 g         5
6     3 h         6

Uniting column has some use cases:

flights %>% unite(date, year, month, day) # the first column is the new colname
# A tibble: 336,776 × 17
   date   dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum
   <chr>    <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
 1 2013_…     517     515       2     830     819      11 UA        1545 N14228 
 2 2013_…     533     529       4     850     830      20 UA        1714 N24211 
 3 2013_…     542     540       2     923     850      33 AA        1141 N619AA 
 4 2013_…     544     545      -1    1004    1022     -18 B6         725 N804JB 
 5 2013_…     554     600      -6     812     837     -25 DL         461 N668DN 
 6 2013_…     554     558      -4     740     728      12 UA        1696 N39463 
 7 2013_…     555     600      -5     913     854      19 B6         507 N516JB 
 8 2013_…     557     600      -3     709     723     -14 EV        5708 N829AS 
 9 2013_…     557     600      -3     838     846      -8 B6          79 N593JB 
10 2013_…     558     600      -2     753     745       8 AA         301 N3ALAA 
# … with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   and abbreviated variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay,
#   ⁴​arr_time, ⁵​sched_arr_time, ⁶​arr_delay

But when working with dates it might be better to use dedicated packages (more on this below):

flights %>% transmute(date = lubridate::make_date(year, month, day))
# A tibble: 336,776 × 1
   date      
   <date>    
 1 2013-01-01
 2 2013-01-01
 3 2013-01-01
 4 2013-01-01
 5 2013-01-01
 6 2013-01-01
 7 2013-01-01
 8 2013-01-01
 9 2013-01-01
10 2013-01-01
# … with 336,766 more rows
flights %>% transmute(date = clock::date_build(year, month, day))
# A tibble: 336,776 × 1
   date      
   <date>    
 1 2013-01-01
 2 2013-01-01
 3 2013-01-01
 4 2013-01-01
 5 2013-01-01
 6 2013-01-01
 7 2013-01-01
 8 2013-01-01
 9 2013-01-01
10 2013-01-01
# … with 336,766 more rows

This is an advanced case of completing sequences with datetime objects. The base case is:

full_seq(c(1, 10), 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Combine values

fruits <- tibble(
  type   = c("apple", "mandarin", "apple", "mandarin", "mandarin", "mandarin"),
  color = c("red", "orange", "red", "orange", "orange", "orange"),
  year   = c(2010, 2010, 2012, 2010, 2011, 2012),
  size  =  factor(
    c("XS", "S",  "M", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  ),
  weights = rnorm(6, as.numeric(size) + 2)
)

fruits
# A tibble: 6 × 5
  type     color   year size  weights
  <chr>    <chr>  <dbl> <fct>   <dbl>
1 apple    red     2010 XS       2.25
2 mandarin orange  2010 S        5.12
3 apple    red     2012 M        4.71
4 mandarin orange  2010 S        2.17
5 mandarin orange  2011 S        4.64
6 mandarin orange  2012 M        5.43
fruits %>% select(type, size)
# A tibble: 6 × 2
  type     size 
  <chr>    <fct>
1 apple    XS   
2 mandarin S    
3 apple    M    
4 mandarin S    
5 mandarin S    
6 mandarin M    

Note that the levels of size are XS, S, M, L and that L does not appear in the data.

If we expand the type and size columns, all combinations of all levels will appear.

fruits %>% expand(type, size)
# A tibble: 8 × 2
  type     size 
  <chr>    <fct>
1 apple    XS   
2 apple    S    
3 apple    M    
4 apple    L    
5 mandarin XS   
6 mandarin S    
7 mandarin M    
8 mandarin L    

We use nesting() to perform the expansion only with the values that appear (i.e., the missing levels won’t be used!):

fruits %>% expand(type, nesting(size))
# A tibble: 6 × 2
  type     size 
  <chr>    <fct>
1 apple    XS   
2 apple    S    
3 apple    M    
4 mandarin XS   
5 mandarin S    
6 mandarin M    

Expand and complete grids

The {clock} and {lubridate} packages have functions to create datetime ranges:

start = as.Date("2010-01-01")
end = as.Date("2010-01-31")
  
clock::date_seq(from = start, to = end, by=1)
 [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
 [6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
[26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
[31] "2010-01-31"
clock::date_seq(from = start, by = 1, total_size = 31)
 [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
 [6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
[26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
[31] "2010-01-31"

We can use these to create grids:

expand_grid(
  date = clock::date_seq(from = start, by = 1, total_size = 31),
  from = c("Duomo", "Centrale FS", "Garibaldi FS"),
)
# A tibble: 93 × 2
   date       from        
   <date>     <chr>       
 1 2010-01-01 Duomo       
 2 2010-01-01 Centrale FS 
 3 2010-01-01 Garibaldi FS
 4 2010-01-02 Duomo       
 5 2010-01-02 Centrale FS 
 6 2010-01-02 Garibaldi FS
 7 2010-01-03 Duomo       
 8 2010-01-03 Centrale FS 
 9 2010-01-03 Garibaldi FS
10 2010-01-04 Duomo       
# … with 83 more rows

The functions above are useful, but to use them with a dataset we need to use some functions to join and possibly replace missing values. Fortunately for us, the complete is a wrapper around expand and other join functions to simplify the interface:

fruits %>% complete(type, size, year)
# A tibble: 25 × 5
   type  size   year color weights
   <chr> <fct> <dbl> <chr>   <dbl>
 1 apple XS     2010 red      2.25
 2 apple XS     2011 <NA>    NA   
 3 apple XS     2012 <NA>    NA   
 4 apple S      2010 <NA>    NA   
 5 apple S      2011 <NA>    NA   
 6 apple S      2012 <NA>    NA   
 7 apple M      2010 <NA>    NA   
 8 apple M      2011 <NA>    NA   
 9 apple M      2012 red      4.71
10 apple L      2010 <NA>    NA   
# … with 15 more rows

But we can also use a group-by, which will return the full table (with missing values).

fruits %>% group_by(type) %>% complete(size, year)
# A tibble: 21 × 5
# Groups:   type [2]
   type     size   year color weights
   <chr>    <fct> <dbl> <chr>   <dbl>
 1 apple    XS     2010 red      2.25
 2 apple    XS     2012 <NA>    NA   
 3 apple    S      2010 <NA>    NA   
 4 apple    S      2012 <NA>    NA   
 5 apple    M      2010 <NA>    NA   
 6 apple    M      2012 red      4.71
 7 apple    L      2010 <NA>    NA   
 8 apple    L      2012 <NA>    NA   
 9 mandarin XS     2010 <NA>    NA   
10 mandarin XS     2011 <NA>    NA   
# … with 11 more rows

To fill the NAs, we can use an ifelse or case_when statement for qualitative/categorical variables:

fruits %>%
  group_by(type) %>%
  complete(size, year) %>%
  mutate(
    color = ifelse(type == "apple", "red", "orange"),
  )
# A tibble: 21 × 5
# Groups:   type [2]
   type     size   year color  weights
   <chr>    <fct> <dbl> <chr>    <dbl>
 1 apple    XS     2010 red       2.25
 2 apple    XS     2012 red      NA   
 3 apple    S      2010 red      NA   
 4 apple    S      2012 red      NA   
 5 apple    M      2010 red      NA   
 6 apple    M      2012 red       4.71
 7 apple    L      2010 red      NA   
 8 apple    L      2012 red      NA   
 9 mandarin XS     2010 orange   NA   
10 mandarin XS     2011 orange   NA   
# … with 11 more rows
fruits %>%
  group_by(type) %>%
  complete(size, year) %>%
  arrange(type, year) %>% 
  mutate(
    color = case_when(
      type == "apple" ~ "red",
      TRUE ~ "orange" # this denotes "every other case"
    )
  )
# A tibble: 21 × 5
# Groups:   type [2]
   type     size   year color  weights
   <chr>    <fct> <dbl> <chr>    <dbl>
 1 apple    XS     2010 red       2.25
 2 apple    S      2010 red      NA   
 3 apple    M      2010 red      NA   
 4 apple    L      2010 red      NA   
 5 apple    XS     2012 red      NA   
 6 apple    S      2012 red      NA   
 7 apple    M      2012 red       4.71
 8 apple    L      2012 red      NA   
 9 mandarin XS     2010 orange   NA   
10 mandarin S      2010 orange    5.12
# … with 11 more rows

The case_when statement is longer, but is useful when there are multiple options. Note that the mutate verb was used and replaced the values that were already present. It would be more optimal to use other functions, such as replace_na and fill. In case of quantitative variables, perhaps a case statement matched with a custom function can help.

fruits %>%
  group_by(type) %>%
  complete(size, year) %>%
  arrange(type, year) %>% 
  fill(color, .direction = "updown") # "downup" is equivalent in this case
# A tibble: 21 × 5
# Groups:   type [2]
   type     size   year color  weights
   <chr>    <fct> <dbl> <chr>    <dbl>
 1 apple    XS     2010 red       2.25
 2 apple    S      2010 red      NA   
 3 apple    M      2010 red      NA   
 4 apple    L      2010 red      NA   
 5 apple    XS     2012 red      NA   
 6 apple    S      2012 red      NA   
 7 apple    M      2012 red       4.71
 8 apple    L      2012 red      NA   
 9 mandarin XS     2010 orange   NA   
10 mandarin S      2010 orange    5.12
# … with 11 more rows

Advanced topics

  • How to work with factors using {forcats} and with dates using {lubridate} or {clock}
  • Explore window functions in {dplyr}
  • Functional programming with {purr}