A tibble
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
There are three other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:
lgl stands for logical, vectors that contain only TRUE or FALSE.
fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.
First basic {dplyr} verbs
Filter
filter (flights, month == 1 , day == 1 )
# A tibble: 842 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
For simplicity, we can use the %>% (pipe) operator:
flights %>% filter (month == 11 | month == 12 )
# A tibble: 55,403 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 11 1 5 2359 6 352 345 7 B6
2 2013 11 1 35 2250 105 123 2356 87 B6
3 2013 11 1 455 500 -5 641 651 -10 US
4 2013 11 1 539 545 -6 856 827 29 UA
5 2013 11 1 542 545 -3 831 855 -24 AA
6 2013 11 1 549 600 -11 912 923 -11 UA
7 2013 11 1 550 600 -10 705 659 6 US
8 2013 11 1 554 600 -6 659 701 -2 US
9 2013 11 1 554 600 -6 826 827 -1 DL
10 2013 11 1 554 600 -6 749 751 -2 DL
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>% filter (month %in% c (11 , 12 ))
# A tibble: 55,403 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 11 1 5 2359 6 352 345 7 B6
2 2013 11 1 35 2250 105 123 2356 87 B6
3 2013 11 1 455 500 -5 641 651 -10 US
4 2013 11 1 539 545 -6 856 827 29 UA
5 2013 11 1 542 545 -3 831 855 -24 AA
6 2013 11 1 549 600 -11 912 923 -11 UA
7 2013 11 1 550 600 -10 705 659 6 US
8 2013 11 1 554 600 -6 659 701 -2 US
9 2013 11 1 554 600 -6 826 827 -1 DL
10 2013 11 1 554 600 -6 749 751 -2 DL
# … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>% filter (! is.na (dep_time))
# A tibble: 328,521 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 328,511 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
To remove all missing values, we can use the following:
flights %>% tidyr:: drop_na ()
# A tibble: 327,346 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 327,336 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Arrange
flights %>% arrange (year, month, day)
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>% arrange (desc (dep_delay))
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 9 641 900 1301 1242 1530 1272 HA
2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
5 2013 7 22 845 1600 1005 1044 1815 989 MQ
6 2013 4 10 1100 1900 960 1342 2211 931 DL
7 2013 3 17 2321 810 911 135 1020 915 DL
8 2013 6 27 959 1900 899 1236 2226 850 DL
9 2013 7 22 2257 759 898 121 1026 895 DL
10 2013 12 5 756 1700 896 1058 2020 878 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Note that missing values are always sorted at the end.
Select
flights %>% select (year, month, day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
We can specify a range of columns like this:
flights %>% select (year: day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
Or exclude a range in a similar way:
flights %>% select (- (year: day))
# A tibble: 336,776 × 16
dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
<int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
1 517 515 2 830 819 11 UA 1545 N14228 EWR
2 533 529 4 850 830 20 UA 1714 N24211 LGA
3 542 540 2 923 850 33 AA 1141 N619AA JFK
4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
5 554 600 -6 812 837 -25 DL 461 N668DN LGA
6 554 558 -4 740 728 12 UA 1696 N39463 EWR
7 555 600 -5 913 854 19 B6 507 N516JB EWR
8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
9 557 600 -3 838 846 -8 B6 79 N593JB JFK
10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
# … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
# ⁵sched_arr_time, ⁶arr_delay
flights %>% select (- (year: day), - (5 : 7 ))
# A tibble: 336,776 × 13
dep_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴
<int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl>
1 517 819 11 UA 1545 N14228 EWR IAH 227 1400
2 533 830 20 UA 1714 N24211 LGA IAH 227 1416
3 542 850 33 AA 1141 N619AA JFK MIA 160 1089
4 544 1022 -18 B6 725 N804JB JFK BQN 183 1576
5 554 837 -25 DL 461 N668DN LGA ATL 116 762
6 554 728 12 UA 1696 N39463 EWR ORD 150 719
7 555 854 19 B6 507 N516JB EWR FLL 158 1065
8 557 723 -14 EV 5708 N829AS LGA IAD 53 229
9 557 846 -8 B6 79 N593JB JFK MCO 140 944
10 558 745 8 AA 301 N3ALAA LGA ORD 138 733
# … with 336,766 more rows, 3 more variables: hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_arr_time,
# ²arr_delay, ³air_time, ⁴distance
We can use a plethora of verbs inside select:
starts_with("abc"): matches names that begin with “abc”. ends_with("xyz"): matches names that end with “xyz”. contains("ijk"): matches names that contain “ijk”. matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings. num_range("x", 1:3): matches x1, x2 and x3.
flights %>% select (starts_with ("dep" ))
# A tibble: 336,776 × 2
dep_time dep_delay
<int> <dbl>
1 517 2
2 533 4
3 542 2
4 544 -1
5 554 -6
6 554 -4
7 555 -5
8 557 -3
9 557 -3
10 558 -2
# … with 336,766 more rows
flights %>% select (ends_with ("delay" ))
# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# … with 336,766 more rows
flights %>% select (contains ("delay" ))
# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# … with 336,766 more rows
flights %>% select (contains (c ("delay" , "dep" , "arr" )))
# A tibble: 336,776 × 7
dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time carrier
<dbl> <dbl> <int> <int> <int> <int> <chr>
1 2 11 517 515 830 819 UA
2 4 20 533 529 850 830 UA
3 2 33 542 540 923 850 AA
4 -1 -18 544 545 1004 1022 B6
5 -6 -25 554 600 812 837 DL
6 -4 12 554 558 740 728 UA
7 -5 19 555 600 913 854 B6
8 -3 -14 557 600 709 723 EV
9 -3 -8 557 600 838 846 B6
10 -2 8 558 600 753 745 AA
# … with 336,766 more rows
But we can also use the where() verb:
flights %>% select (where (is.numeric))
# A tibble: 336,776 × 14
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830 819 11 1545
2 2013 1 1 533 529 4 850 830 20 1714
3 2013 1 1 542 540 2 923 850 33 1141
4 2013 1 1 544 545 -1 1004 1022 -18 725
5 2013 1 1 554 600 -6 812 837 -25 461
6 2013 1 1 554 558 -4 740 728 12 1696
7 2013 1 1 555 600 -5 913 854 19 507
8 2013 1 1 557 600 -3 709 723 -14 5708
9 2013 1 1 557 600 -3 838 846 -8 79
10 2013 1 1 558 600 -2 753 745 8 301
# … with 336,766 more rows, 4 more variables: air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We can use this to rearrange the columns in the data:
flights %>% select (where (is.numeric), where (is.character))
# A tibble: 336,776 × 18
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830 819 11 1545
2 2013 1 1 533 529 4 850 830 20 1714
3 2013 1 1 542 540 2 923 850 33 1141
4 2013 1 1 544 545 -1 1004 1022 -18 725
5 2013 1 1 554 600 -6 812 837 -25 461
6 2013 1 1 554 558 -4 740 728 12 1696
7 2013 1 1 555 600 -5 913 854 19 507
8 2013 1 1 557 600 -3 709 723 -14 5708
9 2013 1 1 557 600 -3 838 846 -8 79
10 2013 1 1 558 600 -2 753 745 8 301
# … with 336,766 more rows, 8 more variables: air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, carrier <chr>, tailnum <chr>, origin <chr>,
# dest <chr>, and abbreviated variable names ¹sched_dep_time, ²dep_delay,
# ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Or we can use the everything() function to denote all other columns.
flights %>% select (where (is.numeric), everything ())
# A tibble: 336,776 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830 819 11 1545
2 2013 1 1 533 529 4 850 830 20 1714
3 2013 1 1 542 540 2 923 850 33 1141
4 2013 1 1 544 545 -1 1004 1022 -18 725
5 2013 1 1 554 600 -6 812 837 -25 461
6 2013 1 1 554 558 -4 740 728 12 1696
7 2013 1 1 555 600 -5 913 854 19 507
8 2013 1 1 557 600 -3 709 723 -14 5708
9 2013 1 1 557 600 -3 838 846 -8 79
10 2013 1 1 558 600 -2 753 745 8 301
# … with 336,766 more rows, 9 more variables: air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, carrier <chr>, tailnum <chr>, origin <chr>,
# dest <chr>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The where clause allows us to be more expressive:
flights %>% select (where (~ is.numeric (.x)))
# A tibble: 336,776 × 14
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830 819 11 1545
2 2013 1 1 533 529 4 850 830 20 1714
3 2013 1 1 542 540 2 923 850 33 1141
4 2013 1 1 544 545 -1 1004 1022 -18 725
5 2013 1 1 554 600 -6 812 837 -25 461
6 2013 1 1 554 558 -4 740 728 12 1696
7 2013 1 1 555 600 -5 913 854 19 507
8 2013 1 1 557 600 -3 709 723 -14 5708
9 2013 1 1 557 600 -3 838 846 -8 79
10 2013 1 1 558 600 -2 753 745 8 301
# … with 336,766 more rows, 4 more variables: air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Reordering and renaming columns
For reordering columns, however, there is a dedicated function:
flights %>% relocate (where (is.character), .before = 1 )
# A tibble: 336,776 × 19
carrier tailnum origin dest year month day dep_time sched_dep_t…¹ dep_d…²
<chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <dbl>
1 UA N14228 EWR IAH 2013 1 1 517 515 2
2 UA N24211 LGA IAH 2013 1 1 533 529 4
3 AA N619AA JFK MIA 2013 1 1 542 540 2
4 B6 N804JB JFK BQN 2013 1 1 544 545 -1
5 DL N668DN LGA ATL 2013 1 1 554 600 -6
6 UA N39463 EWR ORD 2013 1 1 554 558 -4
7 B6 N516JB EWR FLL 2013 1 1 555 600 -5
8 EV N829AS LGA IAD 2013 1 1 557 600 -3
9 B6 N593JB JFK MCO 2013 1 1 557 600 -3
10 AA N3ALAA LGA ORD 2013 1 1 558 600 -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹sched_dep_time, ²dep_delay
flights %>% relocate (where (is.character), .before = year)
# A tibble: 336,776 × 19
carrier tailnum origin dest year month day dep_time sched_dep_t…¹ dep_d…²
<chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <dbl>
1 UA N14228 EWR IAH 2013 1 1 517 515 2
2 UA N24211 LGA IAH 2013 1 1 533 529 4
3 AA N619AA JFK MIA 2013 1 1 542 540 2
4 B6 N804JB JFK BQN 2013 1 1 544 545 -1
5 DL N668DN LGA ATL 2013 1 1 554 600 -6
6 UA N39463 EWR ORD 2013 1 1 554 558 -4
7 B6 N516JB EWR FLL 2013 1 1 555 600 -5
8 EV N829AS LGA IAD 2013 1 1 557 600 -3
9 B6 N593JB JFK MCO 2013 1 1 557 600 -3
10 AA N3ALAA LGA ORD 2013 1 1 558 600 -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹sched_dep_time, ²dep_delay
flights %>% relocate (where (is.character), .after = 3 )
# A tibble: 336,776 × 19
year month day carrier tailnum origin dest dep_time sched_dep_t…¹ dep_d…²
<int> <int> <int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1 2013 1 1 UA N14228 EWR IAH 517 515 2
2 2013 1 1 UA N24211 LGA IAH 533 529 4
3 2013 1 1 AA N619AA JFK MIA 542 540 2
4 2013 1 1 B6 N804JB JFK BQN 544 545 -1
5 2013 1 1 DL N668DN LGA ATL 554 600 -6
6 2013 1 1 UA N39463 EWR ORD 554 558 -4
7 2013 1 1 B6 N516JB EWR FLL 555 600 -5
8 2013 1 1 EV N829AS LGA IAD 557 600 -3
9 2013 1 1 B6 N593JB JFK MCO 557 600 -3
10 2013 1 1 AA N3ALAA LGA ORD 558 600 -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹sched_dep_time, ²dep_delay
flights %>% relocate (where (is.character), .after = day)
# A tibble: 336,776 × 19
year month day carrier tailnum origin dest dep_time sched_dep_t…¹ dep_d…²
<int> <int> <int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1 2013 1 1 UA N14228 EWR IAH 517 515 2
2 2013 1 1 UA N24211 LGA IAH 533 529 4
3 2013 1 1 AA N619AA JFK MIA 542 540 2
4 2013 1 1 B6 N804JB JFK BQN 544 545 -1
5 2013 1 1 DL N668DN LGA ATL 554 600 -6
6 2013 1 1 UA N39463 EWR ORD 554 558 -4
7 2013 1 1 B6 N516JB EWR FLL 555 600 -5
8 2013 1 1 EV N829AS LGA IAD 557 600 -3
9 2013 1 1 B6 N593JB JFK MCO 557 600 -3
10 2013 1 1 AA N3ALAA LGA ORD 558 600 -2
# … with 336,766 more rows, 9 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, flight <int>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹sched_dep_time, ²dep_delay
We can also use the rename() function to change the function names:
flights %>% rename (tail_num = tailnum)
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tail_num <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We could technically use select() for renamig as well:
flights %>% select (departure_delay = dep_delay)
# A tibble: 336,776 × 1
departure_delay
<dbl>
1 2
2 4
3 2
4 -1
5 -6
6 -4
7 -5
8 -3
9 -3
10 -2
# … with 336,766 more rows
But keep in mind that select() only returns the columns passed as arguments, while rename() returns all.
Mutate
flights_sml <- select (flights,
year: day,
ends_with ("delay" ),
distance,
air_time
)
mutate (flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
# A tibble: 336,776 × 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 -9 370.
2 2013 1 1 4 20 1416 227 -16 374.
3 2013 1 1 2 33 1089 160 -31 408.
4 2013 1 1 -1 -18 1576 183 17 517.
5 2013 1 1 -6 -25 762 116 19 394.
6 2013 1 1 -4 12 719 150 -16 288.
7 2013 1 1 -5 19 1065 158 -24 404.
8 2013 1 1 -3 -14 229 53 11 259.
9 2013 1 1 -3 -8 944 140 5 405.
10 2013 1 1 -2 8 733 138 -10 319.
# … with 336,766 more rows
We can use the transmute verb to keep the transformed columns only:
transmute (flights,
gain = dep_delay - arr_delay,
hours = air_time / 60 ,
gain_per_hour = gain / hours
)
# A tibble: 336,776 × 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 -9 3.78 -2.38
2 -16 3.78 -4.23
3 -31 2.67 -11.6
4 17 3.05 5.57
5 19 1.93 9.83
6 -16 2.5 -6.4
7 -24 2.63 -9.11
8 11 0.883 12.5
9 5 2.33 2.14
10 -10 2.3 -4.35
# … with 336,766 more rows
Group-bys and summaries
We can aggregate the data to get an overview with the summarise verb:
summarise (flights, delay = mean (dep_delay, na.rm = TRUE ))
# A tibble: 1 × 1
delay
<dbl>
1 12.6
But it becomes more useful when combined with group_by:
flights %>%
group_by (year, month, day) %>%
summarise (delay = mean (dep_delay, na.rm = TRUE ))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 × 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# … with 355 more rows
Counts
flights %>%
filter (! is.na (dep_delay), ! is.na (arr_delay)) %>%
group_by (tailnum) %>%
summarise (
delay = mean (arr_delay)
) %>%
ggplot (mapping = aes (x = delay)) +
geom_freqpoly (binwidth = 10 )
Some planes that have an average delay of 5 hours (300 minutes)!
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:
flights %>%
filter (! is.na (dep_delay), ! is.na (arr_delay)) %>%
group_by (tailnum) %>%
summarise (
delay = mean (arr_delay, na.rm = TRUE ),
n = n ()
) %>%
ggplot (mapping = aes (x = n, y = delay)) +
geom_point (alpha = 1 / 10 ) +
coord_flip ()
Why does this happen?
When doing these kind of visualisations, it is always better to filter the smallest numbers:
flights %>%
filter (! is.na (dep_delay), ! is.na (arr_delay)) %>%
group_by (tailnum) %>%
summarise (
delay = mean (arr_delay, na.rm = TRUE ),
n = n ()
) %>%
filter (n > 25 ) %>% # <-- NOTE THIS
ggplot (mapping = aes (x = n, y = delay)) +
geom_point (alpha = 1 / 10 ) +
coord_flip ()
Appendix: Column-wise operations
starwars %>%
summarise (
species = n_distinct (species),
vehicles = n_distinct (vehicles),
starships = n_distinct (starships)
)
# A tibble: 1 × 3
species vehicles starships
<int> <int> <int>
1 38 11 17
We could use a much more concise syntax:
starwars %>%
summarise (across (where (is.character), n_distinct))
# A tibble: 1 × 8
name hair_color skin_color eye_color sex gender homeworld species
<int> <int> <int> <int> <int> <int> <int> <int>
1 87 13 31 15 5 3 49 38
And we can chain multiple summaries:
starwars %>% summarise (
across (where (is.numeric), mean), # but maybe we want to change this
across (where (is.character), n_distinct),
across (where (is.list), n_distinct),
)
# A tibble: 1 × 14
height mass birth_year name hair_color skin_c…¹ eye_c…² sex gender homew…³
<dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 NA NA NA 87 13 31 15 5 3 49
# … with 4 more variables: species <int>, films <int>, vehicles <int>,
# starships <int>, and abbreviated variable names ¹skin_color, ²eye_color,
# ³homeworld
In this way, however, we will still get the undesirable default behaviour of mean() (which returns NA if there are NA values in the sample). To pass an argument to a function inside across(), we can use the ~ formula notation.
starwars %>% summarise (
across (where (is.numeric), ~ min (.x, na.rm = TRUE )),
across (where (is.character), n_distinct),
across (where (is.list), n_distinct),
)
# A tibble: 1 × 14
height mass birth_year name hair_color skin_c…¹ eye_c…² sex gender homew…³
<int> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 66 15 8 87 13 31 15 5 3 49
# … with 4 more variables: species <int>, films <int>, vehicles <int>,
# starships <int>, and abbreviated variable names ¹skin_color, ²eye_color,
# ³homeworld
Here, .x denotes a placeholder for the data. These pipelines are usually more useful after a groupby operation:
starwars %>%
group_by (homeworld) %>%
filter (n () > 1 ) %>%
summarise (across (where (is.numeric), ~ mean (.x, na.rm = TRUE )))
# A tibble: 10 × 4
homeworld height mass birth_year
<chr> <dbl> <dbl> <dbl>
1 Alderaan 176. 64 43
2 Corellia 175 78.5 25
3 Coruscant 174. 50 91
4 Kamino 208. 83.1 31.5
5 Kashyyyk 231 124 200
6 Mirial 168 53.1 49
7 Naboo 175. 64.2 55
8 Ryloth 179 55 48
9 Tatooine 170. 85.4 54.6
10 <NA> 139. 82 334.
We can also pass a list of functions:
starwars %>% summarise (
across (
where (is.numeric),
list (
min = ~ min (.x, na.rm = TRUE ),
max = ~ max (.x, na.rm = TRUE )
)
),
)
# A tibble: 1 × 6
height_min height_max mass_min mass_max birth_year_min birth_year_max
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 66 264 15 1358 8 896
And add custom naming conventions with {glue}:
starwars %>% summarise (
across (
where (is.numeric),
list (
min = ~ min (.x, na.rm = TRUE ),
max = ~ max (.x, na.rm = TRUE )
),
.names = "{.fn}_{.col}" # or "{.col}_{.fn}"
),
)
# A tibble: 1 × 6
min_height max_height min_mass max_mass min_birth_year max_birth_year
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 66 264 15 1358 8 896
Or relocate the columns afterwards:
starwars %>% summarise (
across (
where (is.numeric),
.fns = list (
min = ~ min (.x, na.rm = TRUE ),
max = ~ max (.x, na.rm = TRUE )
),
.names = "{.fn}_{.col}"
),
) %>%
relocate (starts_with ("min" ), .before = 1 )
# A tibble: 1 × 6
min_height min_mass min_birth_year max_height max_mass max_birth_year
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 66 15 8 264 1358 896
Of course, the list of functions can be stored as a variable outside the pipeline.
functions_list <- list (
min = ~ min (.x, na.rm = TRUE ),
max = ~ max (.x, na.rm = TRUE )
)
starwars %>% summarise (
across (
where (is.numeric),
.fns = functions_list,
.names = "{.fn}_{.col}"
),
) %>%
relocate (starts_with ("min" ), .before = 1 )
# A tibble: 1 × 6
min_height min_mass min_birth_year max_height max_mass max_birth_year
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 66 15 8 264 1358 896
As a side note, count() is a wrapper around . %>% groupby %>% summarise(.x = count(.x))
flights %>%
count (year, month, day, origin)
# A tibble: 1,095 × 5
year month day origin n
<int> <int> <int> <chr> <int>
1 2013 1 1 EWR 305
2 2013 1 1 JFK 297
3 2013 1 1 LGA 240
4 2013 1 2 EWR 350
5 2013 1 2 JFK 321
6 2013 1 2 LGA 272
7 2013 1 3 EWR 336
8 2013 1 3 JFK 318
9 2013 1 3 LGA 260
10 2013 1 4 EWR 339
# … with 1,085 more rows
Appendix: A {tidyr} detour
Some of the more common data operations require dropping, imputing or replacing missing values. Some other times you might want to separate or unite two columns (e.g. extract the prefix from a phone number), or expand a dataframe to include all possible combinations of values. The functions to perform these operations are built in {tidyr}, which is shipped by default in the {tidyverse}.
Extracting from columns
The functions extract() separate() and unite() are in charge of this behaviour. By default, they split using any non-alphanumeric value:
tibble (x = c (NA , "x.y" , "x.z" , "y.z" )) %>%
extract (x, "A" )
# A tibble: 4 × 1
A
<chr>
1 <NA>
2 x
3 x
4 y
tibble (x = c (NA , "x.y" , "x.z" , "y.z" )) %>%
extract (x, c ("A" , "B" ), "([a-z]+).([a-z]+)" )
# A tibble: 4 × 2
A B
<chr> <chr>
1 <NA> <NA>
2 x y
3 x z
4 y z
These are useful, but require knowing a bit of regex patterns.
The separate() function is a nice wrapper around extract():
tibble (x = c (NA , "x.y" , "x.z" , "y.z" )) %>%
separate (x, c ("A" , "B" ))
# A tibble: 4 × 2
A B
<chr> <chr>
1 <NA> <NA>
2 x y
3 x z
4 y z
And has a lot of advanced use cases. separate_rows() creates a new row for each observation:
tbl <- tibble (
x = 1 : 3 ,
y = c ("a" , "d,e,f" , "g,h" ),
z = c ("1" , "2,3,4" , "5,6" )
)
tbl
# A tibble: 3 × 3
x y z
<int> <chr> <chr>
1 1 a 1
2 2 d,e,f 2,3,4
3 3 g,h 5,6
tbl %>%
separate_rows (y, z, convert = TRUE ) #convert will perform type conversion
# A tibble: 6 × 3
x y z
<int> <chr> <int>
1 1 a 1
2 2 d 2
3 2 e 3
4 2 f 4
5 3 g 5
6 3 h 6
Uniting column has some use cases:
flights %>% unite (date, year, month, day) # the first column is the new colname
# A tibble: 336,776 × 17
date dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum
<chr> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
1 2013_… 517 515 2 830 819 11 UA 1545 N14228
2 2013_… 533 529 4 850 830 20 UA 1714 N24211
3 2013_… 542 540 2 923 850 33 AA 1141 N619AA
4 2013_… 544 545 -1 1004 1022 -18 B6 725 N804JB
5 2013_… 554 600 -6 812 837 -25 DL 461 N668DN
6 2013_… 554 558 -4 740 728 12 UA 1696 N39463
7 2013_… 555 600 -5 913 854 19 B6 507 N516JB
8 2013_… 557 600 -3 709 723 -14 EV 5708 N829AS
9 2013_… 557 600 -3 838 846 -8 B6 79 N593JB
10 2013_… 558 600 -2 753 745 8 AA 301 N3ALAA
# … with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
# and abbreviated variable names ¹dep_time, ²sched_dep_time, ³dep_delay,
# ⁴arr_time, ⁵sched_arr_time, ⁶arr_delay
But when working with dates it might be better to use dedicated packages (more on this below):
flights %>% transmute (date = lubridate:: make_date (year, month, day))
# A tibble: 336,776 × 1
date
<date>
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
5 2013-01-01
6 2013-01-01
7 2013-01-01
8 2013-01-01
9 2013-01-01
10 2013-01-01
# … with 336,766 more rows
flights %>% transmute (date = clock:: date_build (year, month, day))
# A tibble: 336,776 × 1
date
<date>
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
5 2013-01-01
6 2013-01-01
7 2013-01-01
8 2013-01-01
9 2013-01-01
10 2013-01-01
# … with 336,766 more rows
This is an advanced case of completing sequences with datetime objects. The base case is:
Combine values
fruits <- tibble (
type = c ("apple" , "mandarin" , "apple" , "mandarin" , "mandarin" , "mandarin" ),
color = c ("red" , "orange" , "red" , "orange" , "orange" , "orange" ),
year = c (2010 , 2010 , 2012 , 2010 , 2011 , 2012 ),
size = factor (
c ("XS" , "S" , "M" , "S" , "S" , "M" ),
levels = c ("XS" , "S" , "M" , "L" )
),
weights = rnorm (6 , as.numeric (size) + 2 )
)
fruits
# A tibble: 6 × 5
type color year size weights
<chr> <chr> <dbl> <fct> <dbl>
1 apple red 2010 XS 2.25
2 mandarin orange 2010 S 5.12
3 apple red 2012 M 4.71
4 mandarin orange 2010 S 2.17
5 mandarin orange 2011 S 4.64
6 mandarin orange 2012 M 5.43
fruits %>% select (type, size)
# A tibble: 6 × 2
type size
<chr> <fct>
1 apple XS
2 mandarin S
3 apple M
4 mandarin S
5 mandarin S
6 mandarin M
Note that the levels of size are XS, S, M, L and that L does not appear in the data.
If we expand the type and size columns, all combinations of all levels will appear.
fruits %>% expand (type, size)
# A tibble: 8 × 2
type size
<chr> <fct>
1 apple XS
2 apple S
3 apple M
4 apple L
5 mandarin XS
6 mandarin S
7 mandarin M
8 mandarin L
We use nesting() to perform the expansion only with the values that appear (i.e., the missing levels won’t be used!):
fruits %>% expand (type, nesting (size))
# A tibble: 6 × 2
type size
<chr> <fct>
1 apple XS
2 apple S
3 apple M
4 mandarin XS
5 mandarin S
6 mandarin M
Expand and complete grids
The {clock} and {lubridate} packages have functions to create datetime ranges:
start = as.Date ("2010-01-01" )
end = as.Date ("2010-01-31" )
clock:: date_seq (from = start, to = end, by= 1 )
[1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
[6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
[26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
[31] "2010-01-31"
clock:: date_seq (from = start, by = 1 , total_size = 31 )
[1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
[6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
[26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
[31] "2010-01-31"
We can use these to create grids:
expand_grid (
date = clock:: date_seq (from = start, by = 1 , total_size = 31 ),
from = c ("Duomo" , "Centrale FS" , "Garibaldi FS" ),
)
# A tibble: 93 × 2
date from
<date> <chr>
1 2010-01-01 Duomo
2 2010-01-01 Centrale FS
3 2010-01-01 Garibaldi FS
4 2010-01-02 Duomo
5 2010-01-02 Centrale FS
6 2010-01-02 Garibaldi FS
7 2010-01-03 Duomo
8 2010-01-03 Centrale FS
9 2010-01-03 Garibaldi FS
10 2010-01-04 Duomo
# … with 83 more rows
The functions above are useful, but to use them with a dataset we need to use some functions to join and possibly replace missing values. Fortunately for us, the complete is a wrapper around expand and other join functions to simplify the interface:
fruits %>% complete (type, size, year)
# A tibble: 25 × 5
type size year color weights
<chr> <fct> <dbl> <chr> <dbl>
1 apple XS 2010 red 2.25
2 apple XS 2011 <NA> NA
3 apple XS 2012 <NA> NA
4 apple S 2010 <NA> NA
5 apple S 2011 <NA> NA
6 apple S 2012 <NA> NA
7 apple M 2010 <NA> NA
8 apple M 2011 <NA> NA
9 apple M 2012 red 4.71
10 apple L 2010 <NA> NA
# … with 15 more rows
But we can also use a group-by, which will return the full table (with missing values).
fruits %>% group_by (type) %>% complete (size, year)
# A tibble: 21 × 5
# Groups: type [2]
type size year color weights
<chr> <fct> <dbl> <chr> <dbl>
1 apple XS 2010 red 2.25
2 apple XS 2012 <NA> NA
3 apple S 2010 <NA> NA
4 apple S 2012 <NA> NA
5 apple M 2010 <NA> NA
6 apple M 2012 red 4.71
7 apple L 2010 <NA> NA
8 apple L 2012 <NA> NA
9 mandarin XS 2010 <NA> NA
10 mandarin XS 2011 <NA> NA
# … with 11 more rows
To fill the NAs, we can use an ifelse or case_when statement for qualitative/categorical variables:
fruits %>%
group_by (type) %>%
complete (size, year) %>%
mutate (
color = ifelse (type == "apple" , "red" , "orange" ),
)
# A tibble: 21 × 5
# Groups: type [2]
type size year color weights
<chr> <fct> <dbl> <chr> <dbl>
1 apple XS 2010 red 2.25
2 apple XS 2012 red NA
3 apple S 2010 red NA
4 apple S 2012 red NA
5 apple M 2010 red NA
6 apple M 2012 red 4.71
7 apple L 2010 red NA
8 apple L 2012 red NA
9 mandarin XS 2010 orange NA
10 mandarin XS 2011 orange NA
# … with 11 more rows
fruits %>%
group_by (type) %>%
complete (size, year) %>%
arrange (type, year) %>%
mutate (
color = case_when (
type == "apple" ~ "red" ,
TRUE ~ "orange" # this denotes "every other case"
)
)
# A tibble: 21 × 5
# Groups: type [2]
type size year color weights
<chr> <fct> <dbl> <chr> <dbl>
1 apple XS 2010 red 2.25
2 apple S 2010 red NA
3 apple M 2010 red NA
4 apple L 2010 red NA
5 apple XS 2012 red NA
6 apple S 2012 red NA
7 apple M 2012 red 4.71
8 apple L 2012 red NA
9 mandarin XS 2010 orange NA
10 mandarin S 2010 orange 5.12
# … with 11 more rows
The case_when statement is longer, but is useful when there are multiple options. Note that the mutate verb was used and replaced the values that were already present. It would be more optimal to use other functions, such as replace_na and fill. In case of quantitative variables, perhaps a case statement matched with a custom function can help.
fruits %>%
group_by (type) %>%
complete (size, year) %>%
arrange (type, year) %>%
fill (color, .direction = "updown" ) # "downup" is equivalent in this case
# A tibble: 21 × 5
# Groups: type [2]
type size year color weights
<chr> <fct> <dbl> <chr> <dbl>
1 apple XS 2010 red 2.25
2 apple S 2010 red NA
3 apple M 2010 red NA
4 apple L 2010 red NA
5 apple XS 2012 red NA
6 apple S 2012 red NA
7 apple M 2012 red 4.71
8 apple L 2012 red NA
9 mandarin XS 2010 orange NA
10 mandarin S 2010 orange 5.12
# … with 11 more rows
Advanced topics
How to work with factors using {forcats} and with dates using {lubridate} or {clock}
Explore window functions in {dplyr}
Functional programming with {purr}