library(dplyr)
<- nycflights13::flights flights
8 Data management and aggregation
8.1 Introduction (Video)
When working with data you must:
- Figure out what you want to do.
- Precisely describe what you want in the form of a computer program.
- Execute the code.
The dplyr package makes each of these steps as fast and easy as possible by:
- Elucidating the most common data manipulation operations, so that your options are helpfully constrained when thinking about how to tackle a problem.
- Providing simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.
- Using efficient data storage back ends, so that you spend as little time waiting for the computer as possible.
After completing this lesson students will be able to
- Explain the difference between a
data.table
and atibble
. - Build and execute a chain of command to accomplish a data management task
- Extract certain rows using
select
. - Create new variables using
mutate
. - Rename variables using
rename
- Subset the data based on a criteria using
filter
. - Create summary statistics using
group_by
andsummarize
- Learn how to use code chunk options to disable warning messages.
Prior to this lesson students should
- Download the [08_dplyr_notes.Rmd] R markdown file and save into your
Math130/notes
folder. - Ensure that the
dplyr
andnycflights13
packages are installed by running the first code chunk.
8.1.1 Exploring airline flight data with dplyr.
The nycflights13
package contains several data sets that can be used to help understand what causes delays. We will be using the flights
data set which contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013.
8.1.2 Tibbles
The flights
data set, and any data set created with dplyr
, has a specific data type called a tibble
. These are not as furry and prolific as their cousins the tribbles
. tibbles
behaves for all intents and purposes as a data.frame
, just gets displayed differently. For example, the flights
data set contains data on 19 characteristics (variables) from 336,776 flights. There’s no way I would want to print out a data set that large. But I’m gonna….
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The output has been trimmed to something more reasonable for our viewing pleasure. This may not seem such a big deal because R Studio already provides some level of truncation for our viewing pleasure.
8.2 Basic verbs (Video)
The dplyr
package contains new data manipulation functions, also called verbs. We will look at the following verbs:
filter()
: Returns a subset of the rows.select()
: Returns only the listed columns.rename()
: Renames the variables listed.mutate()
: Adds columns from existing data.summarise()
: Reduces each group to a single row by calculating aggregate measures.group_by()
: Groups a data set on a factor variable, such that all functions performed are then done on each level of the factor.
8.2.1 Filter
filter()
allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, and the second and subsequent are filtering expressions evaluated in the context of that data frame. For example, we can select all flights on January 1st with
filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
filter()
works similarly to subset()
except that you can give it any number of filtering conditions which are joined together with &. You can use other Boolean operators explicitly. Here we select flights in January or February.
filter(flights, month == 1 | month == 2)
# A tibble: 51,955 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 51,945 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
8.2.2 Select
Often you work with large data sets with many columns where only a few are actually of interest to you. select()
allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions.
select(flights, month, day, year)
# A tibble: 336,776 × 3
month day year
<int> <int> <int>
1 1 1 2013
2 1 1 2013
3 1 1 2013
4 1 1 2013
5 1 1 2013
6 1 1 2013
7 1 1 2013
8 1 1 2013
9 1 1 2013
10 1 1 2013
# ℹ 336,766 more rows
You can use a colon (:) to select all columns physically located between two variables.
select(flights, year:day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ℹ 336,766 more rows
To exclude specific columns you use the minus sign (-)
select(flights, -carrier)
# A tibble: 336,776 × 18
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 10 more variables: arr_delay <dbl>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
This also works to exclude all columns EXCEPT the ones between two variables.
select(flights, -(year:day))
# A tibble: 336,776 × 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
8.2.3 Rename
Sometimes variables come to you in really obscure naming conventions. What the heck is SBA641? New to dplyr
1.0.0 is the rename()
function. Works like magic to convert old name
to a new name
. The generic syntax is rename(new = old)
So to rename dep_time
to departure_time
we would type
rename(flights, departure_time = dep_time)
# A tibble: 336,776 × 19
year month day departure_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The variable name on the 3rd column now says departure_time
instead of dep_time
.
For the purpose of these lecture notes I am not making this change permenant. There is no assignment operator
<-
used here, so this change is not going to persist into later code.
8.2.4 Mutate
As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate()
!
Here we create two variables: gain (as arrival delay minus departure delay) and speed (as distance divided by time, converted to hours).
<- mutate(flights, gain = arr_delay - dep_delay,
a speed = distance / air_time * 60)
select(a, gain, distance, air_time, speed)
# A tibble: 336,776 × 4
gain distance air_time speed
<dbl> <dbl> <dbl> <dbl>
1 9 1400 227 370.
2 16 1416 227 374.
3 31 1089 160 408.
4 -17 1576 183 517.
5 -19 762 116 394.
6 16 719 150 288.
7 24 1065 158 404.
8 -11 229 53 259.
9 -5 944 140 405.
10 10 733 138 319.
# ℹ 336,766 more rows
One key advantage of mutate is that you can refer to the columns you just created. Mutate flights to create two variables, gain = arr_delay - dep_delay
and gain_per_hour = gain / (air_time / 60)
.
mutate(flights, gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60 ))
# A tibble: 336,776 × 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>
8.2.5 Summarize
The last verb is summarise()
, which collapses a data frame to a single row. It’s not very useful yet. We can create a new variable called delay that is the average departure delay on the entire flights data set.
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
delay
<dbl>
1 12.6
8.3 Grouped Operations (Video)
The above verbs are useful, but they become really powerful when you combine them with the idea of “group by”, repeating the operation individually on groups of observations within the dataset. In dplyr, you use the group_by()
function to describe how to break a dataset down into groups of rows. You can then use the resulting object in exactly the same functions as above; they’ll automatically work “by group” when the input is a grouped.
Let’s demonstrate how some of these functions work after grouping the flights data set by month. First we’ll create a new data set that is grouped by month.
<- group_by(flights, month) by_month
- The
summarise()
verb allows you to calculate summary statistics for each group. This is probably the most common function that is used in conjunction withgroup_by
. For example, the average distance flown per month.
summarise(by_month, avg_airtime = mean(distance, na.rm=TRUE))
# A tibble: 12 × 2
month avg_airtime
<int> <dbl>
1 1 1007.
2 2 1001.
3 3 1012.
4 4 1039.
5 5 1041.
6 6 1057.
7 7 1059.
8 8 1062.
9 9 1041.
10 10 1039.
11 11 1050.
12 12 1065.
Or simply the total number of flights per month.
summarize(by_month, count=n())
# A tibble: 12 × 2
month count
<int> <int>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
8.4 Chaining Operations (Video)
Consider the following group of operations that take the data set flights
, and produce a final data set (a4
) that contains only the flights where the daily average delay is greater than a half hour.
<- group_by(flights, year, month, day)
a1 <- select(a1, arr_delay, dep_delay) a2
Adding missing grouping variables: `year`, `month`, `day`
<- summarise(a2,
a3 arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
<- filter(a3, arr > 30 | dep > 30)
a4 head(a4)
# A tibble: 6 × 5
# Groups: year, month [3]
year month day arr dep
<int> <int> <int> <dbl> <dbl>
1 2013 1 16 34.2 24.6
2 2013 1 31 32.6 28.7
3 2013 2 11 36.3 39.1
4 2013 2 27 31.3 37.8
5 2013 3 8 85.9 83.5
6 2013 3 18 41.3 30.1
It does the trick, but what if you don’t want to save all the intermediate results (a1
- a3
)? Well these verbs are function
, so they can be wrapped inside other functions to create a nesting type structure.
filter(
summarise(
select(
group_by(flights, year, month, day),
arr_delay, dep_delay
),arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
),> 30 | dep > 30
arr )
Woah, that is HARD to read! This is difficult to read because the order of the operations is from inside to out, and the arguments are a long way away from the function. To get around this problem, dplyr provides the %>%
operator. x %>% f(y)
turns into f(x, y)
so you can use it to rewrite multiple operations so you can read from left-to-right, top-to-bottom:
%>%
flights group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
%>%
) filter(arr > 30 | dep > 30)
Adding missing grouping variables: `year`, `month`, `day`
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 49 × 5
# Groups: year, month [11]
year month day arr dep
<int> <int> <int> <dbl> <dbl>
1 2013 1 16 34.2 24.6
2 2013 1 31 32.6 28.7
3 2013 2 11 36.3 39.1
4 2013 2 27 31.3 37.8
5 2013 3 8 85.9 83.5
6 2013 3 18 41.3 30.1
7 2013 4 10 38.4 33.0
8 2013 4 12 36.0 34.8
9 2013 4 18 36.0 34.9
10 2013 4 19 47.9 46.1
# ℹ 39 more rows
Another way you can read this is by thinking “and then” when you see the %>%
operator. So the above code takes the data set flights
.. and then groups by day
.. and then selects the delay variables
.. and then calculates the means
.. and then filters on a delay over half hour.
The same 4 steps that resulted in the a4
data set, but without all the intermediate data saved! This can be very important when dealing with Big Data. R
stores all data in memory, so if your little computer only has 2G of RAM and you’re working with a data set that is 500M in size, your computers memory will be used up fast. a1
takes 500M, a2
another 500M, by now your computer is getting slow. Make another copy at a3
and it gets worse, a4
now likely won’t even be able to be created because you’ll be out of memory.