5 Data Manipulation via dplyr
Let’s briefly recap where we have been so far and where we are headed. In Chapter 3, we discussed what it means for data to be tidy: observational units correspond to rows and variables are stored in columns, so the entries in the data frame correspond to different combinations of observational units and variables. In the flights data frame, we saw that each row corresponded to a different flight leaving New York City. (In other words, the observational unit of that tidy data frame is a flight.) The variables are listed as columns, and for flights they include both quantitative variables like dep_delay and distance and categorical variables like carrier and origin. An entry in the table corresponds to the value of a given variable for a particular flight on a given day.
We saw in Chapter ?? that organizing data in this tidy way makes it easy for us to produce graphics. We can simply specify which variable/column we would like on one axis, which variable we’d like on the other axis, and what type of plot we’d like to make. Given this tidy data set, we can also change the color of our points by a third variable or their size by a fourth variable.
In Chapter ??, we also introduced some ways to summarize and manipulate data to suit your needs. This chapter goes into more detail by giving a variety of examples using the four main verbs in the dplyr package (Wickham and Francois 2016). More advanced operations are also possible, and you’ll see some examples of these near the end of the chapter.
Needed packages
library(dplyr)
library(ggplot2)
library(nycflights13)
library(knitr)
5.1 The pipe %>%
Just as the + sign was used to add layers to a plot created using ggplot, we will use the pipe operator (%>%) to chain together dplyr functions. We read the pipe operator as “and then”. The %>% operator allows us to go from one step in dplyr to the next easily, so we can filter our data frame to focus on only a few rows, and then take that filtered data set and group_by another variable, and then lastly summarize this grouped data to calculate the mean for each level of the group.
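To make the “and then” reading concrete, here is a minimal sketch that writes the same filter, group_by, and summarize steps twice: once as nested function calls and once with the pipe. The toy_flights data frame and its values are made up for illustration and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: four flights leaving two origins
toy_flights <- data.frame(
  origin    = c("JFK", "JFK", "EWR", "EWR"),
  dest      = c("PDX", "PDX", "PDX", "SEA"),
  dep_delay = c(10, 20, 5, 100)
)

# Nested calls: these have to be read from the inside out
nested <- summarize(
  group_by(filter(toy_flights, dest == "PDX"), origin),
  mean_delay = mean(dep_delay)
)

# Piped calls: read top to bottom as "and then"
piped <- toy_flights %>%
  filter(dest == "PDX") %>%
  group_by(origin) %>%
  summarize(mean_delay = mean(dep_delay))

nested
piped
```

Both versions should print the same two-row summary, one row per origin. The piped version simply lists the steps in the order they happen.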
The piping syntax will be our major focus throughout the rest of this book, and with some practice you’ll find that you quickly become addicted to chaining. If you’d like to see more examples of using dplyr, the four main verbs described in the next section (in addition to some other dplyr verbs), and %>% with the nycflights13 data set, you can check out Chapter 5 of Hadley and Garrett’s book (Grolemund and Wickham 2016).
5.2 Four Main Verbs - The 4MV
The d in dplyr stands for data frames, so the functions here work on objects of the data frame type. It’s most important for you to focus on the four most commonly used functions that help us manipulate and summarize data. A description of these verbs follows, with each subsection devoted to an example of that verb in play (or a combination of a few verbs):
- filter: Pick rows based on conditions about their values
- summarize: Create summary measures of variables (or of groups of observations on variables, using group_by)
- mutate: Make a new variable in the data frame
- arrange: Sort the rows based on one or more variables
Just as we had the 5NG (The Five Named Graphs in Chapter ?? using ggplot2), we have the 4MV here (The Four Main Verbs in dplyr).
5.2.1 Filter observations using filter
All of the 4MVs follow the same syntax: the name of the data frame comes before the pipe, followed by the name of the verb, with the other arguments in parentheses specifying the criteria you’d like the verb to work with.
The filter function here works much like the “Filter” option in Microsoft Excel. It allows you to specify criteria about the values of a variable in your data set and then keeps only those rows that match those criteria. We begin by focusing only on flights from New York City to Portland, Oregon. The dest code (or airport code) for Portland, Oregon is "PDX":
portland_flights <- flights %>% filter(dest == "PDX")
portland_flights
## # A tibble: 1,354 × 19
-## year month day dep_time sched_dep_time dep_delay arr_time
-## <int> <int> <int> <int> <int> <dbl> <int>
-## 1 2013 1 1 1739 1740 -1 2051
-## 2 2013 1 1 1805 1757 8 2117
-## 3 2013 1 1 2052 2029 23 2349
-## 4 2013 1 2 804 805 -1 1039
-## 5 2013 1 2 1552 1550 2 1853
-## 6 2013 1 2 1727 1720 7 2042
-## 7 2013 1 2 1738 1740 -2 2028
-## 8 2013 1 2 2024 2029 -5 2314
-## 9 2013 1 3 1755 1745 10 2110
-## 10 2013 1 3 1814 1727 47 2108
-## # ... with 1,344 more rows, and 12 more variables:
-## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
-## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
-## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note the second equals sign here. You are almost guaranteed to make the mistake of including only one equals sign at least once. Let’s see what happens when we make this error:
portland_flights <- flights %>% filter(dest = "PDX")
Error: filter() takes unnamed arguments. Do you need `==`?
You should run View(portland_flights) to glance at the data in spreadsheet form and ensure that only flights heading to Portland are chosen here.
You can combine multiple criteria using logical operators:
- | corresponds to “or”
- & corresponds to “and”
We can often skip the use of & and just separate our conditions with a comma. You’ll see this in the example below.
In addition, you can use other mathematical checks (similar to ==):
- > corresponds to “greater than”
- < corresponds to “less than”
- >= corresponds to “greater than or equal to”
- <= corresponds to “less than or equal to”
- != corresponds to “not equal to”
To see many of these in action, let’s select all flights that left JFK airport heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") in the months of October, November, or December:
btv_sea_flights_fall <- flights %>% filter(
  origin == "JFK",
  (dest == "BTV") | (dest == "SEA"),
  month >= 10)
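As a small aside, a chain of “or” comparisons against the same variable can also be written with base R’s %in% operator, and the commas between conditions can be replaced with &. The sketch below checks this on a made-up four-row data frame (toy is hypothetical, not the real flights data).

```r
library(dplyr)

# Hypothetical toy data with the three columns used in the filter
toy <- data.frame(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest   = c("BTV", "SEA", "SEA", "PDX"),
  month  = c(10, 11, 12, 12)
)

# Conditions separated by commas, with | for "or"
with_or <- toy %>% filter(
  origin == "JFK",
  (dest == "BTV") | (dest == "SEA"),
  month >= 10)

# The same conditions joined with & and using %in% for "one of these"
with_in <- toy %>% filter(
  origin == "JFK" & dest %in% c("BTV", "SEA") & month >= 10)

with_or
with_in
```

Both filters keep the same two rows: the JFK flights to BTV and to SEA. The %in% form tends to stay readable as the list of destinations grows.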
Another example uses the ! operator to pick rows that DON’T match a condition. Here we exclude the Northern Hemisphere summer months of June, July, and August.
not_summer_flights <- flights %>% filter(!between(month, 6, 8))
not_summer_flights
## # A tibble: 249,781 × 19
-## year month day dep_time sched_dep_time dep_delay arr_time
-## <int> <int> <int> <int> <int> <dbl> <int>
-## 1 2013 1 1 517 515 2 830
-## 2 2013 1 1 533 529 4 850
-## 3 2013 1 1 542 540 2 923
-## 4 2013 1 1 544 545 -1 1004
-## 5 2013 1 1 554 600 -6 812
-## 6 2013 1 1 554 558 -4 740
-## 7 2013 1 1 555 600 -5 913
-## 8 2013 1 1 557 600 -3 709
-## 9 2013 1 1 557 600 -3 838
-## 10 2013 1 1 558 600 -2 753
-## # ... with 249,771 more rows, and 12 more variables:
-## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
-## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
-## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
To check that we are correct here, we can use the count function in the dplyr package on the month variable in our not_summer_flights data frame to ensure June, July, and August are not selected:
not_summer_flights %>% count(month)
## # A tibble: 1 × 1
##    `1.n`
##    <int>
## 1 249781
The function between is a shortcut. We could also have written the following to get the same result:
not_summer2 <- flights %>% filter(month <= 5 | month >= 9)
not_summer2 %>% count(month)
## # A tibble: 1 × 1
##    `1.n`
##    <int>
## 1 249781
Learning check
(LC5.1) What’s another way using ! we could filter only the rows that are not summer months (June, July, or August) in the flights data frame?
5.2.2 Summarize variables using summarize
We saw in Subsection ?? a way to calculate the standard deviation and mean of the temperature variable temp in the weather data frame of nycflights13. We can do so in one step using the summarize function in dplyr:
weather %>% summarize(mean = mean(temp), std_dev = sd(temp))
## # A tibble: 1 × 2
##    mean std_dev
##   <dbl>   <dbl>
## 1    NA      NA
What happened here? The mean and the standard deviation of the temperatures are missing. Remember that by default the mean and sd functions do not ignore missing values, so even a single NA in temp makes the result NA. We need to specify TRUE for the na.rm parameter:
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp
## # A tibble: 1 × 2
##       mean  std_dev
##      <dbl>    <dbl>
## 1 55.20351 17.78212
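The effect of na.rm is easiest to see on a tiny vector. The following base R sketch uses three made-up temperatures, one of them missing. The vector temps is hypothetical and only stands in for the temp column.

```r
# Hypothetical temperatures with one missing value
temps <- c(50, NA, 70)

mean_default <- mean(temps)                # the NA propagates to the result
mean_removed <- mean(temps, na.rm = TRUE)  # the NA is dropped before averaging
n_missing    <- sum(is.na(temps))          # how many values were dropped

mean_default
mean_removed
n_missing
```

Here mean_default is NA, mean_removed is 60 (the average of 50 and 70), and n_missing is 1. Counting the missing values alongside the summary is a useful habit, since na.rm = TRUE silently changes how many observations the summary is based on.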
We’ve created a small data frame here called summary_temp that includes both the mean and the std_dev of the temp variable in weather. If we’d like to access either of these values directly, we can use the $ operator to specify a column in a data frame:
summary_temp$mean
## [1] 55.20351
summary_temp$std_dev
## [1] 17.78212
It’s often more useful to summarize a variable based on the groupings of another variable. Let’s say we were interested in the mean and standard deviation of temperatures for each month. We believe that you will be amazed at just how simple this is:
summary_tempXmonth <- weather %>%
  group_by(month) %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_tempXmonth
## # A tibble: 12 × 3
-## month mean std_dev
-## <dbl> <dbl> <dbl>
-## 1 1 35.64127 10.185459
-## 2 2 34.15454 6.940228
-## 3 3 39.81404 6.224948
-## 4 4 51.67094 8.785250
-## 5 5 61.59185 9.608687
-## 6 6 72.14500 7.603356
-## 7 7 80.00967 7.147631
-## 8 8 74.40495 5.171365
-## 9 9 67.42582 8.475824
-## 10 10 60.03305 8.829652
-## 11 11 45.10893 10.502249
-## 12 12 38.36811 9.940822
By simply grouping the weather data set by month first and then passing this new data frame into summarize, we get a resulting data frame that shows the mean and standard deviation of temperature for each month in New York City.
Another useful function is the n function, which gives a count of how many rows appear in each group. Suppose we’d like to get a sense of how many flights departed each of the three airports in New York City:
by_origin <- flights %>%
  group_by(origin) %>%
  summarize(count = n())
by_origin
## # A tibble: 3 × 2
##   origin  count
##    <chr>  <int>
## 1    EWR 120835
## 2    JFK 111279
## 3    LGA 104662
We see that Newark ("EWR") had the most flights departing in 2013, followed by "JFK" and lastly by LaGuardia ("LGA").
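The count function used earlier in this chapter can be thought of as shorthand for this group_by plus summarize(n()) pattern. The sketch below compares the two on a made-up data frame (toy is hypothetical), with the summary column named n so that both results use the same column name.

```r
library(dplyr)

# Hypothetical toy data: five flights from three origins
toy <- data.frame(origin = c("EWR", "EWR", "JFK", "LGA", "EWR"))

# The long way: group, then count the rows in each group
long_way <- toy %>%
  group_by(origin) %>%
  summarize(n = n())

# The short way: count() groups and tallies in one step
short_way <- toy %>% count(origin)

long_way
short_way
```

Both results have one row per origin, with 3 flights for EWR and 1 each for JFK and LGA.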
Learning check
(LC5.2) Recall from Chapter ?? when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_tempXmonth data frame tell us about temperatures in New York City throughout the year?
(LC5.3) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
(LC5.4) How could we identify how many flights left each of the three airports in each of the months of 2013?
5.2.3 Create new variables/change old variables using mutate
When looking at the flights data set, there are some clear additional variables that could be calculated based on the values of variables already in the data set. Passengers are often frustrated when their flight departs late, but their mood may improve a bit if pilots can make up some time during the flight to get them to their destination close to when they expected to land. This is commonly referred to as “gain” and we will create this variable using the mutate function. Note that we have also overwritten the flights data frame: it now contains everything it did before plus the additional variable gain.
flights <- flights %>% mutate(gain = arr_delay - dep_delay)
We can now look at summary measures of this gain variable and even plot it in the form of a histogram:
gain_summary <- flights %>%
  summarize(
    min = min(gain, na.rm = TRUE),
    q1 = quantile(gain, 0.25, na.rm = TRUE),
    median = quantile(gain, 0.5, na.rm = TRUE),
    q3 = quantile(gain, 0.75, na.rm = TRUE),
    max = max(gain, na.rm = TRUE),
    mean = mean(gain, na.rm = TRUE),
    sd = sd(gain, na.rm = TRUE),
    missing = sum(is.na(gain))
  )
gain_summary
## # A tibble: 1 × 8
##     min    q1 median    q3   max      mean       sd missing
##   <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl>   <int>
## 1  -109   -17     -7     3   196 -5.659779 18.04365    9430
We’ve recreated the summary function we saw in Chapter ?? here using the summarize function in dplyr.
library(ggplot2)
ggplot(data = flights, mapping = aes(x = gain)) +
  geom_histogram(color = "white", bins = 20)
We can also create multiple columns at once and even refer to columns that were just created in a new column. Hadley produces one such example in Chapter 5 of “R for Data Science” (Grolemund and Wickham 2016):
flights_plus <- flights %>% mutate(
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
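To check the arithmetic by hand, it can help to run the same mutate call on a few rows whose values are easy to compute mentally. The toy data frame below is made up for illustration. It also shows that mutate returns a new data frame and leaves the original untouched unless you assign the result back to the same name.

```r
library(dplyr)

# Hypothetical toy data: three flights with delays and air time in minutes
toy <- data.frame(
  dep_delay = c(10, 30, -5),
  arr_delay = c(0, 45, -5),
  air_time  = c(120, 60, 90)
)

toy_plus <- toy %>% mutate(
  gain = arr_delay - dep_delay,  # uses existing columns
  hours = air_time / 60,         # converts minutes to hours
  gain_per_hour = gain / hours   # uses the two columns created just above
)

names(toy)       # still only the three original columns
toy_plus
```

For these rows, gain works out to -10, 15, and 0, hours to 2, 1, and 1.5, and gain_per_hour to -5, 15, and 0.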
Learning check
(LC5.5) What do positive values of the gain variable in flights_plus correspond to? What about negative values? And what about a zero value?
(LC5.6) Could we create the dep_delay and arr_delay columns by simply subtracting sched_dep_time from dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.
(LC5.7) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values.
5.2.4 Reorder the data frame using arrange
As you may have thought about with the data frames we’ve worked with so far in the book, one of the most common things you’d like to do is sort a data frame by a specific column. Have you ever been asked to calculate a median by hand? This requires you to put the data in order from smallest to largest in value. The dplyr package has a function called arrange that we will use to sort/reorder our data according to the values of the specified variable. This is most frequently used after we have used the group_by and summarize functions, as we will see.
Let’s suppose we were interested in determining the most frequent destination airports from New York City in 2013:
freq_dest <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n())
freq_dest
## # A tibble: 105 × 2
-## dest num_flights
-## <chr> <int>
-## 1 ABQ 254
-## 2 ACK 265
-## 3 ALB 439
-## 4 ANC 8
-## 5 ATL 17215
-## 6 AUS 2439
-## 7 AVL 275
-## 8 BDL 443
-## 9 BGR 375
-## 10 BHM 297
-## # ... with 95 more rows
You’ll see that by default the values of dest are displayed in alphabetical order here. Remember to use View() in the R Console to look at all the values of freq_dest in spreadsheet format. We are interested in finding those airports that appear most:
freq_dest %>% arrange(num_flights)
## # A tibble: 105 × 2
-## dest num_flights
-## <chr> <int>
-## 1 LEX 1
-## 2 LGA 1
-## 3 ANC 8
-## 4 SBN 10
-## 5 HDN 15
-## 6 MTJ 15
-## 7 EYW 17
-## 8 PSP 19
-## 9 JAC 25
-## 10 BZN 36
-## # ... with 95 more rows
This is actually giving us the opposite of what we are looking for. It tells us the least frequent destination airports first. To switch the ordering to be descending instead of ascending, we use the desc function:
freq_dest %>% arrange(desc(num_flights))
## # A tibble: 105 × 2
-## dest num_flights
-## <chr> <int>
-## 1 ORD 17283
-## 2 ATL 17215
-## 3 LAX 16174
-## 4 BOS 15508
-## 5 MCO 14082
-## 6 CLT 14064
-## 7 SFO 13331
-## 8 FLL 12055
-## 9 MIA 11728
-## 10 DCA 9705
-## # ... with 95 more rows
5.3 Other verbs
5.3.1 Select variables using select
We’ve seen that the flights data frame in the nycflights13 package contains many different variables (19 in fact). You can identify this by running the dim function or the ncol function:
data(flights)
dim(flights)
## [1] 336776     19
ncol(flights)
## [1] 19
One of these variables is year. If you remember the original description of the flights data frame (or if you run ?flights), you’ll remember that this data corresponds to flights in 2013 departing New York City. The year variable isn’t really a variable here in that it doesn’t vary… flights actually comes from a larger data set that covers many years. We may want to remove the year variable from our data set since it won’t be helpful for analysis in this case. To do so easily, we use the select function:
flights_small <- flights %>% select(-year)
names(flights_small)
##  [1] "month"          "day"            "dep_time"       "sched_dep_time"
##  [5] "dep_delay"      "arr_time"       "sched_arr_time" "arr_delay"
##  [9] "carrier"        "flight"         "tailnum"        "origin"
## [13] "dest"           "air_time"       "distance"       "hour"
## [17] "minute"         "time_hour"
The names function gives a listing of all the columns in a data frame. We see that year has been removed. This was done using a - in front of the name of the column we’d like to remove.
We could also select specific columns (instead of deselecting columns) by listing them out:
flight_dep_times <- flights %>% select(month, day, dep_time, sched_dep_time)
flight_dep_times
## # A tibble: 336,776 × 4
-## month day dep_time sched_dep_time
-## <int> <int> <int> <int>
-## 1 1 1 517 515
-## 2 1 1 533 529
-## 3 1 1 542 540
-## 4 1 1 544 545
-## 5 1 1 554 600
-## 6 1 1 554 558
-## 7 1 1 555 600
-## 8 1 1 557 600
-## 9 1 1 557 600
-## 10 1 1 558 600
-## # ... with 336,766 more rows
Or we could specify ranges of columns:
flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time)
flight_arr_times
## # A tibble: 336,776 × 4
-## month day arr_time sched_arr_time
-## <int> <int> <int> <int>
-## 1 1 1 830 819
-## 2 1 1 850 830
-## 3 1 1 923 850
-## 4 1 1 1004 1022
-## 5 1 1 812 837
-## 6 1 1 740 728
-## 7 1 1 913 854
-## 8 1 1 709 723
-## 9 1 1 838 846
-## 10 1 1 753 745
-## # ... with 336,766 more rows
The select function can also be used to reorder columns in combination with the everything helper function. Let’s suppose we’d like the hour, minute, and time_hour variables, which appear at the end of the flights data set, to actually appear immediately after the day variable:
flights_reorder <- flights %>% select(month:day, hour:time_hour, everything())
names(flights_reorder)
##  [1] "month"          "day"            "hour"           "minute"
##  [5] "time_hour"      "year"           "dep_time"       "sched_dep_time"
##  [9] "dep_delay"      "arr_time"       "sched_arr_time" "arr_delay"
## [13] "carrier"        "flight"         "tailnum"        "origin"
## [17] "dest"           "air_time"       "distance"
Lastly, the helper functions starts_with, ends_with, and contains can be used to choose column names that match those conditions:
flights_begin_a <- flights %>% select(starts_with("a"))
flights_begin_a
## # A tibble: 336,776 × 3
-## arr_time arr_delay air_time
-## <int> <dbl> <dbl>
-## 1 830 11 227
-## 2 850 20 227
-## 3 923 33 160
-## 4 1004 -18 183
-## 5 812 -25 116
-## 6 740 12 150
-## 7 913 19 158
-## 8 709 -14 53
-## 9 838 -8 140
-## 10 753 8 138
-## # ... with 336,766 more rows
flights_delays <- flights %>% select(ends_with("delay"))
flights_delays
## # A tibble: 336,776 × 2
-## dep_delay arr_delay
-## <dbl> <dbl>
-## 1 2 11
-## 2 4 20
-## 3 2 33
-## 4 -1 -18
-## 5 -6 -25
-## 6 -4 12
-## 7 -5 19
-## 8 -3 -14
-## 9 -3 -8
-## 10 -2 8
-## # ... with 336,766 more rows
flights_time <- flights %>% select(contains("time"))
flights_time
## # A tibble: 336,776 × 6
-## dep_time sched_dep_time arr_time sched_arr_time air_time
-## <int> <int> <int> <int> <dbl>
-## 1 517 515 830 819 227
-## 2 533 529 850 830 227
-## 3 542 540 923 850 160
-## 4 544 545 1004 1022 183
-## 5 554 600 812 837 116
-## 6 554 558 740 728 150
-## 7 555 600 913 854 158
-## 8 557 600 709 723 53
-## 9 557 600 838 846 140
-## 10 558 600 753 745 138
-## # ... with 336,766 more rows, and 1 more variables: time_hour <dttm>
5.3.2 Rename variables using rename
Another useful function is rename, which as you may suspect renames one column to another name. Suppose we wanted dep_time and arr_time to be departure_time and arrival_time instead in the flights_time data frame:
flights_time <- flights_time %>%
  rename(departure_time = dep_time,
         arrival_time = arr_time)
names(flights_time)
## [1] "departure_time" "sched_dep_time" "arrival_time"   "sched_arr_time"
## [5] "air_time"       "time_hour"
It’s easy to forget if the new name comes before or after the equals sign. I usually remember this as “New Before, Old After” or NBOA.
You’ll receive an error if you try to do it the other way:
Error: Unknown variables: departure_time, arrival_time.
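The “New Before, Old After” rule can be tried out safely on a small data frame. In the sketch below, toy is a made-up two-column data frame, and tryCatch is used only so that the deliberately wrong call does not stop the script. The exact wording of the error message depends on your dplyr version, so it is not reproduced here.

```r
library(dplyr)

# Hypothetical toy data with the two columns we want to rename
toy <- data.frame(dep_time = c(517, 533), arr_time = c(830, 850))

# Correct order: new_name = old_name
renamed <- toy %>%
  rename(departure_time = dep_time,
         arrival_time = arr_time)
names(renamed)

# Wrong order: old_name = new_name, which refers to a column that
# does not exist, so rename() signals an error that we catch here
wrong_order <- tryCatch(
  toy %>% rename(dep_time = departure_time),
  error = function(e) "error"
)
wrong_order
```

The first call returns the columns departure_time and arrival_time, while the second ends up as the string "error" because rename could not find a column called departure_time in toy.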
Learning check
(LC5.8) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.
(LC5.9) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains.
(LC5.10) Why might we want to use the select function on a data frame?
5.3.3 Find the top number of values using top_n
We can also use the top_n function, which automatically keeps the rows with the largest values of a variable, here num_flights. We specify the top 10 airports here:
freq_dest %>% top_n(n = 10, wt = num_flights)
## # A tibble: 10 × 2
-## dest num_flights
-## <chr> <int>
-## 1 ATL 17215
-## 2 BOS 15508
-## 3 CLT 14064
-## 4 DCA 9705
-## 5 FLL 12055
-## 6 LAX 16174
-## 7 MCO 14082
-## 8 MIA 11728
-## 9 ORD 17283
-## 10 SFO 13331
We’ll still need to arrange this by num_flights though:
freq_dest %>% top_n(n = 10, wt = num_flights) %>%
  arrange(desc(num_flights))
## # A tibble: 10 × 2
-## dest num_flights
-## <chr> <int>
-## 1 ORD 17283
-## 2 ATL 17215
-## 3 LAX 16174
-## 4 BOS 15508
-## 5 MCO 14082
-## 6 CLT 14064
-## 7 SFO 13331
-## 8 FLL 12055
-## 9 MIA 11728
-## 10 DCA 9705
Note: Remember that I didn’t pull the n and wt arguments out of thin air. They can be found by using the ? function on top_n.
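If you are working with a more recent version of dplyr than the one used in this chapter (1.0.0 or later is assumed here), the slice_max function covers the same ground as top_n followed by arrange. The following is only a sketch on a made-up data frame (toy is hypothetical) so the results are easy to check.

```r
library(dplyr)

# Hypothetical toy data: four destinations and their flight counts
toy <- data.frame(
  dest        = c("ABQ", "ATL", "BOS", "ORD"),
  num_flights = c(254, 17215, 15508, 17283)
)

# top_n keeps the two largest rows but leaves them in their original order
top_two <- toy %>% top_n(n = 2, wt = num_flights)

# slice_max (assumes dplyr >= 1.0.0) keeps the two largest rows,
# ordered from largest to smallest
max_two <- toy %>% slice_max(num_flights, n = 2)

top_two
max_two
```

Here top_two lists ATL and then ORD, as they appear in toy, while max_two lists ORD first because it has the larger count.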
We can go one step further and tie together the group_by and summarize functions we used to find the most frequent flights:
ten_freq_dests <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  top_n(n = 10) %>%
  arrange(desc(num_flights))
## Selecting by num_flights
Learning check
Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.
5.4 Joining/merging data frames
Something you may have thought to yourself as you looked at the most frequent destinations of flights from NYC in 2013 is
- “What cities are these airports in?”
- “Is "ORD" Orlando?”
- “Where is "FLL"?”
The nycflights13 data package contains multiple data frames. Instead of having to manually look up the airport names corresponding to airport codes like ORD, we can have R automatically do this “looking up” for us. To do so, we’ll need to tell R how to match one data frame to another data frame. Let’s first check out the airports data frame inside of R:
View(airports)
The first column faa corresponds to the airport codes that we saw in dest in our flights and subsequent ten_freq_dests data sets. Hadley and Garrett (Grolemund and Wickham 2016) created the following diagram to help us understand how the different data sets are linked:
We see from View(airports) that airports contains a lot of other information about 1458 airports. We are only really interested here in the faa and name columns. Let’s use the select function to only use those variables:
airports_small <- airports %>% select(faa, name)
Now that we have a data frame that identifies the names of the airports, we can use the inner_join function to bring the two data frames together. The by argument tells R that the dest column of ten_freq_dests should be matched to the faa column of airports_small. Note that we will also rename the resulting column name as airport_name:
named_freq_dests <- ten_freq_dests %>%
  inner_join(airports_small, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_freq_dests
## # A tibble: 10 × 3
-## dest num_flights airport_name
-## <chr> <int> <chr>
-## 1 ORD 17283 Chicago Ohare Intl
-## 2 ATL 17215 Hartsfield Jackson Atlanta Intl
-## 3 LAX 16174 Los Angeles Intl
-## 4 BOS 15508 General Edward Lawrence Logan Intl
-## 5 MCO 14082 Orlando Intl
-## 6 CLT 14064 Charlotte Douglas Intl
-## 7 SFO 13331 San Francisco Intl
-## 8 FLL 12055 Fort Lauderdale Hollywood Intl
-## 9 MIA 11728 Miami Intl
-## 10 DCA 9705 Ronald Reagan Washington Natl
In case you didn’t know, "ORD" is the airport code of Chicago O’Hare airport and "FLL" is the main airport in Fort Lauderdale, Florida, which we can now see in our named_freq_dests data frame.
A visual representation of the inner_join is given below (Grolemund and Wickham 2016):
There are more complex joins available, but the inner_join will solve nearly all of the problems you’ll face in our experience.
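One situation where another join may be worth knowing about is when some keys have no match. An inner_join drops those rows, while dplyr’s left_join keeps every row of the first data frame and fills in NA where no match is found. The sketch below illustrates the difference on made-up data: the dests and airport_names data frames are hypothetical, and "XXX" is an invented code with no matching airport.

```r
library(dplyr)

# Hypothetical toy data: "XXX" has no entry in the lookup table
dests <- data.frame(
  dest        = c("ORD", "FLL", "XXX"),
  num_flights = c(17283, 12055, 1)
)
airport_names <- data.frame(
  faa  = c("ORD", "FLL"),
  name = c("Chicago Ohare Intl", "Fort Lauderdale Hollywood Intl")
)

# inner_join keeps only the rows whose key appears in both data frames
inner <- dests %>% inner_join(airport_names, by = c("dest" = "faa"))

# left_join keeps every row of dests and uses NA for missing names
left <- dests %>% left_join(airport_names, by = c("dest" = "faa"))

inner
left
```

The inner join returns two rows, while the left join returns all three, with NA as the name for "XXX". Comparing the number of rows before and after a join is a quick way to notice unmatched keys.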
Learning check
(LC5.11) What happens when you try to inner_join the ten_freq_dests data frame with airports instead of airports_small? How might one use this result to answer further questions about the top 10 destinations?
(LC5.12) What surprises you about the top 10 destinations from NYC in 2013?
As we saw with the RStudio cheatsheet on data visualization, RStudio has also created a cheatsheet for data manipulation entitled “Data Wrangling with dplyr and tidyr” available here. We will focus only on the dplyr functions in this book, but you are encouraged to also explore tidyr if you are presented with data that is not in the tidy format that we have specified as the preferred option for our purposes.
5.5 Script of R code
An R script file of all R code used in this chapter is available here.
5.6 What’s to come?
This concludes the Data Exploration unit of this book. You should be pretty proficient in both plotting variables (or multiple variables together) in various data sets and manipulating data as we’ve done in this chapter. You are encouraged to step back through the code in earlier chapters and make changes as you see fit based on your updated knowledge.
In Chapter ??, we’ll begin to build the pieces needed to understand how this unit of Data Exploration can tie into statistical inference in the Inference part of the book. Remember that the focus throughout is on data visualization and we’ll see that next when we discuss sampling, resampling, and bootstrapping. These ideas will lead us into hypothesis testing and confidence intervals.
References
Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. http://r4ds.had.co.nz/.