Associated Material
Module: Module 07 - Combining data
Readings
How to combine datasets together
- rbind
- cbind
- cbind and maintaining data integrity
Adding additional columns onto data sets by using a common identifier (key) between datasets to maintain data relationships
- inner_join from dplyr will join datasets on a key and at the same time filter so that only rows where there is a matching key will be kept
- left_join and right_join from dplyr will join datasets on a key and keep all rows from the specified 'side' of the join, filling in the non-matching entries with NA
- full_join from dplyr will join datasets on a key and keep all rows from both datasets, filling in columns of non-matching entries with NA
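The points above can be sketched on a small example. The data frames below (students, more_students, scores) and their values are hypothetical and made up for illustration only; they are not part of the module's data. The sketch shows rbind stacking rows, cbind pairing rows purely by position (which is where data integrity can be lost), and the dplyr joins matching rows on the key id.

```r
library(dplyr)

# Hypothetical toy data: names and values are invented for illustration.
students      <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
more_students <- data.frame(id = c(4, 5), name = c("Dee", "Eli"))
scores        <- data.frame(id = c(3, 1, 4), score = c(90, 75, 60))

# rbind stacks rows; both data frames need the same columns.
all_students <- rbind(students, more_students)

# cbind glues columns together by position, not by key, so Ana (id 1)
# ends up next to the score that belongs to id 3.
misaligned <- cbind(students, score = scores$score)

# The joins match rows on the key instead.
inner <- inner_join(students, scores, by = "id")  # ids 1 and 3 only
left  <- left_join(students, scores, by = "id")   # all students; Ben's score is NA
right <- right_join(students, scores, by = "id")  # all scores; name for id 4 is NA
full  <- full_join(students, scores, by = "id")   # ids 1, 2, 3 and 4
```

Comparing misaligned with left makes the difference visible: cbind only lines up correctly when both datasets are already in the same row order, while a join does not depend on row order at all.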
1. Using flights and planes from the nycflights13 package, how many flights were flown in April 2013 by aircraft with each engine type (e.g. Turbo-fan, Turbo-jet, etc.)?
2. How many flights in flights for the carrier AA with a destination (dest) of IAH, and for the carrier DL with a destination of ATL, were late, and how many were early?
   a. Create a dataframe with the columns carrier and dest, and fill in the specific data.
   b. Use inner_join to join flights with this dataframe, specifying by = c("carrier", "dest") to result in a dataframe with rows matching the conditions, and assign it into flights_of_interest.
   c. Filter flights_of_interest and create two new datasets, one for the late arrivals (late) and the second for early arrivals (early). How many rows are in each?
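3. Using the two datasets (early and late) you created in 2, create:
   a. Boxplots for early and late using arr_delay by carrier.
   b. A new column status on each dataset, with the value "early" for early and "late" for late.
   c. A single dataset combining the two, and a boxplot of air_time by status faceted by month, with 6 columns and 2 rows (hint: look up facet_wrap). Are there any trends you notice?

library(tidyverse)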
library(nycflights13)

# 1. Number of flights by engine type
flights %>%
  # keep only flights with a recorded arrival time
  filter(month == 3 & year == 2013 & !is.na(arr_time)) %>%
  # inner_join drops flights whose tailnum has no entry in planes
  inner_join(planes, by = "tailnum") %>%
  group_by(engine) %>%
  summarise(n_flights = n())
#> # A tibble: 6 × 2
#>   engine        n_flights
#>   <chr>             <int>
#> 1 4 Cycle               3
#> 2 Reciprocating       165
#> 3 Turbo-fan         20114
#> 4 Turbo-jet          3391
#> 5 Turbo-prop            9
#> 6 Turbo-shaft          48
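
# 2. Lookup table of the carrier/destination pairs of interest
carrier_dest <- data.frame(carrier = c("AA", "DL"), dest = c("IAH", "ATL"))

# Joining on both keys keeps only the AA to IAH and DL to ATL flights
flights_of_interest <- inner_join(flights, carrier_dest, by = c("carrier", "dest"))

# filter() drops rows where arr_delay is NA; arr_delay == 0 falls in neither set
late  <- flights_of_interest %>% filter(arr_delay > 0)
early <- flights_of_interest %>% filter(arr_delay < 0)
nrow(late)
#> [1] 4419
nrow(early)
#> [1] 6088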

# 3a. Arrival delay by carrier for each dataset
ggplot(early, aes(x = carrier, y = arr_delay)) + geom_boxplot()
ggplot(late, aes(x = carrier, y = arr_delay)) + geom_boxplot()

# 3b. Label each dataset before combining
early <- early %>% mutate(status = "early")
late <- late %>% mutate(status = "late")

# 3c. Stack the two datasets and compare air time by status for each month
rbind(early, late) %>%
  ggplot(aes(x = status, y = air_time)) +
  geom_boxplot() +
  facet_wrap(vars(month), nrow = 2, ncol = 6)
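
As an aside, adding the status column and stacking the datasets can also be done in one step with dplyr's bind_rows and its .id argument. This is an alternative to the mutate plus rbind approach used above, not a replacement for it. The early and late data frames below are small hypothetical stand-ins with made-up values, so the sketch runs without nycflights13.

```r
library(dplyr)

# Hypothetical stand-ins for the early and late datasets (values invented).
early <- data.frame(carrier = c("AA", "DL"), arr_delay = c(-12, -3))
late  <- data.frame(carrier = c("AA", "DL", "DL"), arr_delay = c(8, 25, 41))

# bind_rows on a named list stacks the rows and records each row's
# list name in the column given by .id.
combined <- bind_rows(list(early = early, late = late), .id = "status")

# Count the rows carrying each label.
count(combined, status)
```

The labels come from the list names, so they cannot drift out of sync with the dataset they describe.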