Associated Material

Module: Module 05 - Transforming data

Readings

Cleaning and Tidying

Cleaning data values

  • Checking data types on data import
  • Creating consistent missing data
  • Converting characters to numeric with as.numeric

Cleaning dataframe column names

  • Using the janitor package to create cleaned dataframe column names

Tidying

Tidy data principles:

  1. Each column is a variable or property that is being measured
  2. Each row is an observation
  3. A single cell should contain a single piece of information

Reshaping data

Wide to long

  • Use pivot_longer from the tidyr package

Long to wide

  • Use pivot_wider from the tidyr package


Combining Data

How to combine datasets together

  • rows
  • columns
  • using data relationships

Binding

  • Combining rows of datasets with same column names together with rbind
  • Combining columns of datasets together with cbind
    • caution on using cbind and maintaining data integrity
    • use joins if possible

Joining

Adding additional columns onto data sets by using a common identifier (key) between datasets to maintain data relationships

Inner join

  • inner_join from dplyr will join datasets on a key and at the same time filter so that only rows where there is a matching key will be kept

Left/Right joins

  • left_join and right_join from dplyr will join datasets on a key and keep all rows from the specified ‘side’ of the join, and filling in the non-matching entries with NA

Full joins

  • full_join from dplyr will join datasets on a key and keep all rows from both datasets filling in columns of non-matching entries with NA


Excercises

  1. Using the rodents_untidy.csv, read the data in and then:
    1. use janitor to change the column names to the “small_camel” case
    2. clean the data “weight” column
      • convert “?” and -999 to NA
      • convert the column to numeric
    3. split the plotLocation column into two columns: “plot” and “location”
    4. make the “plot” column into a numeric
    5. challenge: fix the data in the species column for plot 3. The genus and species columns had accidentally been combined


  1. Using flights and planes from the nycflights13 package, how many flights were flown in April 2013 by aircraft with each engine type? e.g. Turbo-fan, Turbo-jet, etc.


  1. Using the gapminder_yearly_population_total.csv data create “long” a format dataset for
    1. New Zealand’s population from 1900 to 2000
    2. a country of your choice for the same time range


  1. Using the datasets you created in 3, create:
    1. a line plot for each
    2. combine the datasets and create a facet plot using country with 1 column and 2 rows (hint look up facet_grid)

Example solutions

library(tidyverse)
rodents <- read_csv("data/rodents_untidy.csv")
#> Rows: 41 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): Plot location, Date collected, Family, Genus, Species, Weight
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# a)
rodents <- rodents %>% janitor::clean_names(case = "small_camel")

# b)
rodents$weight[rodents$weight %in% c("-999", "?")] <- NA
rodents$weight <- as.numeric(rodents$weight)

# c)
rodents <- rodents %>% separate(plotLocation, into = c("plot", "location"), sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 7 rows [35, 36,
#> 37, 38, 39, 40, 41].

# d)
rodents$plot <- as.numeric(rodents$plot)

# e)
plot1_2 <- rodents %>% 
  filter(plot ==1 | plot == 2)

plot_3 <- rodents %>% 
  filter(plot == 3) %>% 
  select(-genus) %>% 
  separate(species, into = c("genus", "species"), sep=" ")

rodents_clean <- rbind(plot1_2, plot_3)
rodents_clean
#> # A tibble: 41 × 7
#>     plot location dateCollected family       genus     species     weight
#>    <dbl> <chr>    <chr>         <chr>        <chr>     <chr>        <dbl>
#>  1     1 slope    01/09/14      Heteromyidae Dipodomys merriami        40
#>  2     1 slope    01/09/14      Heteromyidae Dipodomys merriami        36
#>  3     1 slope    01/09/14      Heteromyidae Dipodomys spectabilis    135
#>  4     1 rocks    01/09/14      Heteromyidae Dipodomys merriami        39
#>  5     1 grass    01/20/14      Heteromyidae Dipodomys merriami        43
#>  6     1 rocks    01/20/14      Heteromyidae Dipodomys spectabilis    144
#>  7     1 rocks    03/13/14      Heteromyidae Dipodomys merriami        51
#>  8     1 fence    03/13/14      Heteromyidae Dipodomys merriami        44
#>  9     1 fence    03/13/14      Heteromyidae Dipodomys spectabilis    146
#> 10     2 rocks    01/08/14      Cricetidae   Neotoma   albigula        NA
#> # … with 31 more rows

library(tidyverse)
library(nycflights13)

flights %>% 
  filter(month == 3 & year == 2013 & !is.na(arr_time)) %>%
  inner_join(planes, by = "tailnum") %>% 
  group_by(engine) %>% 
  summarise(n_flights = n())
#> # A tibble: 6 × 2
#>   engine        n_flights
#>   <chr>             <int>
#> 1 4 Cycle               3
#> 2 Reciprocating       165
#> 3 Turbo-fan         20114
#> 4 Turbo-jet          3391
#> 5 Turbo-prop            9
#> 6 Turbo-shaft          48

library(tidyverse)
gapminder_yearly_pop <- read_csv("data/gapminder_yearly_population_millions_total.csv") %>% mutate(across(-country, ))

gapminder_yearly_pop_long <- gapminder_yearly_pop %>% pivot_longer(-country, names_to = "year", values_to = "population_millions") %>% mutate(year = as.numeric(year))

# a)
nz_1900_2000 <- gapminder_yearly_pop_long %>% 
  filter(between(year, 1900, 2000)) %>% 
  filter(country == "New Zealand")

# b)
jamaica_1900_2000 <- gapminder_yearly_pop_long %>% 
  filter(between(year, 1900, 2000)) %>% 
  filter(country == "Jamaica")

# a)
nz_1900_2000 %>% ggplot(aes(x = year, y = population_millions)) + geom_line() + labs(title = "NZ population 1900-2000")

jamaica_1900_2000 %>% ggplot(aes(x = year, y = population_millions)) + geom_line() + labs(title = "Jamaica population 1900-2000")

# b)
rbind(nz_1900_2000, jamaica_1900_2000) %>% 
  ggplot(aes(x = year, y = population_millions)) + 
  geom_line() + 
  facet_grid(rows = "country") + 
  labs(title = " NZ vs Jamaica population 1900-2000")
LS0tCnRpdGxlOiAiWm9vbSBOb3RlcyAwNTogVHJhbnNmb3JtaW5nIERhdGEiCmRhdGU6ICJTZW1lc3RlciAyLCAyMDIyIgpvdXRwdXQ6CiAgaHRtbF9kb2N1bWVudDoKICAgIHRvYzogdHJ1ZQogICAgdG9jX2Zsb2F0OiB0cnVlCiAgICB0b2NfZGVwdGg6IDMKICAgIGNvZGVfZG93bmxvYWQ6IHRydWUKICAgIGNvZGVfZm9sZGluZzogc2hvdwotLS0KCj4gIyMjIyBBc3NvY2lhdGVkIE1hdGVyaWFsCj4KPiBNb2R1bGU6IFtNb2R1bGUgMDUgLSBUcmFuc2Zvcm1pbmcgZGF0YV0oMDUtdHJhbnNmb3JtLmh0bWwpCj4KPiBSZWFkaW5ncwo+Cj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTBdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovdGliYmxlcy5odG1sKQo+IC0gW1IgZm9yIERhdGEgU2NpZW5jZSBDaGFwdGVyIDExXShodHRwczovL3I0ZHMuaGFkLmNvLm56L2RhdGEtaW1wb3J0Lmh0bWwpCj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTJdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovdGlkeS1kYXRhLmh0bWwpCj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTNdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovcmVsYXRpb25hbC1kYXRhLmh0bWwpCj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTRdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovc3RyaW5ncy5odG1sKQoKYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9CmxpYnJhcnkoa25pdHIpCmxpYnJhcnkodGlkeXZlcnNlKQoKa25pdHI6Om9wdHNfY2h1bmskc2V0KAogIGNvbW1lbnQgPSAiIz4iLAogIGZpZy5wYXRoID0gImZpZ3VyZXMvIiwgIyB1c2Ugb25seSBmb3Igc2luZ2xlIFJtZCBmaWxlcwogIGNvbGxhcHNlID0gVFJVRSwKICBlY2hvID0gVFJVRQopCgoKYGBgCgojIyBDbGVhbmluZyBhbmQgVGlkeWluZwoKIyMjIENsZWFuaW5nIGRhdGEgdmFsdWVzCgotIENoZWNraW5nIGRhdGEgdHlwZXMgb24gZGF0YSBpbXBvcnQKLSBDcmVhdGluZyBjb25zaXN0ZW50IG1pc3NpbmcgZGF0YQotIENvbnZlcnRpbmcgY2hhcmFjdGVycyB0byBudW1lcmljIHdpdGggYGFzLm51bWVyaWNgCgojIyMgQ2xlYW5pbmcgZGF0YWZyYW1lIGNvbHVtbiBuYW1lcwoKLSBVc2luZyB0aGUgYGphbml0b3JgIHBhY2thZ2UgdG8gY3JlYXRlIGNsZWFuZWQgZGF0YWZyYW1lIGNvbHVtbiBuYW1lcwoKIyMjIFRpZHlpbmcKClRpZHkgZGF0YSBwcmluY2lwbGVzOgoKMS4gRWFjaCBjb2x1bW4gaXMgYSB2YXJpYWJsZSBvciBwcm9wZXJ0eSB0aGF0IGlzIGJlaW5nIG1lYXN1cmVkIAoyLiBFYWNoIHJvdyBpcyBhbiBvYnNlcnZhdGlvbgozLiBBIHNpbmdsZSBjZWxsIHNob3VsZCBjb250YWluIGEgc2luZ2xlIHBpZWNlIG9mIGluZm9ybWF0aW9uCgojIyBSZXNoYXBpbmcgZGF0YQoKIyMjIFdpZGUgdG8gbG9uZwoKLSBVc2UgYHBpdm90X2xvbmdlcmAgZnJvbSB0aGUgYHRpZHlyYCBwYWNrYWdlCgojIyMgTG9uZyB0byB3aWRlCgotIFVzZSBgcGl2b3Rfd2lkZXJgIGZyb20gdGhlIGB0aWR5cmAgcGFja2FnZQoKXAoKIyMgQ29tYmluaW5nIERhdGEKCkhvdyB0byBjb21iaW5lIGRhdGFzZXRzIHRvZ2V0aGVyCgotIHJvd3MKLSBjb2x1bW5zCi0gdXNpbmcgZGF0YSByZWxhdGlvbnNoaXBzCgojIyMgQmluZGluZwoKLSBDb21iaW5pbmcgcm93cyBvZiBkYXRhc2V0cyB3aXRoIHNhbWUgY29sdW1uIG5hbWVzIHRvZ2V0aGVyIHdpdGggYHJiaW5kYAotIENvbWJpbmluZyBjb2x1bW5zIG9mIGRhdGFzZXRzIHRvZ2V0aGVyIHdpdGggYGNiaW5kYAogIC0gY2F1dGlvbiBvbiB1c2luZyBgY2JpbmRgIGFuZCBtYWludGFpbmluZyBkYXRhIGludGVncml0eQogIC0gdXNlIGpvaW5zIGlmIHBvc3NpYmxlCgojIyMgSm9pbmluZwoKQWRkaW5nIGFkZGl0aW9uYWwgY29sdW1ucyBvbnRvIGRhdGEgc2V0cyBieSB1c2luZyBhIGNvbW1vbiBpZGVudGlmaWVyIChrZXkpIGJldHdlZW4gZGF0YXNldHMgdG8gbWFpbnRhaW4gZGF0YSByZWxhdGlvbnNoaXBzCgojIyMjIElubmVyIGpvaW4KCi0gYGlubmVyX2pvaW5gIGZyb20gYGRwbHlyYCB3aWxsIGpvaW4gZGF0YXNldHMgb24gYSBrZXkgYW5kIGF0IHRoZSBzYW1lIHRpbWUgZmlsdGVyIHNvIHRoYXQgb25seSByb3dzIHdoZXJlIHRoZXJlIGlzIGEgbWF0Y2hpbmcga2V5IHdpbGwgYmUga2VwdAoKIyMjIyBMZWZ0L1JpZ2h0IGpvaW5zCgotIGBsZWZ0X2pvaW5gIGFuZCBgcmlnaHRfam9pbmAgZnJvbSBgZHBseXJgIHdpbGwgam9pbiBkYXRhc2V0cyBvbiBhIGtleSBhbmQga2VlcCBhbGwgcm93cyBmcm9tIHRoZSBzcGVjaWZpZWQgJ3NpZGUnIG9mIHRoZSBqb2luLCBhbmQgZmlsbGluZyBpbiB0aGUgbm9uLW1hdGNoaW5nIGVudHJpZXMgd2l0aCBgTkFgCgojIyMjIEZ1bGwgam9pbnMKCi0gYGZ1bGxfam9pbmAgZnJvbSBgZHBseXJgIHdpbGwgam9pbiBkYXRhc2V0cyBvbiBhIGtleSBhbmQga2VlcCBhbGwgcm93cyBmcm9tIGJvdGggZGF0YXNldHMgZmlsbGluZyBpbiBjb2x1bW5zIG9mIG5vbi1tYXRjaGluZyBlbnRyaWVzIHdpdGggYE5BYAoKXAoKIyMgRXhjZXJjaXNlcwoKMS4gVXNpbmcgdGhlIHJvZGVudHNfdW50aWR5LmNzdiwgcmVhZCB0aGUgZGF0YSBpbiBhbmQgdGhlbjoKICAgIGEuIHVzZSBqYW5pdG9yIHRvIGNoYW5nZSB0aGUgY29sdW1uIG5hbWVzIHRvIHRoZSAic21hbGxfY2FtZWwiIGNhc2UKICAgIGIuIGNsZWFuIHRoZSBkYXRhICJ3ZWlnaHQiIGNvbHVtbgogICAgICAgIC0gY29udmVydCAiPyIgYW5kIC05OTkgdG8gTkEKICAgICAgICAtIGNvbnZlcnQgdGhlIGNvbHVtbiB0byBudW1lcmljCiAgICBjLiBzcGxpdCB0aGUgcGxvdExvY2F0aW9uIGNvbHVtbiBpbnRvIHR3byBjb2x1bW5zOiAicGxvdCIgYW5kICJsb2NhdGlvbiIKICAgIGQuIG1ha2UgdGhlICJwbG90IiBjb2x1bW4gaW50byBhIG51bWVyaWMKICAgIGUuICoqY2hhbGxlbmdlKio6IGZpeCB0aGUgZGF0YSBpbiB0aGUgc3BlY2llcyBjb2x1bW4gZm9yIHBsb3QgMy4gVGhlIGdlbnVzIGFuZCBzcGVjaWVzIGNvbHVtbnMgaGFkIGFjY2lkZW50YWxseSBiZWVuIGNvbWJpbmVkCiAgICAKXAoKMi4gVXNpbmcgYGZsaWdodHNgIGFuZCBgcGxhbmVzYCBmcm9tIHRoZSBgbnljZmxpZ2h0czEzYCBwYWNrYWdlLCBob3cgbWFueSBmbGlnaHRzIHdlcmUgZmxvd24gaW4gQXByaWwgMjAxMyBieSBhaXJjcmFmdCB3aXRoIGVhY2ggZW5naW5lIHR5cGU/IGUuZy4gVHVyYm8tZmFuLCBUdXJiby1qZXQsIGV0Yy4KClwKCjMuIFVzaW5nIHRoZSBnYXBtaW5kZXJfeWVhcmx5X3BvcHVsYXRpb25fdG90YWwuY3N2IGRhdGEgY3JlYXRlICJsb25nIiBhIGZvcm1hdCBkYXRhc2V0IGZvciAKICAgIGEuIE5ldyBaZWFsYW5kJ3MgcG9wdWxhdGlvbiBmcm9tIDE5MDAgdG8gMjAwMAogICAgYi4gYSBjb3VudHJ5IG9mIHlvdXIgY2hvaWNlIGZvciB0aGUgc2FtZSB0aW1lIHJhbmdlCgpcCgo0LiBVc2luZyB0aGUgZGF0YXNldHMgeW91IGNyZWF0ZWQgaW4gMywgY3JlYXRlOgogICAgYS4gYSBsaW5lIHBsb3QgZm9yIGVhY2gKICAgIGIuIGNvbWJpbmUgdGhlIGRhdGFzZXRzIGFuZCBjcmVhdGUgYSBmYWNldCBwbG90IHVzaW5nIGNvdW50cnkgd2l0aCAxIGNvbHVtbiBhbmQgMiByb3dzIChoaW50IGxvb2sgdXAgYGZhY2V0X2dyaWRgKQoKIyMjIEV4YW1wbGUgc29sdXRpb25zCgoxLgoKYGBge3Igem4wNS1zb2xuMSwgY2xhc3Muc291cmNlID0gImZvbGQtaGlkZSJ9CmxpYnJhcnkodGlkeXZlcnNlKQpyb2RlbnRzIDwtIHJlYWRfY3N2KCJkYXRhL3JvZGVudHNfdW50aWR5LmNzdiIpCiMgYSkKcm9kZW50cyA8LSByb2RlbnRzICU+JSBqYW5pdG9yOjpjbGVhbl9uYW1lcyhjYXNlID0gInNtYWxsX2NhbWVsIikKCiMgYikKcm9kZW50cyR3ZWlnaHRbcm9kZW50cyR3ZWlnaHQgJWluJSBjKCItOTk5IiwgIj8iKV0gPC0gTkEKcm9kZW50cyR3ZWlnaHQgPC0gYXMubnVtZXJpYyhyb2RlbnRzJHdlaWdodCkKCiMgYykKcm9kZW50cyA8LSByb2RlbnRzICU+JSBzZXBhcmF0ZShwbG90TG9jYXRpb24sIGludG8gPSBjKCJwbG90IiwgImxvY2F0aW9uIiksIHNlcCA9ICJfIikKCiMgZCkKcm9kZW50cyRwbG90IDwtIGFzLm51bWVyaWMocm9kZW50cyRwbG90KQoKIyBlKQpwbG90MV8yIDwtIHJvZGVudHMgJT4lIAogIGZpbHRlcihwbG90ID09MSB8IHBsb3QgPT0gMikKCnBsb3RfMyA8LSByb2RlbnRzICU+JSAKICBmaWx0ZXIocGxvdCA9PSAzKSAlPiUgCiAgc2VsZWN0KC1nZW51cykgJT4lIAogIHNlcGFyYXRlKHNwZWNpZXMsIGludG8gPSBjKCJnZW51cyIsICJzcGVjaWVzIiksIHNlcD0iICIpCgpyb2RlbnRzX2NsZWFuIDwtIHJiaW5kKHBsb3QxXzIsIHBsb3RfMykKcm9kZW50c19jbGVhbgpgYGAKCi0tLQoKMi4KCmBgYHtyIHpuMDUtc29sbjIsIGNsYXNzLnNvdXJjZSA9ICJmb2xkLWhpZGUifQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShueWNmbGlnaHRzMTMpCgpmbGlnaHRzICU+JSAKICBmaWx0ZXIobW9udGggPT0gMyAmIHllYXIgPT0gMjAxMyAmICFpcy5uYShhcnJfdGltZSkpICU+JQogIGlubmVyX2pvaW4ocGxhbmVzLCBieSA9ICJ0YWlsbnVtIikgJT4lIAogIGdyb3VwX2J5KGVuZ2luZSkgJT4lIAogIHN1bW1hcmlzZShuX2ZsaWdodHMgPSBuKCkpCmBgYAoKLS0tCgozLgoKYGBge3Igem4wNS1zb2xuMywgY2xhc3Muc291cmNlID0gImZvbGQtaGlkZSIsIGV2YWwgPSBGQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmdhcG1pbmRlcl95ZWFybHlfcG9wIDwtIHJlYWRfY3N2KCJkYXRhL2dhcG1pbmRlcl95ZWFybHlfcG9wdWxhdGlvbl9taWxsaW9uc190b3RhbC5jc3YiKSAlPiUgbXV0YXRlKGFjcm9zcygtY291bnRyeSwgKSkKCmdhcG1pbmRlcl95ZWFybHlfcG9wX2xvbmcgPC0gZ2FwbWluZGVyX3llYXJseV9wb3AgJT4lIHBpdm90X2xvbmdlcigtY291bnRyeSwgbmFtZXNfdG8gPSAieWVhciIsIHZhbHVlc190byA9ICJwb3B1bGF0aW9uX21pbGxpb25zIikgJT4lIG11dGF0ZSh5ZWFyID0gYXMubnVtZXJpYyh5ZWFyKSkKCiMgYSkKbnpfMTkwMF8yMDAwIDwtIGdhcG1pbmRlcl95ZWFybHlfcG9wX2xvbmcgJT4lIAogIGZpbHRlcihiZXR3ZWVuKHllYXIsIDE5MDAsIDIwMDApKSAlPiUgCiAgZmlsdGVyKGNvdW50cnkgPT0gIk5ldyBaZWFsYW5kIikKCiMgYikKamFtYWljYV8xOTAwXzIwMDAgPC0gZ2FwbWluZGVyX3llYXJseV9wb3BfbG9uZyAlPiUgCiAgZmlsdGVyKGJldHdlZW4oeWVhciwgMTkwMCwgMjAwMCkpICU+JSAKICBmaWx0ZXIoY291bnRyeSA9PSAiSmFtYWljYSIpCmBgYAoKLS0tCgo0LgoKYGBge3Igem4wNS1zb2xuNCwgY2xhc3Muc291cmNlID0gImZvbGQtaGlkZSIsIGV2YWwgPSBGQUxTRX0KIyBhKQpuel8xOTAwXzIwMDAgJT4lIGdncGxvdChhZXMoeCA9IHllYXIsIHkgPSBwb3B1bGF0aW9uX21pbGxpb25zKSkgKyBnZW9tX2xpbmUoKSArIGxhYnModGl0bGUgPSAiTlogcG9wdWxhdGlvbiAxOTAwLTIwMDAiKQoKamFtYWljYV8xOTAwXzIwMDAgJT4lIGdncGxvdChhZXMoeCA9IHllYXIsIHkgPSBwb3B1bGF0aW9uX21pbGxpb25zKSkgKyBnZW9tX2xpbmUoKSArIGxhYnModGl0bGUgPSAiSmFtYWljYSBwb3B1bGF0aW9uIDE5MDAtMjAwMCIpCgojIGIpCnJiaW5kKG56XzE5MDBfMjAwMCwgamFtYWljYV8xOTAwXzIwMDApICU+JSAKICBnZ3Bsb3QoYWVzKHggPSB5ZWFyLCB5ID0gcG9wdWxhdGlvbl9taWxsaW9ucykpICsgCiAgZ2VvbV9saW5lKCkgKyAKICBmYWNldF9ncmlkKHJvd3MgPSAiY291bnRyeSIpICsgCiAgbGFicyh0aXRsZSA9ICIgTlogdnMgSmFtYWljYSBwb3B1bGF0aW9uIDE5MDAtMjAwMCIpCmBgYAoK