Associated Material

Module: Module 06 - Tidying data

Readings

Cleaning and Tidying

Cleaning data values

  • Checking data types on data import
  • Creating consistent missing data
  • Converting characters to numeric with as.numeric

Cleaning dataframe column names

  • Using the janitor package to create cleaned dataframe column names

Tidying

Tidy data principles:

  1. Each column is a variable or property that is being measured
  2. Each row is an observation
  3. A single cell should contain a single piece of information

Reshaping data

Wide to long

  • Use pivot_longer from the tidyr package

Long to wide

  • Use pivot_wider from the tidyr package


Excercises

  1. Using the rodents_untidy.csv, read the data in and then:
    1. use janitor to change the column names to the “small_camel” case
    2. clean the data “weight” column
      • convert “?” and -999 to NA
      • convert the column to numeric
    3. split the plotLocation column into two columns: “plot” and “location”
    4. make the “plot” column into a numeric
    5. challenge: fix the data in the species column for plot 3. The genus and species columns had accidentally been combined


  1. Using the gapminder_yearly_population_total.csv data create “long” a format dataset for
    1. New Zealand’s population from 1900 to 2000
    2. a country of your choice for the same time range

Example solutions

library(tidyverse)
rodents <- read_csv("data/rodents_untidy.csv")
#> Rows: 41 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): Plot location, Date collected, Family, Genus, Species, Weight
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# a)
rodents <- rodents %>% janitor::clean_names(case = "small_camel")

# b)
rodents$weight[rodents$weight %in% c("-999", "?")] <- NA
rodents$weight <- as.numeric(rodents$weight)

# c)
rodents <- rodents %>% separate(plotLocation, into = c("plot", "location"), sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 7 rows [35, 36, 37, 38,
#> 39, 40, 41].

# d)
rodents$plot <- as.numeric(rodents$plot)

# e)
plot1_2 <- rodents %>% 
  filter(plot ==1 | plot == 2)

plot_3 <- rodents %>% 
  filter(plot == 3) %>% 
  select(-genus) %>% 
  separate(species, into = c("genus", "species"), sep=" ")

rodents_clean <- rbind(plot1_2, plot_3)
rodents_clean
#> # A tibble: 41 × 7
#>     plot location dateCollected family       genus     species     weight
#>    <dbl> <chr>    <chr>         <chr>        <chr>     <chr>        <dbl>
#>  1     1 slope    01/09/14      Heteromyidae Dipodomys merriami        40
#>  2     1 slope    01/09/14      Heteromyidae Dipodomys merriami        36
#>  3     1 slope    01/09/14      Heteromyidae Dipodomys spectabilis    135
#>  4     1 rocks    01/09/14      Heteromyidae Dipodomys merriami        39
#>  5     1 grass    01/20/14      Heteromyidae Dipodomys merriami        43
#>  6     1 rocks    01/20/14      Heteromyidae Dipodomys spectabilis    144
#>  7     1 rocks    03/13/14      Heteromyidae Dipodomys merriami        51
#>  8     1 fence    03/13/14      Heteromyidae Dipodomys merriami        44
#>  9     1 fence    03/13/14      Heteromyidae Dipodomys spectabilis    146
#> 10     2 rocks    01/08/14      Cricetidae   Neotoma   albigula        NA
#> # ℹ 31 more rows

library(tidyverse)
gapminder_yearly_pop <- read_csv("data/gapminder_yearly_population_millions_total.csv") %>% mutate(across(-country, ))

gapminder_yearly_pop_long <- gapminder_yearly_pop %>% pivot_longer(-country, names_to = "year", values_to = "population_millions") %>% mutate(year = as.numeric(year))

# a)
nz_1900_2000 <- gapminder_yearly_pop_long %>% 
  filter(between(year, 1900, 2000)) %>% 
  filter(country == "New Zealand")

# b)
jamaica_1900_2000 <- gapminder_yearly_pop_long %>% 
  filter(between(year, 1900, 2000)) %>% 
  filter(country == "Jamaica")

LS0tCnRpdGxlOiAiWm9vbSBOb3RlcyAwNjogVGlkeWluZyBEYXRhIgpkYXRlOiAiU2VtZXN0ZXIgMiwgMjAyMyIKb3V0cHV0OgogIGh0bWxfZG9jdW1lbnQ6CiAgICB0b2M6IHRydWUKICAgIHRvY19mbG9hdDogdHJ1ZQogICAgdG9jX2RlcHRoOiAzCiAgICBjb2RlX2Rvd25sb2FkOiB0cnVlCiAgICBjb2RlX2ZvbGRpbmc6IHNob3cKLS0tCgo+ICMjIyMgQXNzb2NpYXRlZCBNYXRlcmlhbAo+Cj4gTW9kdWxlOiBbTW9kdWxlIDA2IC0gVGlkeWluZyBkYXRhXSgwNi10aWR5aW5nLmh0bWwpCj4KPiBSZWFkaW5ncwo+Cj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTBdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovdGliYmxlcy5odG1sKQo+IC0gW1IgZm9yIERhdGEgU2NpZW5jZSBDaGFwdGVyIDExXShodHRwczovL3I0ZHMuaGFkLmNvLm56L2RhdGEtaW1wb3J0Lmh0bWwpCj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTJdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovdGlkeS1kYXRhLmh0bWwpCj4gLSBbUiBmb3IgRGF0YSBTY2llbmNlIENoYXB0ZXIgMTRdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovc3RyaW5ncy5odG1sKQoKYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9CmxpYnJhcnkoa25pdHIpCmxpYnJhcnkodGlkeXZlcnNlKQoKa25pdHI6Om9wdHNfY2h1bmskc2V0KAogIGNvbW1lbnQgPSAiIz4iLAogIGZpZy5wYXRoID0gImZpZ3VyZXMvMDYvIiwgIyB1c2Ugb25seSBmb3Igc2luZ2xlIFJtZCBmaWxlcwogIGNvbGxhcHNlID0gVFJVRSwKICBlY2hvID0gVFJVRQopCgoKYGBgCgojIyBDbGVhbmluZyBhbmQgVGlkeWluZwoKIyMjIENsZWFuaW5nIGRhdGEgdmFsdWVzCgotIENoZWNraW5nIGRhdGEgdHlwZXMgb24gZGF0YSBpbXBvcnQKLSBDcmVhdGluZyBjb25zaXN0ZW50IG1pc3NpbmcgZGF0YQotIENvbnZlcnRpbmcgY2hhcmFjdGVycyB0byBudW1lcmljIHdpdGggYGFzLm51bWVyaWNgCgojIyMgQ2xlYW5pbmcgZGF0YWZyYW1lIGNvbHVtbiBuYW1lcwoKLSBVc2luZyB0aGUgYGphbml0b3JgIHBhY2thZ2UgdG8gY3JlYXRlIGNsZWFuZWQgZGF0YWZyYW1lIGNvbHVtbiBuYW1lcwoKIyMjIFRpZHlpbmcKClRpZHkgZGF0YSBwcmluY2lwbGVzOgoKMS4gRWFjaCBjb2x1bW4gaXMgYSB2YXJpYWJsZSBvciBwcm9wZXJ0eSB0aGF0IGlzIGJlaW5nIG1lYXN1cmVkIAoyLiBFYWNoIHJvdyBpcyBhbiBvYnNlcnZhdGlvbgozLiBBIHNpbmdsZSBjZWxsIHNob3VsZCBjb250YWluIGEgc2luZ2xlIHBpZWNlIG9mIGluZm9ybWF0aW9uCgojIyBSZXNoYXBpbmcgZGF0YQoKIyMjIFdpZGUgdG8gbG9uZwoKLSBVc2UgYHBpdm90X2xvbmdlcmAgZnJvbSB0aGUgYHRpZHlyYCBwYWNrYWdlCgojIyMgTG9uZyB0byB3aWRlCgotIFVzZSBgcGl2b3Rfd2lkZXJgIGZyb20gdGhlIGB0aWR5cmAgcGFja2FnZQoKXAoKIyMgRXhjZXJjaXNlcwoKMS4gVXNpbmcgdGhlIHJvZGVudHNfdW50aWR5LmNzdiwgcmVhZCB0aGUgZGF0YSBpbiBhbmQgdGhlbjoKICAgIGEuIHVzZSBqYW5pdG9yIHRvIGNoYW5nZSB0aGUgY29sdW1uIG5hbWVzIHRvIHRoZSAic21hbGxfY2FtZWwiIGNhc2UKICAgIGIuIGNsZWFuIHRoZSBkYXRhICJ3ZWlnaHQiIGNvbHVtbgogICAgICAgIC0gY29udmVydCAiPyIgYW5kIC05OTkgdG8gTkEKICAgICAgICAtIGNvbnZlcnQgdGhlIGNvbHVtbiB0byBudW1lcmljCiAgICBjLiBzcGxpdCB0aGUgcGxvdExvY2F0aW9uIGNvbHVtbiBpbnRvIHR3byBjb2x1bW5zOiAicGxvdCIgYW5kICJsb2NhdGlvbiIKICAgIGQuIG1ha2UgdGhlICJwbG90IiBjb2x1bW4gaW50byBhIG51bWVyaWMKICAgIGUuICoqY2hhbGxlbmdlKio6IGZpeCB0aGUgZGF0YSBpbiB0aGUgc3BlY2llcyBjb2x1bW4gZm9yIHBsb3QgMy4gVGhlIGdlbnVzIGFuZCBzcGVjaWVzIGNvbHVtbnMgaGFkIGFjY2lkZW50YWxseSBiZWVuIGNvbWJpbmVkCiAgICAKXAoKMi4gVXNpbmcgdGhlIGdhcG1pbmRlcl95ZWFybHlfcG9wdWxhdGlvbl90b3RhbC5jc3YgZGF0YSBjcmVhdGUgImxvbmciIGEgZm9ybWF0IGRhdGFzZXQgZm9yIAogICAgYS4gTmV3IFplYWxhbmQncyBwb3B1bGF0aW9uIGZyb20gMTkwMCB0byAyMDAwCiAgICBiLiBhIGNvdW50cnkgb2YgeW91ciBjaG9pY2UgZm9yIHRoZSBzYW1lIHRpbWUgcmFuZ2UKCiMjIyBFeGFtcGxlIHNvbHV0aW9ucwoKMS4KCmBgYHtyIHpuMDYtc29sbjEsIGNsYXNzLnNvdXJjZSA9ICJmb2xkLWhpZGUifQpsaWJyYXJ5KHRpZHl2ZXJzZSkKcm9kZW50cyA8LSByZWFkX2NzdigiZGF0YS9yb2RlbnRzX3VudGlkeS5jc3YiKQojIGEpCnJvZGVudHMgPC0gcm9kZW50cyAlPiUgamFuaXRvcjo6Y2xlYW5fbmFtZXMoY2FzZSA9ICJzbWFsbF9jYW1lbCIpCgojIGIpCnJvZGVudHMkd2VpZ2h0W3JvZGVudHMkd2VpZ2h0ICVpbiUgYygiLTk5OSIsICI/IildIDwtIE5BCnJvZGVudHMkd2VpZ2h0IDwtIGFzLm51bWVyaWMocm9kZW50cyR3ZWlnaHQpCgojIGMpCnJvZGVudHMgPC0gcm9kZW50cyAlPiUgc2VwYXJhdGUocGxvdExvY2F0aW9uLCBpbnRvID0gYygicGxvdCIsICJsb2NhdGlvbiIpLCBzZXAgPSAiXyIpCgojIGQpCnJvZGVudHMkcGxvdCA8LSBhcy5udW1lcmljKHJvZGVudHMkcGxvdCkKCiMgZSkKcGxvdDFfMiA8LSByb2RlbnRzICU+JSAKICBmaWx0ZXIocGxvdCA9PTEgfCBwbG90ID09IDIpCgpwbG90XzMgPC0gcm9kZW50cyAlPiUgCiAgZmlsdGVyKHBsb3QgPT0gMykgJT4lIAogIHNlbGVjdCgtZ2VudXMpICU+JSAKICBzZXBhcmF0ZShzcGVjaWVzLCBpbnRvID0gYygiZ2VudXMiLCAic3BlY2llcyIpLCBzZXA9IiAiKQoKcm9kZW50c19jbGVhbiA8LSByYmluZChwbG90MV8yLCBwbG90XzMpCnJvZGVudHNfY2xlYW4KYGBgCgotLS0KCgoKMi4KCmBgYHtyIHpuMDYtc29sbjIsIGNsYXNzLnNvdXJjZSA9ICJmb2xkLWhpZGUiLCBldmFsID0gRkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpnYXBtaW5kZXJfeWVhcmx5X3BvcCA8LSByZWFkX2NzdigiZGF0YS9nYXBtaW5kZXJfeWVhcmx5X3BvcHVsYXRpb25fbWlsbGlvbnNfdG90YWwuY3N2IikgJT4lIG11dGF0ZShhY3Jvc3MoLWNvdW50cnksICkpCgpnYXBtaW5kZXJfeWVhcmx5X3BvcF9sb25nIDwtIGdhcG1pbmRlcl95ZWFybHlfcG9wICU+JSBwaXZvdF9sb25nZXIoLWNvdW50cnksIG5hbWVzX3RvID0gInllYXIiLCB2YWx1ZXNfdG8gPSAicG9wdWxhdGlvbl9taWxsaW9ucyIpICU+JSBtdXRhdGUoeWVhciA9IGFzLm51bWVyaWMoeWVhcikpCgojIGEpCm56XzE5MDBfMjAwMCA8LSBnYXBtaW5kZXJfeWVhcmx5X3BvcF9sb25nICU+JSAKICBmaWx0ZXIoYmV0d2Vlbih5ZWFyLCAxOTAwLCAyMDAwKSkgJT4lIAogIGZpbHRlcihjb3VudHJ5ID09ICJOZXcgWmVhbGFuZCIpCgojIGIpCmphbWFpY2FfMTkwMF8yMDAwIDwtIGdhcG1pbmRlcl95ZWFybHlfcG9wX2xvbmcgJT4lIAogIGZpbHRlcihiZXR3ZWVuKHllYXIsIDE5MDAsIDIwMDApKSAlPiUgCiAgZmlsdGVyKGNvdW50cnkgPT0gIkphbWFpY2EiKQpgYGAKCi0tLQoKCgo=