Data Focus

Modules and Zoom Notes 2–7 cover interrelated topics along the data analysis journey. Each module introduces new content and expands on material covered in previous modules. For example, Module 2 introduces the basics of the R plotting systems, and subsequent sessions cover more advanced plotting options and techniques.



Associated material:

Module: Module 02 - Visualising Data

Readings:

Topics

Tabular data

  • Edit in Excel
  • Export as plain CSV (not the “CSV UTF-8” option)
  • Column headers in first row for every column
  • One row for each data record

Folder setup

  • Eventually, use Projects to allow RStudio to manage metadata
  • Separate folders for data files, images, scripts, Rmd, etc.
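
One possible way to create such a layout from within R is sketched below. The project name my_project and the folder names are hypothetical, and the example builds everything under a temporary directory so that it does not touch your real files; in practice you would create the folders inside your own project directory.

```r
# Hypothetical project root, placed in a temporary directory for this demo
project_root <- file.path(tempdir(), "my_project")

# Hypothetical folder names: one folder per kind of file
folders <- c("data", "images", "scripts", "rmd")

# Create each folder (recursive = TRUE also creates the project root)
for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_root)
```

Building paths with file.path keeps scripts portable between Windows, macOS and Linux.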

Importing a data file

  • read.csv
  • Set stringsAsFactors = TRUE to import text columns as factors (categorical variables); since R 4.0.0 the default is FALSE
  • After we meet the tidyverse, can also use read_csv
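
The import step can be sketched as follows. The file contents and the column names id, group and score are made up for illustration. The example writes its own small CSV to a temporary file so that it runs anywhere; with real data you would pass the path to your own file instead.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
# Note the layout: column headers in the first row, one row per data record.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,score",
  "1,control,4.2",
  "2,treatment,5.1",
  "3,control,3.9"
), csv_path)

# Import the file; text columns such as group become factors
scores <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect the structure of the imported data frame
str(scores)
```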

Checking your imported data

  • head
  • tail
  • str
  • Confirm that each column is of the correct type
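
These checks might look like the sketch below. It uses mtcars, a data frame that ships with R, as a stand-in for your own imported data. The sapply call is one convenient way to list column types and is an addition to the functions named above.

```r
# mtcars is a built-in data frame, used here in place of imported data
head(mtcars)          # first six rows
tail(mtcars, n = 3)   # last three rows
str(mtcars)           # compact summary: dimensions, column names and types

# One way to list the type of every column at a glance
sapply(mtcars, class)
```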

Selecting and using columns of data

  • Select an individual column with the $ operator
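
A minimal illustration of the $ operator, again using the built-in mtcars data frame as a stand-in for your own data:

```r
# Extract the mpg column from the mtcars data frame as a plain vector
mpg <- mtcars$mpg

# The extracted column can be passed to any function that accepts a vector
length(mpg)   # number of observations
mean(mpg)     # average of the column
```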

Base R plots

  • hist for frequency distributions
  • boxplot for central tendency and variability
  • Formulas have the form dependent variable ~ independent variable(s), e.g. y ~ group
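
The two base R plot types above can be sketched as follows, using the built-in mtcars data frame. The titles and axis labels are illustrative choices only.

```r
# Histogram: frequency distribution of a single numeric column
hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")

# Boxplot from a formula: mpg (dependent variable) grouped by cyl (independent variable)
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon")
```

In the formula mpg ~ cyl, the variable to the left of ~ is plotted on the vertical axis, and one box is drawn for each distinct value of the variable on the right.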

Plotting with ggplot

  • Use function ggplot contained in library (and package) ggplot2
  • Complex syntax based on Grammar of Graphics (from computer science)
  • Plots built in layers

Building a graph

  • All plots begin with call to ggplot, passing in a data frame
  • Mappings define relationships between elements of the data and visual features on the plot
  • Use function aes to define a mapping
  • Assign column names to aes arguments x and y to define graph axes
  • Many available arguments to aes; part of the ggplot syntax
  • Select a geometry to determine the kind of plot (e.g. bar graph, scatterplot, line graph, etc.)
  • Additional layers define axis labels, title, legends, and fonts
  • Combine ggplot layer sub-commands with +
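
The steps above can be put together as in the sketch below. It assumes the ggplot2 package is already installed and uses the built-in mtcars data frame. The title and axis labels are illustrative choices.

```r
library(ggplot2)

# Start with ggplot(), passing in a data frame and a mapping built with aes()
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # The geometry determines the kind of plot; geom_point() gives a scatterplot
  geom_point() +
  # An additional layer sets the title and axis labels
  labs(title = "Fuel economy versus weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

# Printing the plot object draws it
print(p)
```

Each + adds one layer to the plot, so a graph can be built up and checked one step at a time.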

Practice Exercises

To practise what we have learned in Module 02, we will use “Palmer Penguins”, a real data set from the Palmer Station Long Term Ecological Research program (https://allisonhorst.github.io/palmerpenguins/articles/intro.html). These data are size measurements for three penguin species – Chinstrap, Gentoo and Adelie – on three islands in the Palmer Archipelago, Antarctica.

Install the package that contains the data (code shown below). Then work through each of the exercises. If you have any questions, email us or send us a message in MS Teams.

Access the data as shown below. These commands make available an object called penguins, which is a tibble, an enhanced data frame. The additional features of tibbles will be discussed during the next module. For these exercises, simply treat penguins as an ordinary data frame.


# Install the package (do once per computer)
install.packages("palmerpenguins")
# Load the library (do at the start of every RStudio session)
library(palmerpenguins)

# Check the data - the data frame name is penguins
str(penguins)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#>  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

The output from str(penguins) indicates that three of the columns in the data frame are Factors. In R, a factor is a categorical variable, usually corresponding to an experimental factor. Although factors look like strings, a factor is restricted to a specific set of legal values, which R infers when the data are loaded. The legal values are called levels, and correspond to the different groups or conditions represented by the factor. For example, column penguins$sex is a factor with levels “female” and “male”.
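
The restriction to a fixed set of legal values can be seen in a small sketch. The treatment variable and its values are made up for illustration.

```r
# A made-up treatment variable, stored as a factor
treatment <- factor(c("control", "drug", "control", "drug"))
print(treatment)   # the printout lists the values, followed by the legal levels

# Assigning a value that is not a legal level does not add a new level:
# R issues a warning and stores NA (missing) instead
treatment[2] <- "placebo"
print(treatment)
```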

When our data sets have factors, we often use the functions levels and table. Use Google or your favourite textbook to explore these functions, then use them to solve the next two exercises.

  1. What are the three different levels of the species factor? What are the three different levels of the island factor?

  2. How many observations are there in the data frame for each of the three species? How many observations are there in the data frame for each of the three islands?

  3. Using base R, generate a histogram showing the distribution of body mass, collapsed across island, species and sex. How would you describe the distribution?

  4. Using ggplot, generate a scatterplot illustrating the relationship between bill length and body mass, collapsed across species, island and sex. Remember to load the library with library(ggplot2) before first use. How would you describe the pattern?

  5. Modify your plot from Exercise 4 so that penguins from the different islands are drawn in different colours. Which island seems to have the heaviest penguins? Without looking any further at the data, formulate at least two possible explanations for the pattern.

  6. Using ggplot, generate a boxplot comparing body mass for the three different species of penguin, with each of the three boxes drawn in a different colour. What information is missing from this figure that was provided in Exercise 5? What information is easier to see in this figure than in Exercise 5?

  7. Using ggplot, duplicate this figure. You will need to research geometry function geom_bar.


