Associated material
Module: Module 01 - Introducing R and Rstudio
Readings:
This course is designed with two components - the module and the zoom notes. The module content provides more explanation and background so that you can work through it on your own if you wish. The zoom notes content provides the structure for the content that will be delivered in the online sessions. Both sets of content are complementary to each other.
At the top right hand corner of both the module and zoom notes pages is a “code” button. This can be used to toggle seeing/hiding the code or to download the code that that created the page as an Rmarkdown file (we cover Rmarkdown more in-depth in module/zoom notes 6).
R is the language, RStudio is an interface for interacting with R.
RStudio is what is known as an integrated development environment (IDE) which provides a graphical interface (among other things) for interacting with R. The main environment of RStudio is comprised of panels.
Panels of RStudio
Before we get under way we want to set up our environment to help enforce some behaviours that encourage reproducibility.
We want to disable a few default settings
To do this select the Tools
menu ->
Global options
, then on the “General” tab under the
“Workspace” heading:
Projects within RStudio provide mechanism to encapsulate a project - be it particular data analysis you’re working on, an R package you’re making (Modules/Handouts 7 and 8), or even a book (Module/Handout 6).
A project is a directory on your computer that you want to put all your code, documentation, data, and associated outputs so that everything needed to recreate an analysis is “bundled together”.
We’ll create a project for the course and create a common project structure:
Now we have our new project we’ll create some directories to store various aspects of our project. We’ll create the following directories:
You can do this either using the “New Folder” button in the Files panel or using the code below.
dir.create("scripts")
#> Warning in dir.create("scripts"): 'scripts' already exists
dir.create("data")
#> Warning in dir.create("data"): 'data' already exists
dir.create("data")
#> Warning in dir.create("data"): 'data' already exists
dir.create("outputs")
#> Warning in dir.create("outputs"): 'outputs' already exists
Scripts in R are a way in which we can document the steps we take through our analysis and can be used to recreate what we have done. There are two main forms that scripts come in:
R Script - plain text document that contains R commands
RMarkdown - a plain text document that has text areas and code areas which is processed/compiled thought to a output format such as html or pdf which contains formatted text, code plus the results of the code embedded in the document. We’ll cover this in more depth Module/Handout 6.
General syntax info:
'
and double
"
quotes but you have to use the same type to close as you
open with
addition: +
subtraction: -
multiplication: *
division: /
exponent: **
modulo/remainder: %%
# addition
1 + 2
#> [1] 3
# subtraction
3 - 6
#> [1] -3
# multiplication
4 * 2
#> [1] 8
# division
12 / 3
#> [1] 4
# exponent
2 ** 5
#> [1] 32
# modulo
5 %% 2
#> [1] 1
The #
symbol is used to denote a ‘comment’ and R will
ignore everything to the right of it. This lets us add comments to our
code to describe the what and why of the code being run.
In R there are 3 main types of data you’ll likely use in day-to-day life, these are:
The numeric type covers is your numbers both “double” (decimal), and
integers (defined by a integer number followed by L
e.g. 2L
). These can be typed straight in as we have been
doing and can have mathematical operations performed on them.
The character type is any combination of characters zero or more that
is represented on the keyboard and includes letters, numbers and
symbols. Characters are defined in R by enclosing them in quotation
marks (either single or double, e.g. "some words"
).
Characters can’t have mathematical operations performed on them.
Booleans are the logical data type, they consist of TRUE
and FALSE
and are used for logical operations.
Within R there are other types such as complex, and raw but they are less common
Variables are named objects that we can store data in that we can reference/use later in our code.
The main rules for naming variables - Start with a letter - No symbols (except . or _)
It’s also highly recommended to use variable names that are descriptive of the contents. This will help you remember and keep track of things easier later on.
<-
and =
are both used for assignment
within R. <-
is the older/more commonly used for
assigning to a variable, =
is the only one that you use
inside a function call to assign the parameters/arguments.
In terms of priority of operators, assignment happens last, and takes the result of evaluating the right hand side and stores it into the object of named on the left hand side.
Another key thing about variables is they only contain the last thing that was assigned to them.
Functions in R let us perform operations. Functions in R are really R
code stored in a variable that does stuff. In order to “do the stuff” we
need call the variable with parentheses ()
on the end.
Inside the parentheses we can pass values for the parameters/arguments
which let us change the behaviour of the function.
For instance, if we want to find the square root of a number, we can
use the sqrt
function and provide the number we want the
square root for as an argument:
sqrt(64)
#> [1] 8
To supply multiple arguments or parameters to a function we separate them by a comma. Each argument also has a name. Arguments are filled either by name, or based on the position.
# by position
round(3.142, 1)
#> [1] 3.1
# by name
round(x = 3.142, digits = 1)
#> [1] 3.1
To find out what the names of the arguments a function takes we can
run the args
function on the function name:
args(round)
#> function (x, digits = 0)
#> NULL
Being able to find things out for yourself is extremely important and within R there are a couple way this can be done.
help()
or ?
??
There is usually pretty good documentation that goes along side
functions and this can be accessed through the inbuilt help system using
help()
, ?
, or ??
.
help()
and ?
are equivalent and will find
the manual page for the exact function, e.g. to find the manual page for
the mean
function: help("mean")
or
?mean
. The most useful sections of the manual pages are
“Usage”, “Arguments”, and “Examples” which tell you how to run the
function, what the arguments are supposed to be, and gives some
‘copy-paste’ examples to show you how it’s run.
??
on the other hand will bring up a list of all the
manual pages that mention the word.
Google is also extremely useful and my most common search is “how to … in R”.
One final set of syntax operators to cover are the comparatives.
These are used to create logical statements based on comparisons that
will evaluate to be either TRUE
or FALSE
.
Comparators:
==
!=
>
/
>=
<
/
<=
Logical operators
!
|
&
We’ll cover these in more depth in the Selecting and Filtering Module/Handout.
Until now we’ve been operating on single items at a time. R has way of storing multiple pieces of information - the vector.
A vector is zero or more entries of the same data type.
To create a vector of more than one piece of information we can use
the combine function, which because it is used so frequently in
R, has been reduced to be c()
. Each item is comma
separated.
some_letters <- c("a", "b", "c")
some_letters
#> [1] "a" "b" "c"
some_numbers <- c(2, 4, 6)
some_numbers
#> [1] 2 4 6
We can see what the data type is of a vector using
typeof
typeof(some_letters)
#> [1] "character"
typeof(some_numbers)
#> [1] "double"
The final syntax we’ll introduce this week is []
which
enables us to extract items out of a vector.
Each item has a positional index (starting from 1 - other languages
differ). We can use this index as an argument to []
to pull
out the specified item. If we want multiple items we supply a vector of
numbers for the indexes we want
some_numbers
#> [1] 2 4 6
# pull out the second item
some_numbers[2]
#> [1] 4
some_letters
#> [1] "a" "b" "c"
# pull out items 1 and 3
some_letters[c(1,3)]
#> [1] "a" "c"
In the Selecting and Filtering Data module we will cover this more and expand to 2 dimensions.
Create and assign a vector of 5 numbers.
Find out the length
of the vector.
Divide the entire vector by 2 and store the result into a variable called div_2 - what is the result of the division?
Calculate the minimum, maximum, mean, and standard deviation for the div_2 vector. Can you round the results to 2 decimal places?
Create and assign a vector of at least 4 animal names into animals.
Can you find a way to tell you the number of characters per item?
Into a new variable, extract the first and fourth animal.
Remove the third animal from your original animals vector (what does using a negative index do?).
Create a vector that has three copies of this updated animals vector.
Combine your animal and number vectors together into a new
variable called “coerced”. Run typeof
on this
vector - how does it compare to the types of the original
vector?
my_numbers <- c(12, 63, 3, 7, 84)
length(my_numbers)
#> [1] 5
div_2 <- my_numbers / 2
div_2
#> [1] 6.0 31.5 1.5 3.5 42.0
# minimum
round(min(div_2), digits = 2)
#> [1] 1.5
# maximum
round(max(div_2), digits = 2)
#> [1] 42
# mean
round(mean(div_2), digits = 2)
#> [1] 16.9
# standard deviation
round(sd(div_2), digits = 2)
#> [1] 18.57
animals <- c("lion", "tiger", "snake", "beetle", "turtle")
animals
#> [1] "lion" "tiger" "snake" "beetle" "turtle"
nchar(animals)
#> [1] 4 5 5 6 6
two_animals <- animals[c(1,4)]
two_animals
#> [1] "lion" "beetle"
animals <- animals[-3]
animals
#> [1] "lion" "tiger" "beetle" "turtle"
animals3 <- c(animals, animals, animals)
animals3
#> [1] "lion" "tiger" "beetle" "turtle" "lion" "tiger" "beetle" "turtle"
#> [9] "lion" "tiger" "beetle" "turtle"
my_numbers
#> [1] 12 63 3 7 84
typeof(my_numbers)
#> [1] "double"
animals
#> [1] "lion" "tiger" "beetle" "turtle"
typeof(animals)
#> [1] "character"
combined <- c(my_numbers, animals)
typeof(combined)
#> [1] "character"
combined
#> [1] "12" "63" "3" "7" "84" "lion" "tiger" "beetle"
#> [9] "turtle"
# the numbers have now been coerced to be a character type so that
# all elements of the vector are the same data type