Associated Material
Zoom notes: Zoom notes 07 - Functions
and Choices
Readings:
Introduction
As the work you do in R becomes more complex, your R scripts will get
longer, and may become difficult to manage. You may find that it is
difficult to locate particular bits of code, or that you seem to be
writing the same, or very similar bits of code, over and over again. If
you want to modify those bits, there are multiple place in the script
you have to edit, typos start to creep in, and the whole process becomes
unpleasant.
To prevent this, large programs need to be organised into
logical modules. Sections of code that perform a
clearly delineated task can be enacapsulated and named. The encapsulated
code can be invoked simply by typing the name – no need to cut and
paste, no need to modify multiple copies. Scripts are shorter and easier
to manage. Code that is organised this way is said to be
modular. Modular code is better code.
In R, the main code module is the function. We have
already used many functions, like read.csv
and
sqrt
and ggplot
. We invoke them in our code as
single commands, but each of these commands actually ecapsulates many,
many (add another “many” for ggplot) lines of code. We call the function
by name, and all the associated code is executed.
Many functions in R will display their code in the console. Type the
function name without the () into the console, and the
encapsulated commands will be shown. (Look at the code for function
filter
.)
To make our own code modular, we can define our own functions. We
write the code for a function somewhere in our script file. We can then
call the function by name anywhere else in our script
file, and all the code is executed. Just as with built-in functions, we
can pass data into our function, and it can operate on those data. We
can also arrange for our function to return a result
that we can store in a variable.
Function Declaration Syntax
A function in R is comprised of four parts:
- a name
- the body (the code that does something)
- (optionally) input data
- (optionally) an output that can be stored
Creating a function
We will begin with the simplest function, one that accepts no input
data, and returns no result. We declare the function
using keyword function
. Schematically, a user-defined
function is created as:
name_of_function <- function()
{
Code Body
}
The curly braces enclose the code body.
Subesquently we call the function as:
name_of_function()
When the call is executed, all the commands in the code body are
run.
For example, in the current public health situation, we often see
discussions of what body temperature constitutes a fever. We might want
to be able to translate that temperature from Celsius to Fahrenheit (as
used in the North American literature). This is a logically
well-delineated computation, so we would encapsulate it in a
function.
# Declare/define the function
fever_in_fahrenheit <- function()
{
fever_in_celsius <- 37.5
converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
print(converted_to_fahrenheit)
}
# Call a user-defined function by name
fever_in_fahrenheit()
#> [1] 99.5
Some things to note:
- Typing out the function declaration is not enough to make the
function available for calling. The declaration code (from the start of
the name to the closing curly bracket, inclusive) must first be
executed. Execute this code in a script as we always do, by
selecting all the code (with the mouse) and typing ctrl-Enter (Windows)
or cmd-Enter (Mac). This is called sourcing the
function.
- When you source a function, nothing happens. In our
example, you will not see the output of the print statement when you
execute the declaration code. Sourcing a function declaration
does not run the code body. It simply parses
the code body and, if there are no errors, stores it in the environment.
Effectively, it informs RStudio that a function with this name exists,
and defines the code which it encapsulates, so it can be called
later.
- To execute the code body, state the name of the function followed by
(), with no intervening spaces. This calls the
function; we think of () as the call operator. Whether
the code body contains 1 line or 1,000 lines, or more, calling the
function by name runs all the encapsulated code.
In R a function cannot be called until after the function
has been sourced. Consider the following example:
# If you try to call the function before it is declared and sourced
fever_in_fahrenheit()
#> Error in fever_in_fahrenheit(): could not find function "fever_in_fahrenheit"
fever_in_fahrenheit <- function()
{
fever_in_celsius <- 37.5
converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
print(converted_to_fahrenheit)
}
There are more advanced techniques that allow us to call functions
contained in other script files where we cannot directly select and
execute the declaration code. See the reading for discussion.
Providing data to a function
Consider the following function, which demonstrates how to compute
BMI (body mass index) in R, using the given values for height (in
metres) and weight (in kg):
# Declare function calc_BMI
calc_BMI <- function()
{
weight <- 73
height <- 1.68
bmi <- weight/height^2 # ^ is the exponentiation operator
print(bmi)
}
# Call function calc_BMI
calc_BMI()
#> [1] 25.86451
This function contains nice tidy code and is mathematically correct,
but unless we happen to want the BMI of a person with exactly this
height and weight, it is of no use to us. A useful function defines the
computation (in this case taking the ratio of weight to height squared),
and accepts the data values when it is called.
We have seen this many times when writing R code, when we call a
function repeatedly on different input values:
sqrt(14)
#> [1] 3.741657
sqrt(820)
#> [1] 28.63564
sqrt(0.65)
#> [1] 0.8062258
To declare a function that can accept input arguments, perform these
steps:
- Between the round brackets that follow the keyword
function
, place a variable name for each
piece of data you wish to input when the function is called.
- In the function body refer to the input variables by the
name you placed between the round brackets. You DO NOT need to
initialise the variables inside the code body – in fact you MUST NOT do
so. Variables with those names are automatically created for you when
the function is called.
- When calling the function, provide a value for each
input variable. You do not need to repeat the variable name, just
provide values, separated by commas, in the same order as the variables
are listed in the declatration.
- When the function is called, the system creates the input variables,
assigns them the corresponding values from the call statement, and
executes the code body, using those variables.
We can modify our calc_BMI
function to accept weight and
height as input variables, as shown. Compare this version to the earlier
version. Note that we do not declare and initialise weight and
height inside the code body.
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
# Use input arguments weight and height. DON'T initialise them.
bmi <- weight/height^2
print(bmi)
}
# Call function calc_BMI
calc_BMI(73, 1.68)
#> [1] 25.86451
Following the rules
Declaring a function with inputs defines the required
interface of the function. That is, it defines what you
have to provide if you want to call the function. If the caller violates
the interface, the code is not guaranteed to work. For example, function
calc_bmi
is declared with two input arguments. Therefore,
all of these calls violate the interface.
calc_BMI()
#> Error in calc_BMI(): argument "weight" is missing, with no default
calc_BMI(16)
#> Error in calc_BMI(16): argument "height" is missing, with no default
calc_BMI(73, 168, 42)
#> Error in calc_BMI(73, 168, 42): unused argument (42)
In some programming languages, one is required to specify the data
type (e.g. number, string, data frame, etc.) of each input argument.
Code that tries to pass in the wrong type of data will not compile. In R
this is not required. R will try to run your code no matter what sort of
data it gets. However, if it tries to operate on the wrong type of data,
it will throw an error:
calc_BMI(42, "fred")
#> Error in height^2: non-numeric argument to binary operator
Naturally, R doesn’t understand how to apply the exponentiation
operator to “fred”, and it is telling you so.
More on the rules
Consider the following code:
# Declare the function
compute_area <- function(width)
{
area <- width * height
print(area)
}
# Call the function
compute_area(35)
The function compute_area
is defined with one argument.
Therefore, when it is called, we must provide one value between the
round brackets. In the call, we have correctly provided one argument, of
the correct data type. Will the call compute_area(35)
work
correctly, or will it throw an error? What error will it throw?
# Declare the function
compute_area <- function(width)
{
area <- width * height
print(area)
}
# Call the function
compute_area(35)
#> Error in compute_area(35): object 'height' not found
When compute_area(35)
is executed, it produces an error
message: ...object 'height' not found
. In the code body of
compute_area
, we refer to an entity height
.
Since that entity is not surrounded by double-quotes, R expects to find
a variable named height existing in the environment,
and no such variable exists, because we have neither created one
directly (with an assignment statement) nor passed one in as an argument
to the function.
In the same line of code, we also refer to an entity
width
. Note that R does not complain about being unable to
find width
. That is because we defined an input argument
called width.
- How would you modify function
compute_area
to eliminate
the error?
- How would this change the form of the call to
compute_area
?
Taking your time
Traditionally, new programmers find the syntax of argument passing
extremely confusing. There are too many interacting parts: we
have variable names in the declaration, variables used in the code body,
and values passed in the call. At first exposure, it can be unclear how
all these parts work together. If this is the first time you have seen
this syntax, take some time to experiment with it to solidify your
understanding. Make up some simple user defined functions of your own to
practice managing input data.
Getting output from a function
In all of our examples so far, we have used a print statement to show
the result of the function’s computation. This is, of course, not
adequate in real programming, as a print statement simply writes to the
console. The computed value is not available for use later in our
script. We want to be able to save the result of a function into a
variable, as we have done with the built-in functions we have used
earlier in the course. For example, we have said
x <- sqrt(22)
, creating a variable called x
that stores the square root of 22. We could then use x
in
subsequent computations, as needed.
When a function makes its result available for storage, we say it is
returning its result. The command sqrt(22)
returns the square root of 22, and we can store it in a
variable using the assignment statement.
We can make our user-defined function behave in the same way, by
using return
instead of print
. For example, we
can modify calc_bmi
:
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
bmi <- weight/height^2 # ^ is the exponentiation operator
return(bmi)
}
With this change to the function, we can store the result of the BMI
computation in a variable for later use.
# Save the return value in a variable.
bmi_result <- calc_BMI(75, 1.75)
# Display by name, as we always can do in R with variables
bmi_result
#> [1] 24.4898
CAUTION
In R, the return
keyword is actually optional. By
default, R functions just return the value of their last line of code.
So technically, you can write the function above like this without
changing its behaviour:
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
bmi <- weight/height^2 # ^ is the exponentiation operator
# Omitting the explicit call to return. Don't do this.
bmi
}
You will see this shortcut used in R code in the wild. However, it is
an old-fashioned syntax, and it can lead to subtle errors in complex R
programs. We suggest that you avoid it. Make the behaviour of your
functions clear by explicitly identifying the return value with function
return
.
Function exercise
Returning to the Palmers Penguins data, and using the techniques we
have seen for performing descriptive statistics, consider the following
code, which computes summaries of the body mass measure (i.e. the
dependent variable in this summary is column body_mass_g).
library(palmerpenguins)
dv_vector <- penguins$body_mass_g
mean_dv <- mean(dv_vector, na.rm = TRUE)
sd_dv <- sd(dv_vector, na.rm = TRUE)
min_dv <- min(dv_vector, na.rm = TRUE)
max_dv <- max(dv_vector, na.rm = TRUE)
This code fragment creates four new variables that we could use in
later computations, or in generating reports.
Imagine that we wish to do the same summary for flipper length. We
could copy and paste the code, changing the line where we assign
variable dv_vector
. Then if we wanted to do the same
summary on bill length, we could copy and paste the code again, and
again change the assignment to dv_vector
. By now, we are
bored of this, and our script is getting needlessly big and messy. If we
later decide we need to include the median, we will need to go back and
add the median
command multiple times, greatly increasing
the chance of errors (and also being boring).
Computing this set of descriptive stats is a logically delineated
task, and we should therefore encapsulate it into a user defined
function. Each time we do this summary, the computation (the four calls
to mean
, sd
, min
and
max
) remains the same, but the data we wish to process
changes. Each time we call the function, we want to be able to provide
it with the data to process. We must therefore pass the data in as an
input argument.
Get output from the function
As discussed earlier, displaying function output via print statements
is of limited utility; it is preferable to return the results of a
function, so it can be stored in a variable. We would like, therefore,
to return the output of function desc_stats
. Unfortunately,
a function in R can only return a single data object
and our function computes four values.
To resolve this, we can bundle up our four computed values into a
single data object, called a list. An R
list is like a vector, except that each element has a
name as well as an ordinal position, and elements can be retrieved
(using the [] operator) either by name or position. Consider this
example:
# Create a list
character_list <- list(Name = "Snoopy", Breed = "Beagle", Owner = "Charlie Brown")
# Display the entire list
character_list
#> $Name
#> [1] "Snoopy"
#>
#> $Breed
#> [1] "Beagle"
#>
#> $Owner
#> [1] "Charlie Brown"
# Some examples of selection from the list
character_list["Name"]
#> $Name
#> [1] "Snoopy"
character_list["Breed"]
#> $Breed
#> [1] "Beagle"
character_list[2]
#> $Breed
#> [1] "Beagle"
character_list[3]
#> $Owner
#> [1] "Charlie Brown"
character_list$Name
#> [1] "Snoopy"
character_list$Owner
#> [1] "Charlie Brown"
A list is considered a single data object, so it can be returned from
a function via a return
statement. Modify function
desc_stats
to return its four outputs,
rather than printing them. Call your function and use any of the
syntactic options above to explore the contents of the returned
object.
My solution is:
# Modify the function to return a list
desc_stats <- function(dv_vector)
{
mean_dv <- mean(dv_vector, na.rm = TRUE)
sd_dv <- sd(dv_vector, na.rm = TRUE)
min_dv <- min(dv_vector, na.rm = TRUE)
max_dv <- max(dv_vector, na.rm = TRUE)
result_list <- list(Mean = mean_dv,
Sd = sd_dv,
Min = min_dv,
Max = max_dv)
return(result_list)
}
flipper_desc_stats <- desc_stats(penguins$flipper_length_mm)
flipper_desc_stats$Mean
#> [1] 200.9152
flipper_desc_stats[2]
#> $Sd
#> [1] 14.06171
min_max <- c(flipper_desc_stats["Min"], flipper_desc_stats["Max"])
print(min_max)
#> $Min
#> [1] 172
#>
#> $Max
#> [1] 231
Scope
When you specify a variable in R it will start trying to find
something with that name within the global environment (displayed in the
Environment tab in RStudio). In the case of functions, any
variable defined in the function (including through its arguments) stays
within the function (a separate local environment specific to the
function). If however in the body of the function you refer to a
variable that hasn’t been defined in the function, R will start looking
at the global environment and if it finds a variable of the same name
you’ve created outside of the function, it will use the value that is
stored within it. This behaviour can cause issues.
Here is an example, where the function needs a value for
n
but it hasn’t been supplied as an argument and there is
no default value.
# multiplies the number x by the number n
multiply_by_n <- function(x){
x * n
}
multiply_by_n(x = 3)
#> Error in multiply_by_n(x = 3): object 'n' not found
In the function body we referred to n
which wasn’t
defined anywhere so we got an error.
Let’s use the same function again but define n
outside
the function:
# multiplies the number x by the number n
multiply_by_n <- function(x){
x * n
}
# define n in the global environment
n <- 10
multiply_by_n(x = 3)
#> [1] 30
This time R looks for n
inside the body but doesn’t find
it and when it looks into the global environment it finds a variable
named n
and so uses that value.
This time we’re going to modify the function to take a second
argument called n
, and also have n
defined in
the global environment:
# multiplies the number x by the number n
multiply_by_n <- function(x, n){
x * n
}
# define n in the global environment
n <- 10
multiply_by_n(x = 3, n = 2)
#> [1] 6
The value of n
that was used was the value supplied as
the argument, rather than the version that was defined in the global
environment. Thus a function will use a locally defined variable when
one exists, but if it can’t find one, it will look in the global
environment.
This behaviour can lead to subtle errors in your code. For example,
earlier in this module we wrote a version of user-defined function
calc_area
that tried to reference variable
height
without having passed it in as an argument. When R
was unable to find height
, it alerted us with an error
message, and we were able to identify and correct the error in the code
(the missing input argument height
). If, at that point, we
had already created a variable height
outside the function
(for any reason), R would have happily used that existing
variable when executing the function code body and we would not
have realised there was an error. Of course, if we ever called the bad
function when we hadn’t previously created height
, our
program would crash. Some programmers use special variable coding styles
– for example prefixing all input arguents with underscore – to prevent
confusion between local and global variables. It is very important, when
working in R to be constantly aware of which variables exist in the
global environment. Keep your eye on the Environment tab as you write
functions.
Complex Program Behaviour
In very simple programs, we write a set of commands which are
executed in sequence – first statement, second statement, third
statement, etc. Each time we run the code, the exact same set of
commands are executed, in the exact same sequence.
However, as computational problems become more complex, the behaviour
of our code solutions must also become more complex. We may have
sections of code that we want to repeat some varying number times
depending on our data, or sections of code that we only want to execute
under certain conditions.
The specific time course of program execution is called flow
of control, and R provides many syntactic features that allow
us to manage it. Flow of control generally depends on the program’s
state – the specific set of variables and their values
at each point during program execution.
For example, when we import a csv file into a data frame, that data
frame is part of the state. We may want to execute a function once for
each row in the data frame. If we import a data frame with 53 rows, we
want the function to be called 53 times. Thus the number of times the
function is called (the flow of control) depends on the state.
Similarly, when we write a function that accepts data input arguments,
the value of those arguments passed in at function call are part of the
state, and can be different each time the function is called. Frequently
we will want to take different actions in the function depending on that
state.
We have described the three primary flow of control constructs:
- Sequential execution: Statements are executed in order
- Iteration: A set of statements is repeated a number of times
- Conditional execution: A set of statements may or may not be
executed, depending on the state.
We have already seen, and written, code that contains only sequential
exection. We will now look at the syntactic tools for conditional
execution. In our next module, we will cover iteration, which is
syntactically more complex.
If statements
All modern programming languages allow you to wrap a block of code in
an if statement. If statements contain a
condition, which is an expression which evaluates to
either true or false. At runtime, the condition is evaluated. If it is
true, the block of code is executed; if it is false, the block of code
is not executed. Conditional code blocks in R are delineated with curly
brackets. The conditional expression is delineated with round
brackets.
Schematically:
if (condition) {
# code here is only run if condition was TRUE
}
Consider this toy example:
did_I_pass_the_paper <- function(my_mark)
{
if (my_mark > 50) {
print("You passed!")
}
}
# This call generates no output
did_I_pass_the_paper(14)
# This call generates output
did_I_pass_the_paper(73)
#> [1] "You passed!"
Adding an alternative with else
Frequently we wish to define two behaviours for a conditional
expression – one for when it is true and another for when it is false.
In R we do this with the keyword else
if (condition) {
# code here is only run if condition was TRUE
} else {
# code here is only run if condition was FALSE
}
For example:
did_I_pass_the_paper <- function(my_mark)
{
if (my_mark > 50) {
print("You passed! :)")
} else {
print("Sorry, you didn't pass. :(")
}
}
# This call runs the else block
did_I_pass_the_paper(14)
#> [1] "Sorry, you didn't pass. :("
# This call runs the if block
did_I_pass_the_paper(73)
#> [1] "You passed! :)"
NB: Pay careful attention to the placement of the curly brackets for
both the if block and the else block. The first curly bracket must
sit on the same line as the if statement. The else
keyword must be on the same line as the closing curly bracket of the
if
block, and must be followed, on that same line, by the
opening curly bracket of the else
block. It is a known
peculiarity of the R language that it is extremely
fussy about this rule. No use fighting it; just follow the
rule.
An extension of else
is the else if
construct that lets you link a series of conditions. The conditions are
tested one at a time from the top and the first condition that evaluates
to TRUE
is the only code block that gets run. For
example:
bmi_category <- function(bmi)
{
if(bmi > 30){
print("obese")
} else if (bmi > 25){
print("overweight")
} else if (bmi > 20){
print("healthy")
} else {
print("underweight")
}
}
bmi_category(22)
#> [1] "healthy"
bmi_category(18)
#> [1] "underweight"
Conditional statements can be nested. That is, inside the
if
or else
block, you can have more
conditional statements, each of which can have if
and
else
blocks, each of which can in turn have nested
conditional statements, and so on. For complex computations, the
conditional logic can become very elaborate, and needs to be approached
carefully. I find it helpful in these cases to sketch out a flow chart
showing all the different outcomes based on state, and use that as a
pattern for writing and arranging the various code blocks.
Complex conditional expressions
In our previous examples, we wrote conditional expressions using the
> (greater than) operator. The expression
my_mark > 50
evaluates to either true or false (i.e. a
boolean value). If variable my_mark
is
greater than 50, the expression returns true, otherwise it returns
false. Greater than is a comparison operator. The R
comparison operators are:
== |
equal to |
!= |
not equal to |
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
The Boolean logic operators can be used in to modify
or combine conditional expressions.
For example, the following function might be used to check that a
value entered as a penguin body mass was within the expected weight
range for the species.
# Chinstrap penguins weight between 3 and 5 kg
check_chinstrap_weight <- function(weight_to_check)
{
if ((weight_to_check >= 3000) & (weight_to_check <= 5000)) {
print("Weight ok")
} else {
print("That's probably a typo")
}
}
check_chinstrap_weight(4100)
#> [1] "Weight ok"
check_chinstrap_weight(410)
#> [1] "That's probably a typo"
The result of the NOT, AND, and OR are shown in the logic table:
!TRUE |
FALSE |
|
|
!FALSE |
TRUE |
|
|
TRUE & TRUE |
TRUE |
|
|
TRUE & FALSE |
FALSE |
|
|
FALSE & TRUE |
FALSE |
|
|
FALSE & FALSE |
FALSE |
|
|
TRUE | TRUE |
TRUE |
|
|
TRUE | FALSE |
TRUE |
|
|
FALSE | TRUE |
TRUE |
|
|
FALSE | FALSE |
FALSE |
|
|
Conditional exercise
Write and test a function that determines whether a student receives
a passing grade on an assessment. The function should accept two input
args: the number of marks earned, and the total number of marks possible
for the assessment. The student must earn 50% of the available marks in
order to pass. For example, if a student earns 18 marks on a 20 mark
assessment they pass, but if they earn only 8 marks, they fail. Your
function should print “Pass” or “Fail” as appropriate based on the input
data.
pass_check <- function(earned, possible)
{
mark <- earned/possible
if (mark > 0.5){
print("Pass")
} else {
print("Fail")
}
}
pass_check(18, 20)
#> [1] "Pass"
pass_check(8,20)
#> [1] "Fail"
pass_check(18,100)
#> [1] "Fail"
pass_check(8,10)
#> [1] "Pass"
Function Error Checking
We can use conditional control statements to provide error checking
in user-defined functions.
Failing
One of the sayings in programming is “if it’s going to fail, it’s
best to fail early”. That is, if we know that our function requires a
specific input data type, we want to program
defensively so that our function “fails” before it
encounters the error. As part of our defensive programming we can
provide informative error messages, rather than rely on R’s generic
ones.
In the following example, we check that the data coming into our
function is numeric. If it is not, we use function stop
, to
exit the function, displaying our informative message.
# Returns the provided number doubled
double_number <- function(x) {
if( !is.numeric(x) ){
stop("x needs to be a number.")
}
x * 2
}
double_number(4)
#> [1] 8
double_number("a")
#> Error in double_number("a"): x needs to be a number.
N.B. Check the appendix for more on data types.
Conclusion
This document has presented an introduction to creating your own
functions and implementing conditional flow of control. For more detail,
see the free online books Advanced
R and R packages.
Appendix
Data types
Not all data are created equal, in R this concept is captured by
data types. For a vector, all values must be of a single data
type.
The main data types that you will encounter in R are:
- Logical (
c(TRUE, FALSE)
)
- Numeric - also called Real or Double
(Numbers that have a floating point (decimal) representation
e.g.
c(1, 3.6, 1e3)
)
- Character - also called String (anything inside
matching opening and closing quotes (single or double)
e.g.
c("a", "some words", "animal"
))
There are 3 other less common:
- Integer (integers
c(1L, 4L, -3L)
)
- Complex (Complex numbers
e.g.
c(0+3i, 4i, -2-5i)
)
- Raw (the bytes of a file)
Each data type is known in R as an atomic vector. R has
built in functions to be able to determine the data type of a vector,
typeof()
is the best one to use, but others such as
str()
and class()
can be used.
There is also a series of functions that let us do explicit checking
for a data type which will return TRUE
or
FALSE
:
is.logical()
is.numeric()
or is.double()
is.character()
is.integer()
is.complex()
is.raw()
Data Type Coercion
In R, when doing operations on multiple vectors, they all need to be
the same data type - but how can this work if we have for example a
numeric vector and a character vector? Coercion is how R deals
with trying to operate on two vectors of different data types. What this
means in practice is that R will convert the data type of a vector in a
defined manner such that we end up will all of the same type and follows
a “lowest common data type” approach. Using the 3 main data types from
above, the following is the order in which they will be coerced into the
next data type: logical -> numeric ->
character.
This principle applies when you try to create a vector of mixed data
types too, R will coerce everything until it is a single data type.
See if you can predict what data type the result will be (you can
check by using typeof()
:
# logical and numeric
c(4, TRUE, 5)
# numeric and character
c(1, 3, "A")
# logical and character
c(FALSE, "cat","frog")
# mixed
c("see", TRUE, 4.8)
# tricky
c("1.3", "4", TRUE)
We can also explicitly force coercion into a particular data type by
using the following:
as.logical()
as.numeric()
as.character()
The other data types also have similarly named functions. When going
against the normal direction of coercion, it is important to realise
that if your data doesn’t have a representation in that data type, it
will become NA (missing).
---
title: "Functions and Choices"
date: "Semester 2, 2022"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    code_download: true
    code_folding: show
---

```{r setup, include=FALSE}
library(knitr)

knitr::opts_chunk$set(
  comment = "#>",
  fig.path = "figures/07/", # use only for single Rmd files
  collapse = TRUE,
  echo = TRUE
)


```

> #### Associated Material
>
> Zoom notes: [Zoom notes 07 - Functions and Choices](zoom_notes_07.html)
> 
> Readings:
>
> - [R for Data Science - Chapter 19](https://r4ds.had.co.nz/functions.html)

\

# Introduction

As the work you do in R becomes more complex, your R scripts will get longer, and may become difficult to manage. You may find that it is difficult to locate particular bits of code, or that you seem to be writing the same, or very similar bits of code, over and over again. If you want to modify those bits, there are multiple place in the script you have to edit, typos start to creep in, and the whole process becomes unpleasant.

To prevent this, large programs need to be organised into **logical modules**. Sections of code that perform a clearly delineated task can be enacapsulated and named. The encapsulated code can be invoked simply by typing the name -- no need to cut and paste, no need to modify multiple copies. Scripts are shorter and easier to manage. Code that is organised this way is said to be **modular**. Modular code is better code.

In R, the main code module is the **function**. We have already used many functions, like `read.csv` and `sqrt` and `ggplot`. We invoke them in our code as single commands, but each of these commands actually ecapsulates many, many (add another "many" for ggplot) lines of code. We call the function by name, and all the associated code is executed.

Many functions in R will display their code in the console. Type the function name **without the ()** into the console, and the encapsulated commands will be shown. (Look at the code for function `filter`.)

To make our own code modular, we can define our own functions. We write the code for a function somewhere in our script file. We can then **call** the function by name anywhere else in our script file, and all the code is executed. Just as with built-in functions, we can pass data into our function, and it can operate on those data. We can also arrange for our function to **return a result** that we can store in a variable. 


# Function Declaration Syntax

A function in R is comprised of four parts:

1. a name
2. the body (the code that does something)
3. (optionally) input data
4. (optionally) an output that can be stored


## Creating a function

We will begin with the simplest function, one that accepts no input data, and returns no result. We **declare** the function using keyword `function`. Schematically, a user-defined function is created as:

```
name_of_function <- function()
{
   Code Body
}
``` 

The curly braces enclose the code body.


Subesquently we **call** the function as:

*name_of_function*()

When the call is executed, all the commands in the code body are run.

For example, in the current public health situation, we often see discussions of what body temperature constitutes a fever. We might want to be able to translate that temperature from Celsius to Fahrenheit (as used in the North American literature). This is a logically well-delineated computation, so we would encapsulate it in a function.

```{r declare a function}
# Declare/define the function
fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}
```

```{r call}
# Call a user-defined function by name
fever_in_fahrenheit()
```

Some things to note:

-  Typing out the function declaration is not enough to make the function available for calling. The declaration code (from the start of the name to the closing curly bracket, inclusive) **must first be executed**. Execute this code in a script as we always do, by selecting all the code (with the mouse) and typing ctrl-Enter (Windows) or cmd-Enter (Mac). This is called **sourcing** the function.
- When you source a function, **nothing happens**. In our example, you will not see the output of the print statement when you execute the declaration code. Sourcing a function declaration **does not run the code body**. It simply *parses* the code body and, if there are no errors, stores it in the environment. Effectively, it informs RStudio that a function with this name exists, and defines the code which it encapsulates, so it can be called later.
- To execute the code body, state the name of the function followed by (), with no intervening spaces. This **calls** the function; we think of () as the **call operator**. Whether the code body contains 1 line or 1,000 lines, or more, calling the function by name runs all the encapsulated code.


In R a function cannot be called until *after* the function has been sourced. Consider the following example:

```{r clear, echo=FALSE}
rm(list = ls())
```

```{r wrong order, error=TRUE}

# If you try to call the function before it is declared and sourced
fever_in_fahrenheit()

fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}

```

There are more advanced techniques that allow us to call functions contained in other script files where we cannot directly select and execute the declaration code. See the reading for discussion.


## Providing data to a function

Consider the following function, which demonstrates how to compute BMI (body mass index) in R, using the given values for height (in metres) and weight (in kg):

```{r calc BMI}

# Declare function calc_BMI
calc_BMI <- function()
{
  weight <- 73
  height <- 1.68
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI()

```
This function contains nice tidy code and is mathematically correct, but unless we happen to want the BMI of a person with exactly this height and weight, it is of no use to us. A useful function defines the computation (in this case taking the ratio of weight to height squared), and **accepts the data values** when it is called.

We have seen this many times when writing R code, when we call a function repeatedly on different input values:

```{r built in inputs}

sqrt(14)

sqrt(820)

sqrt(0.65)

```

To declare a function that can accept input arguments, perform these steps:

1. Between the round brackets that follow the keyword `function`, place a **variable name** for each piece of data you wish to input when the function is called.
2. In the function body refer to the input variables **by the name you placed between the round brackets**. You DO NOT need to initialise the variables inside the code body -- in fact you MUST NOT do so. Variables with those names are automatically created for you when the function is called. 
3. When calling the function, provide **a value** for each input variable. You do not need to repeat the variable name, just provide values, separated by commas, in the same order as the variables are listed in the declatration.
4. When the function is called, the system creates the input variables, assigns them the corresponding values from the call statement, and executes the code body, using those variables.

We can modify our `calc_BMI` function to accept weight and height as input variables, as shown. Compare this version to the earlier version. Note that we *do not* declare and initialise weight and height inside the code body.

```{r inputs}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  # Use input arguments weight and height. DON'T initialise them.
  bmi <- weight/height^2
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI(73, 1.68)

```

### Following the rules

Declaring a function with inputs defines the required **interface** of the function. That is, it defines what you have to provide if you want to call the function. If the caller violates the interface, the code is not guaranteed to work. For example, function `calc_bmi` is declared with two input arguments. Therefore, all of these calls violate the interface.

```{r bad args 1, error=TRUE}
calc_BMI()
```

```{r bad_args 2, error=TRUE}
calc_BMI(16)
```

```{r bad args 3, error=TRUE}
calc_BMI(73, 168, 42)
```

In some programming languages, one is required to specify the data type (e.g. number, string, data frame, etc.) of each input argument. Code that tries to pass in the wrong type of data will not compile. In R this is not required. R will try to run your code no matter what sort of data it gets. However, if it tries to operate on the wrong type of data, it will throw an error:


```{r bad args 4, error=TRUE}
calc_BMI(42, "fred")
```

Naturally, R doesn't understand how to apply the exponentiation operator to "fred", and it is telling you so.

### More on the rules

Consider the following code:

```{r args 2a, eval = FALSE}
# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)
```


The function `compute_area` is defined with one argument. Therefore, when it is called, we must provide one value between the round brackets. In the call, we have correctly provided one argument, of the correct data type. Will the call `compute_area(35)` work correctly, or will it throw an error? What error will it throw?

```{r args 2b, error = TRUE}
# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)
```

When `compute_area(35)` is executed, it produces an error message: `...object 'height' not found`. In the code body of `compute_area`, we refer to an entity `height`. Since that entity is not surrounded by double-quotes, R expects to find a variable named **height** existing in the environment, and no such variable exists, because we have neither created one directly (with an assignment statement) nor passed one in as an argument to the function. 

In the same line of code, we also refer to an entity `width`. Note that R does not complain about being unable to find `width`. That is because we defined an input argument called **width**. 

1. How would you modify function `compute_area` to eliminate the error?
2. How would this change the form of the call to `compute_area`?


### Taking your time

Traditionally, new programmers find the syntax of argument passing *extremely* confusing. There are too many interacting parts: we have variable names in the declaration, variables used in the code body, and values passed in the call. At first exposure, it can be unclear how all these parts work together. If this is the first time you have seen this syntax, take some time to experiment with it to solidify your understanding. Make up some simple user defined functions of your own to practice managing input data.


## Getting output from a function

In all of our examples so far, we have used a print statement to show the result of the function's computation. This is, of course, not adequate in real programming, as a print statement simply writes to the console. The computed value is not available for use later in our script. We want to be able to save the result of a function into a variable, as we have done with the built-in functions we have used earlier in the course. For example, we have said `x <- sqrt(22)`, creating a variable called *x* that stores the square root of 22. We could then use `x` in subsequent computations, as needed.

When a function makes its result available for storage, we say it is **returning** its result. The command `sqrt(22)` **returns** the square root of 22, and we can store it in a variable using the assignment statement.

We can make our user-defined function behave in the same way, by using `return` instead of `print`. For example, we can modify `calc_bmi`:

```{r return}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  return(bmi)
}
```

With this change to the function, we can store the result of the BMI computation in a variable for later use.

```{r saving a return}

# Save the return value in a variable.
bmi_result <- calc_BMI(75, 1.75)

# Display by name, as we always can do in R with variables
bmi_result
```

### CAUTION

In R, the `return` keyword is actually optional. By default, R functions just return the value of their last line of code. So technically, you can write the function above like this without changing its behaviour:

```{r blank return}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  # Omitting the explicit call to return. Don't do this.
  bmi
}
```

You will see this shortcut used in R code in the wild. However, it is an old-fashioned syntax, and it can lead to subtle errors in complex R programs. We suggest that you avoid it. Make the behaviour of your functions clear by explicitly identifying the return value with function `return`.


## Function exercise

Returning to the Palmers Penguins data, and using the techniques we have seen for performing descriptive statistics, consider the following code, which computes summaries of the body mass measure (i.e. the dependent variable in this summary is  column *body_mass_g*).

```{r summarising, warning=FALSE, message=FALSE}
library(palmerpenguins)

dv_vector <- penguins$body_mass_g

mean_dv <- mean(dv_vector, na.rm = TRUE)
sd_dv <- sd(dv_vector, na.rm = TRUE)
min_dv <- min(dv_vector, na.rm = TRUE)
max_dv <- max(dv_vector, na.rm = TRUE)

```


This code fragment creates four new variables that we could use in later computations, or in generating reports. 

Imagine that we wish to do the same summary for flipper length. We could copy and paste the code, changing the line where we assign variable `dv_vector`. Then if we wanted to do the same summary on bill length, we could copy and paste the code again, and again change the assignment to `dv_vector`. By now, we are bored of this, and our script is getting needlessly big and messy. If we later decide we need to include the median, we will need to go back and add the `median` command multiple times, greatly increasing the chance of errors (and also being boring).

Computing this set of descriptive stats is a logically delineated task, and we should therefore encapsulate it into a user defined function. Each time we do this summary, the computation (the four calls to `mean`, `sd`, `min` and `max`) remains the same, but the data we wish to process changes. Each time we call the function, we want to be able to provide it with the data to process. We must therefore pass the data in as an input argument.

### Make a function with input arguments

Convert this code into a function that accepts a single input argument, which will be the vector of data to be analysed. For now, use print statements to display the results, not return statements, so we can concentrate on getting the data in. Call your function `desc_stats`.

```{r summarising function}
library(palmerpenguins)

desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  print(mean_dv)
  print(sd_dv)
  print(min_dv)
  print(max_dv)
}  
```

Source the function by selecting all the code (from before `desc_stats` down to and including the closing curly bracket) and typing ctrl-Enter or cmd-Enter. Test your function by calling it on the body mass data, the flipper length data, and the bill length data.

```{r calling}
desc_stats(penguins$body_mass_g)
desc_stats(penguins$flipper_length_mm)
desc_stats(penguins$bill_length_mm)
```

Stop to admire how tidy and succinct your code is. Note that if you now decide to add the median, you only need to make the change in one place -- in the function declaration itself -- regardless of how many different data sets you have processed.

**NB: If you do change the function, you must source the function again (i.e. you must repeat the "select all the function code and type ctrl-enter" step) to update R's stored copy of the function. You will not see any change in behaviour until the modified function code is sourced.**

### Get output from the function

As discussed earlier, displaying function output via print statements is of limited utility; it is preferable to return the results of a function, so it can be stored in a variable. We would like, therefore, to return the output of function `desc_stats`. Unfortunately, a function in R **can only return a single data object** and our function computes *four* values.

To resolve this, we can bundle up our four computed values into a single data object, called a **list**. An R **list** is like a vector, except that each element has a name as well as an ordinal position, and elements can be retrieved (using the [] operator) either by name or position. Consider this example:

```{r list}
# Create a list
character_list <- list(Name = "Snoopy", Breed = "Beagle", Owner = "Charlie Brown")

# Display the entire list
character_list

# Some examples of selection from the list
character_list["Name"]
character_list["Breed"]

character_list[2]
character_list[3]

character_list$Name
character_list$Owner
```

A list is considered a single data object, so it can be returned from a function via a `return` statement.
Modify function `desc_stats` to **return** its four outputs, rather than printing them. Call your function and use any of the syntactic options above to explore the contents of the returned object. 

My solution is:

```{r output multiple items}
# Modify the function to return a list
desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  result_list <- list(Mean = mean_dv,
                      Sd = sd_dv,
                      Min = min_dv,
                      Max = max_dv)
  
  return(result_list)
}


flipper_desc_stats <- desc_stats(penguins$flipper_length_mm)

flipper_desc_stats$Mean

flipper_desc_stats[2]

min_max <- c(flipper_desc_stats["Min"], flipper_desc_stats["Max"])

print(min_max)


```


# Scope

When you specify a variable in R it will start trying to find something with that name within the global environment (displayed in the _Environment_ tab in RStudio). In the case of functions, any variable defined in the function (including through its arguments) stays within the function (a separate local environment specific to the function). If however in the body of the function you refer to a variable that hasn't been defined in the function, R will start looking at the global environment and if it finds a variable of the same name you've created outside of the function, it will use the value that is stored within it. This behaviour can cause issues.


Here is an example, where the function needs a value for `n` but it hasn't been supplied as an argument and there is no default value.

```{r, error=TRUE}
# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

multiply_by_n(x = 3)
```

In the function body we referred to `n` which wasn't defined anywhere so we got an error.


Let's use the same function again but define `n` outside the function:
```{r}
# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

# define n in the global environment 
n <- 10

multiply_by_n(x = 3)
```
This time R looks for `n` inside the body but doesn't find it and when it looks into the global environment it finds a variable named `n` and so uses that value.
 
 This time we're going to modify the function to take a second argument called `n`, and also have `n` defined in the global environment:
```{r}
# multiplies the number x by the number n
multiply_by_n <- function(x, n){
  x * n
}

# define n in the global environment
n <- 10

multiply_by_n(x = 3, n = 2)
```
 
The value of `n` that was used was the value supplied as the argument, rather than the version that was defined in the global environment. Thus a function will use a locally defined variable when one exists, but if it can't find one, it will look in the global environment.

This behaviour can lead to subtle errors in your code. For example, earlier in this module we wrote a version of user-defined function `calc_area` that tried to reference variable `height` without having passed it in as an argument. When R was unable to find `height`, it alerted us with an error message, and we were able to identify and correct the error in the code (the missing input argument `height`). If, at that point, we had already created a variable `height` outside the function (for any reason), R would have happily **used that existing variable when executing the function code body** and we would not have realised there was an error. Of course, if we ever called the bad function when we hadn't previously created `height`, our program would crash. Some programmers use special variable coding styles -- for example prefixing all input arguents with underscore -- to prevent confusion between local and global variables. It is very important, when working in R to be constantly aware of which variables exist in the global environment. Keep your eye on the Environment tab as you write functions.



# Complex Program Behaviour

In very simple programs, we write a set of commands which are executed in sequence -- first statement, second statement, third statement, etc. Each time we run the code, the exact same set of commands are executed, in the exact same sequence. 

However, as computational problems become more complex, the behaviour of our code solutions must also become more complex. We may have sections of code that we want to repeat some varying number times depending on our data, or sections of code that we only want to execute under certain conditions.

The specific time course of program execution is called **flow of control**, and R provides many syntactic features that allow us to manage it. Flow of control generally depends on the program's **state** -- the specific set of variables and their values at each point during program execution. 

For example, when we import a csv file into a data frame, that data frame is part of the state. We may want to execute a function once for each row in the data frame. If we import a data frame with 53 rows, we want the function to be called 53 times. Thus the number of times the function is called (the flow of control) depends on the state. Similarly, when we write a function that accepts data input arguments, the value of those arguments passed in at function call are part of the state, and can be different each time the function is called. Frequently we will want to take different actions in the function depending on that state. 

We have described the three primary flow of control constructs:

-  Sequential execution: Statements are executed in order
-  Iteration: A set of statements is repeated a number of times
-  Conditional execution: A set of statements may or may not be executed, depending on the state.

We have already seen, and written, code that contains only sequential exection. We will now look at the syntactic tools for conditional execution. In our next module, we will cover iteration, which is syntactically more complex.

## If statements

All modern programming languages allow you to wrap a block of code in an **if statement**. If statements contain a **condition**, which is an expression which evaluates to either true or false. At runtime, the condition is evaluated. If it is true, the block of code is executed; if it is false, the block of code is not executed. Conditional code blocks in R are delineated with curly brackets. The conditional expression is delineated with round brackets.

Schematically:

```{r conditional pattern, eval = FALSE}
if (condition) {
  # code here is only run if condition was TRUE
}
```

Consider this toy example:

```{r simple if}
did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed!")
  }
}

# This call generates no output
did_I_pass_the_paper(14)

# This call generates output
did_I_pass_the_paper(73)
```


## Adding an alternative with `else`

Frequently we wish to define two behaviours for a conditional expression -- one for when it is true and another for when it is false. In R we do this with the keyword `else`

```{r if else pattern, eval = FALSE}
if (condition) {
  # code here is only run if condition was TRUE
} else {
  # code here is only run if condition was FALSE
}


```

For example:
```{r if else}
did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed! :)")
  } else {
    print("Sorry, you didn't pass. :(")
  }
}

# This call runs the else block
did_I_pass_the_paper(14)

# This call runs the if block
did_I_pass_the_paper(73)
```


NB: Pay careful attention to the placement of the curly brackets for both the if block and the else block. The first curly bracket *must sit on the same line as the if statement*. The `else` keyword must be on the same line as the closing curly bracket of the `if` block, and must be followed, on that same line, by the opening curly bracket of the `else` block. It is a known peculiarity of the R language that it is **extremely fussy** about this rule. No use fighting it; just follow the rule.

An extension of `else` is the `else if` construct that lets you link a series of conditions. The conditions are tested one at a time from the top and the first condition that evaluates to `TRUE` is the only code block that gets run. For example:


```{r else if}
bmi_category <- function(bmi)
{
    if(bmi > 30){
      print("obese")
    } else if (bmi > 25){
      print("overweight")
    } else if (bmi > 20){
      print("healthy")
    } else {
      print("underweight")
  }
}


bmi_category(22)

bmi_category(18)
```

Conditional statements can be nested. That is, inside the `if` or `else` block, you can have more conditional statements, each of which can have `if` and `else` blocks, each of which can in turn have nested conditional statements, and so on. For complex computations, the conditional logic can become very elaborate, and needs to be approached carefully. I find it helpful in these cases to sketch out a flow chart showing all the different outcomes based on state, and use that as a pattern for writing and arranging the various code blocks.


## Complex conditional expressions

In our previous examples, we wrote conditional expressions using the > (greater than) operator. The expression `my_mark > 50` evaluates to either true or false (i.e. a **boolean value**). If variable `my_mark` is greater than 50, the expression returns true, otherwise it returns false. Greater than is a **comparison operator**. The R comparison operators are:

Operator | Meaning
---------|---------
`==` | equal to
`!=` | not equal to
`<` | less than
`<=` | less than or equal to
`>` | greater than
`>=` | greater than or equal to


The **Boolean logic operators** can be used in to modify or combine conditional expressions.


Boolean Operation | Symbol in R
---|---
NOT | !
OR | \|
AND | &


For example, the following function might be used to check that a value entered as a penguin body mass was within the expected weight range for the species.

```{r compound conditional}
# Chinstrap penguins weight between 3 and 5 kg

check_chinstrap_weight <- function(weight_to_check)
{
  if ((weight_to_check >= 3000) & (weight_to_check <= 5000)) {
    print("Weight ok")
  } else {
    print("That's probably a typo")
  }
}


check_chinstrap_weight(4100)


check_chinstrap_weight(410)

```

The result of the NOT, AND, and OR are shown in the logic table:

Statement | Becomes
---|---|---|---
  !TRUE | `r !TRUE`
 !FALSE | `r !FALSE` 
TRUE & TRUE | `r TRUE & TRUE`
TRUE & FALSE | `r TRUE & FALSE`
FALSE & TRUE | `r FALSE & TRUE`
FALSE & FALSE | `r FALSE & FALSE`
TRUE \| TRUE | `r TRUE | TRUE`
TRUE \| FALSE | `r TRUE | FALSE`
FALSE \| TRUE | `r FALSE | TRUE`
FALSE \| FALSE | `r FALSE | FALSE`



## Conditional exercise
Write and test a function that determines whether a student receives a passing grade on an assessment. The function should accept two input args: the number of marks earned, and the total number of marks possible for the assessment. The student must earn 50% of the available marks in order to pass. For example, if a student earns 18 marks on a 20 mark assessment they pass, but if they earn only 8 marks, they fail. Your function should print "Pass" or "Fail" as appropriate based on the input data. 

```{r exercise solution}

pass_check <- function(earned, possible)
{
  mark <- earned/possible
  
  if (mark > 0.5){
    print("Pass")
  } else {
    print("Fail")
  }
  
}


pass_check(18, 20)
pass_check(8,20)
pass_check(18,100)
pass_check(8,10)

```


# Function Error Checking

We can use conditional control statements to provide error checking in user-defined functions.

## Failing

One of the sayings in programming is "if it's going to fail, it's best to fail early". That is, if we know that our function requires a specific input data type, we want to program **defensively** so that our function "fails" before it encounters the error. As part of our defensive programming we can provide informative error messages, rather than rely on R's generic ones.


In the following example, we check that the data coming into our function is numeric. If it is not, we use function `stop`, to exit the function, displaying our informative message.


```{r, error=TRUE}
# Returns the provided number doubled
double_number <- function(x) {
  if( !is.numeric(x) ){
    stop("x needs to be a number.")
  }
  x * 2
}


double_number(4)

double_number("a")
```

N.B. Check the appendix for more on data types.

# Conclusion
This document has presented an introduction to creating your own functions and implementing conditional flow of control. For more detail, see the free online books [Advanced R](https://adv-r.hadley.nz) and [R packages](https://r-pkgs.org).


# What's Next

Fill in the module feedback form [https://tinyurl.com/r4ssp-module-fb](https://tinyurl.com/r4ssp-module-fb).

Next move onto [R for Data Science Chapter 21 - Iteration (https://r4ds.had.co.nz/iteration.html)](https://r4ds.had.co.nz/iteration.html) to learn about how we can reduce code duplication, by repeating code efficiently through iteration or 'loops'. As always, if you run into trouble, let us know.

# Appendix

## Data types

Not all data are created equal, in R this concept is captured by _data types_. For a vector, all values must be of a single data type.

The main data types that you will encounter in R are:

- _Logical_ ( `c(TRUE, FALSE)`)
- _Numeric_ - also called _Real_ or _Double_ (Numbers that have a floating point (decimal) representation e.g. `c(1, 3.6, 1e3)`)
- _Character_ - also called _String_ (anything inside matching opening and closing quotes (single or double) e.g. `c("a", "some words", "animal"`))

There are 3 other less common:

- _Integer_ (integers `c(1L, 4L, -3L)`)
- _Complex_ (Complex numbers e.g. `c(0+3i, 4i, -2-5i)`)
- _Raw_ (the bytes of a file)


Each data type is known in R as an _atomic_ vector. R has built in functions to be able to determine the data type of a vector, `typeof()` is the best one to use, but others such as `str()` and `class()` can be used.

There is also a series of functions that let us do explicit checking for a data type which will return `TRUE` or `FALSE`:

- `is.logical()`
- `is.numeric()` or `is.double()`
- `is.character()`
- `is.integer()`
- `is.complex()`
- `is.raw()`



### Data Type Coercion

In R, when doing operations on multiple vectors, they all need to be the same data type - but how can this work if we have for example a numeric vector and a character vector? _Coercion_ is how R deals with trying to operate on two vectors of different data types. What this means in practice is that R will convert the data type of a vector in a defined manner such that we end up will all of the same type and follows a "lowest common data type"
 approach. Using the 3 main data types from above, the following is the order in which they will be coerced into the next data type: _logical_ -> _numeric_ -> _character_.
 
This principle applies when you try to create a vector of mixed data types too, R will coerce everything until it is a single data type.
 

See if you can predict what data type the result will be (you can check by using `typeof()`: 
```{r, eval = FALSE}
# logical and numeric
c(4, TRUE, 5)

# numeric and character
c(1, 3, "A")

# logical and character
c(FALSE, "cat","frog")

# mixed
c("see", TRUE, 4.8)

# tricky
c("1.3", "4", TRUE)
```

We can also explicitly force coercion into a particular data type by using the following:

- `as.logical()`
- `as.numeric()`
- `as.character()`

The other data types also have similarly named functions. When going against the normal direction of coercion, it is important to realise that if your data doesn't have a representation in that data type, it will become _NA_ (missing).
