Associated Material

Zoom notes: Zoom notes 07 - Functions and Choices

Readings:


Introduction

As the work you do in R becomes more complex, your R scripts will get longer, and may become difficult to manage. You may find that it is difficult to locate particular bits of code, or that you seem to be writing the same, or very similar bits of code, over and over again. If you want to modify those bits, there are multiple place in the script you have to edit, typos start to creep in, and the whole process becomes unpleasant.

To prevent this, large programs need to be organised into logical modules. Sections of code that perform a clearly delineated task can be enacapsulated and named. The encapsulated code can be invoked simply by typing the name – no need to cut and paste, no need to modify multiple copies. Scripts are shorter and easier to manage. Code that is organised this way is said to be modular. Modular code is better code.

In R, the main code module is the function. We have already used many functions, like read.csv and sqrt and ggplot. We invoke them in our code as single commands, but each of these commands actually ecapsulates many, many (add another “many” for ggplot) lines of code. We call the function by name, and all the associated code is executed.

Many functions in R will display their code in the console. Type the function name without the () into the console, and the encapsulated commands will be shown. (Look at the code for function filter.)

To make our own code modular, we can define our own functions. We write the code for a function somewhere in our script file. We can then call the function by name anywhere else in our script file, and all the code is executed. Just as with built-in functions, we can pass data into our function, and it can operate on those data. We can also arrange for our function to return a result that we can store in a variable.

Function Declaration Syntax

A function in R is comprised of four parts:

  1. a name
  2. the body (the code that does something)
  3. (optionally) input data
  4. (optionally) an output that can be stored

Creating a function

We will begin with the simplest function, one that accepts no input data, and returns no result. We declare the function using keyword function. Schematically, a user-defined function is created as:

name_of_function <- function()
{
   Code Body
}

The curly braces enclose the code body.

Subesquently we call the function as:

name_of_function()

When the call is executed, all the commands in the code body are run.

For example, in the current public health situation, we often see discussions of what body temperature constitutes a fever. We might want to be able to translate that temperature from Celsius to Fahrenheit (as used in the North American literature). This is a logically well-delineated computation, so we would encapsulate it in a function.

# Declare/define the function
fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}
# Call a user-defined function by name
fever_in_fahrenheit()
#> [1] 99.5

Some things to note:

  • Typing out the function declaration is not enough to make the function available for calling. The declaration code (from the start of the name to the closing curly bracket, inclusive) must first be executed. Execute this code in a script as we always do, by selecting all the code (with the mouse) and typing ctrl-Enter (Windows) or cmd-Enter (Mac). This is called sourcing the function.
  • When you source a function, nothing happens. In our example, you will not see the output of the print statement when you execute the declaration code. Sourcing a function declaration does not run the code body. It simply parses the code body and, if there are no errors, stores it in the environment. Effectively, it informs RStudio that a function with this name exists, and defines the code which it encapsulates, so it can be called later.
  • To execute the code body, state the name of the function followed by (), with no intervening spaces. This calls the function; we think of () as the call operator. Whether the code body contains 1 line or 1,000 lines, or more, calling the function by name runs all the encapsulated code.

In R a function cannot be called until after the function has been sourced. Consider the following example:


# If you try to call the function before it is declared and sourced
fever_in_fahrenheit()
#> Error in fever_in_fahrenheit(): could not find function "fever_in_fahrenheit"

fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}

There are more advanced techniques that allow us to call functions contained in other script files where we cannot directly select and execute the declaration code. See the reading for discussion.

Providing data to a function

Consider the following function, which demonstrates how to compute BMI (body mass index) in R, using the given values for height (in metres) and weight (in kg):


# Declare function calc_BMI
calc_BMI <- function()
{
  weight <- 73
  height <- 1.68
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI()
#> [1] 25.86451

This function contains nice tidy code and is mathematically correct, but unless we happen to want the BMI of a person with exactly this height and weight, it is of no use to us. A useful function defines the computation (in this case taking the ratio of weight to height squared), and accepts the data values when it is called.

We have seen this many times when writing R code, when we call a function repeatedly on different input values:


sqrt(14)
#> [1] 3.741657

sqrt(820)
#> [1] 28.63564

sqrt(0.65)
#> [1] 0.8062258

To declare a function that can accept input arguments, perform these steps:

  1. Between the round brackets that follow the keyword function, place a variable name for each piece of data you wish to input when the function is called.
  2. In the function body refer to the input variables by the name you placed between the round brackets. You DO NOT need to initialise the variables inside the code body – in fact you MUST NOT do so. Variables with those names are automatically created for you when the function is called.
  3. When calling the function, provide a value for each input variable. You do not need to repeat the variable name, just provide values, separated by commas, in the same order as the variables are listed in the declatration.
  4. When the function is called, the system creates the input variables, assigns them the corresponding values from the call statement, and executes the code body, using those variables.

We can modify our calc_BMI function to accept weight and height as input variables, as shown. Compare this version to the earlier version. Note that we do not declare and initialise weight and height inside the code body.

# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  # Use input arguments weight and height. DON'T initialise them.
  bmi <- weight/height^2
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI(73, 1.68)
#> [1] 25.86451

Following the rules

Declaring a function with inputs defines the required interface of the function. That is, it defines what you have to provide if you want to call the function. If the caller violates the interface, the code is not guaranteed to work. For example, function calc_bmi is declared with two input arguments. Therefore, all of these calls violate the interface.

calc_BMI()
#> Error in calc_BMI(): argument "weight" is missing, with no default
calc_BMI(16)
#> Error in calc_BMI(16): argument "height" is missing, with no default
calc_BMI(73, 168, 42)
#> Error in calc_BMI(73, 168, 42): unused argument (42)

In some programming languages, one is required to specify the data type (e.g. number, string, data frame, etc.) of each input argument. Code that tries to pass in the wrong type of data will not compile. In R this is not required. R will try to run your code no matter what sort of data it gets. However, if it tries to operate on the wrong type of data, it will throw an error:

calc_BMI(42, "fred")
#> Error in height^2: non-numeric argument to binary operator

Naturally, R doesn’t understand how to apply the exponentiation operator to “fred”, and it is telling you so.

More on the rules

Consider the following code:

# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)

The function compute_area is defined with one argument. Therefore, when it is called, we must provide one value between the round brackets. In the call, we have correctly provided one argument, of the correct data type. Will the call compute_area(35) work correctly, or will it throw an error? What error will it throw?

# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)
#> Error in compute_area(35): object 'height' not found

When compute_area(35) is executed, it produces an error message: ...object 'height' not found. In the code body of compute_area, we refer to an entity height. Since that entity is not surrounded by double-quotes, R expects to find a variable named height existing in the environment, and no such variable exists, because we have neither created one directly (with an assignment statement) nor passed one in as an argument to the function.

In the same line of code, we also refer to an entity width. Note that R does not complain about being unable to find width. That is because we defined an input argument called width.

  1. How would you modify function compute_area to eliminate the error?
  2. How would this change the form of the call to compute_area?

Taking your time

Traditionally, new programmers find the syntax of argument passing extremely confusing. There are too many interacting parts: we have variable names in the declaration, variables used in the code body, and values passed in the call. At first exposure, it can be unclear how all these parts work together. If this is the first time you have seen this syntax, take some time to experiment with it to solidify your understanding. Make up some simple user defined functions of your own to practice managing input data.

Getting output from a function

In all of our examples so far, we have used a print statement to show the result of the function’s computation. This is, of course, not adequate in real programming, as a print statement simply writes to the console. The computed value is not available for use later in our script. We want to be able to save the result of a function into a variable, as we have done with the built-in functions we have used earlier in the course. For example, we have said x <- sqrt(22), creating a variable called x that stores the square root of 22. We could then use x in subsequent computations, as needed.

When a function makes its result available for storage, we say it is returning its result. The command sqrt(22) returns the square root of 22, and we can store it in a variable using the assignment statement.

We can make our user-defined function behave in the same way, by using return instead of print. For example, we can modify calc_bmi:

# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  return(bmi)
}

With this change to the function, we can store the result of the BMI computation in a variable for later use.


# Save the return value in a variable.
bmi_result <- calc_BMI(75, 1.75)

# Display by name, as we always can do in R with variables
bmi_result
#> [1] 24.4898

CAUTION

In R, the return keyword is actually optional. By default, R functions just return the value of their last line of code. So technically, you can write the function above like this without changing its behaviour:

# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  # Omitting the explicit call to return. Don't do this.
  bmi
}

You will see this shortcut used in R code in the wild. However, it is an old-fashioned syntax, and it can lead to subtle errors in complex R programs. We suggest that you avoid it. Make the behaviour of your functions clear by explicitly identifying the return value with function return.

Function exercise

Returning to the Palmers Penguins data, and using the techniques we have seen for performing descriptive statistics, consider the following code, which computes summaries of the body mass measure (i.e. the dependent variable in this summary is column body_mass_g).

library(palmerpenguins)

dv_vector <- penguins$body_mass_g

mean_dv <- mean(dv_vector, na.rm = TRUE)
sd_dv <- sd(dv_vector, na.rm = TRUE)
min_dv <- min(dv_vector, na.rm = TRUE)
max_dv <- max(dv_vector, na.rm = TRUE)

This code fragment creates four new variables that we could use in later computations, or in generating reports.

Imagine that we wish to do the same summary for flipper length. We could copy and paste the code, changing the line where we assign variable dv_vector. Then if we wanted to do the same summary on bill length, we could copy and paste the code again, and again change the assignment to dv_vector. By now, we are bored of this, and our script is getting needlessly big and messy. If we later decide we need to include the median, we will need to go back and add the median command multiple times, greatly increasing the chance of errors (and also being boring).

Computing this set of descriptive stats is a logically delineated task, and we should therefore encapsulate it into a user defined function. Each time we do this summary, the computation (the four calls to mean, sd, min and max) remains the same, but the data we wish to process changes. Each time we call the function, we want to be able to provide it with the data to process. We must therefore pass the data in as an input argument.

Make a function with input arguments

Convert this code into a function that accepts a single input argument, which will be the vector of data to be analysed. For now, use print statements to display the results, not return statements, so we can concentrate on getting the data in. Call your function desc_stats.

library(palmerpenguins)

desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  print(mean_dv)
  print(sd_dv)
  print(min_dv)
  print(max_dv)
}  

Source the function by selecting all the code (from before desc_stats down to and including the closing curly bracket) and typing ctrl-Enter or cmd-Enter. Test your function by calling it on the body mass data, the flipper length data, and the bill length data.

desc_stats(penguins$body_mass_g)
#> [1] 4201.754
#> [1] 801.9545
#> [1] 2700
#> [1] 6300
desc_stats(penguins$flipper_length_mm)
#> [1] 200.9152
#> [1] 14.06171
#> [1] 172
#> [1] 231
desc_stats(penguins$bill_length_mm)
#> [1] 43.92193
#> [1] 5.459584
#> [1] 32.1
#> [1] 59.6

Stop to admire how tidy and succinct your code is. Note that if you now decide to add the median, you only need to make the change in one place – in the function declaration itself – regardless of how many different data sets you have processed.

NB: If you do change the function, you must source the function again (i.e. you must repeat the “select all the function code and type ctrl-enter” step) to update R’s stored copy of the function. You will not see any change in behaviour until the modified function code is sourced.

Get output from the function

As discussed earlier, displaying function output via print statements is of limited utility; it is preferable to return the results of a function, so it can be stored in a variable. We would like, therefore, to return the output of function desc_stats. Unfortunately, a function in R can only return a single data object and our function computes four values.

To resolve this, we can bundle up our four computed values into a single data object, called a list. An R list is like a vector, except that each element has a name as well as an ordinal position, and elements can be retrieved (using the [] operator) either by name or position. Consider this example:

# Create a list
character_list <- list(Name = "Snoopy", Breed = "Beagle", Owner = "Charlie Brown")

# Display the entire list
character_list
#> $Name
#> [1] "Snoopy"
#> 
#> $Breed
#> [1] "Beagle"
#> 
#> $Owner
#> [1] "Charlie Brown"

# Some examples of selection from the list
character_list["Name"]
#> $Name
#> [1] "Snoopy"
character_list["Breed"]
#> $Breed
#> [1] "Beagle"

character_list[2]
#> $Breed
#> [1] "Beagle"
character_list[3]
#> $Owner
#> [1] "Charlie Brown"

character_list$Name
#> [1] "Snoopy"
character_list$Owner
#> [1] "Charlie Brown"

A list is considered a single data object, so it can be returned from a function via a return statement. Modify function desc_stats to return its four outputs, rather than printing them. Call your function and use any of the syntactic options above to explore the contents of the returned object.

My solution is:

# Modify the function to return a list
desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  result_list <- list(Mean = mean_dv,
                      Sd = sd_dv,
                      Min = min_dv,
                      Max = max_dv)
  
  return(result_list)
}


flipper_desc_stats <- desc_stats(penguins$flipper_length_mm)

flipper_desc_stats$Mean
#> [1] 200.9152

flipper_desc_stats[2]
#> $Sd
#> [1] 14.06171

min_max <- c(flipper_desc_stats["Min"], flipper_desc_stats["Max"])

print(min_max)
#> $Min
#> [1] 172
#> 
#> $Max
#> [1] 231

Scope

When you specify a variable in R it will start trying to find something with that name within the global environment (displayed in the Environment tab in RStudio). In the case of functions, any variable defined in the function (including through its arguments) stays within the function (a separate local environment specific to the function). If however in the body of the function you refer to a variable that hasn’t been defined in the function, R will start looking at the global environment and if it finds a variable of the same name you’ve created outside of the function, it will use the value that is stored within it. This behaviour can cause issues.

Here is an example, where the function needs a value for n but it hasn’t been supplied as an argument and there is no default value.

# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

multiply_by_n(x = 3)
#> Error in multiply_by_n(x = 3): object 'n' not found

In the function body we referred to n which wasn’t defined anywhere so we got an error.

Let’s use the same function again but define n outside the function:

# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

# define n in the global environment 
n <- 10

multiply_by_n(x = 3)
#> [1] 30

This time R looks for n inside the body but doesn’t find it and when it looks into the global environment it finds a variable named n and so uses that value.

This time we’re going to modify the function to take a second argument called n, and also have n defined in the global environment:

# multiplies the number x by the number n
multiply_by_n <- function(x, n){
  x * n
}

# define n in the global environment
n <- 10

multiply_by_n(x = 3, n = 2)
#> [1] 6

The value of n that was used was the value supplied as the argument, rather than the version that was defined in the global environment. Thus a function will use a locally defined variable when one exists, but if it can’t find one, it will look in the global environment.

This behaviour can lead to subtle errors in your code. For example, earlier in this module we wrote a version of user-defined function calc_area that tried to reference variable height without having passed it in as an argument. When R was unable to find height, it alerted us with an error message, and we were able to identify and correct the error in the code (the missing input argument height). If, at that point, we had already created a variable height outside the function (for any reason), R would have happily used that existing variable when executing the function code body and we would not have realised there was an error. Of course, if we ever called the bad function when we hadn’t previously created height, our program would crash. Some programmers use special variable coding styles – for example prefixing all input arguents with underscore – to prevent confusion between local and global variables. It is very important, when working in R to be constantly aware of which variables exist in the global environment. Keep your eye on the Environment tab as you write functions.

Complex Program Behaviour

In very simple programs, we write a set of commands which are executed in sequence – first statement, second statement, third statement, etc. Each time we run the code, the exact same set of commands are executed, in the exact same sequence.

However, as computational problems become more complex, the behaviour of our code solutions must also become more complex. We may have sections of code that we want to repeat some varying number times depending on our data, or sections of code that we only want to execute under certain conditions.

The specific time course of program execution is called flow of control, and R provides many syntactic features that allow us to manage it. Flow of control generally depends on the program’s state – the specific set of variables and their values at each point during program execution.

For example, when we import a csv file into a data frame, that data frame is part of the state. We may want to execute a function once for each row in the data frame. If we import a data frame with 53 rows, we want the function to be called 53 times. Thus the number of times the function is called (the flow of control) depends on the state. Similarly, when we write a function that accepts data input arguments, the value of those arguments passed in at function call are part of the state, and can be different each time the function is called. Frequently we will want to take different actions in the function depending on that state.

We have described the three primary flow of control constructs:

  • Sequential execution: Statements are executed in order
  • Iteration: A set of statements is repeated a number of times
  • Conditional execution: A set of statements may or may not be executed, depending on the state.

We have already seen, and written, code that contains only sequential exection. We will now look at the syntactic tools for conditional execution. In our next module, we will cover iteration, which is syntactically more complex.

If statements

All modern programming languages allow you to wrap a block of code in an if statement. If statements contain a condition, which is an expression which evaluates to either true or false. At runtime, the condition is evaluated. If it is true, the block of code is executed; if it is false, the block of code is not executed. Conditional code blocks in R are delineated with curly brackets. The conditional expression is delineated with round brackets.

Schematically:

if (condition) {
  # code here is only run if condition was TRUE
}

Consider this toy example:

did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed!")
  }
}

# This call generates no output
did_I_pass_the_paper(14)

# This call generates output
did_I_pass_the_paper(73)
#> [1] "You passed!"

Adding an alternative with else

Frequently we wish to define two behaviours for a conditional expression – one for when it is true and another for when it is false. In R we do this with the keyword else

if (condition) {
  # code here is only run if condition was TRUE
} else {
  # code here is only run if condition was FALSE
}

For example:

did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed! :)")
  } else {
    print("Sorry, you didn't pass. :(")
  }
}

# This call runs the else block
did_I_pass_the_paper(14)
#> [1] "Sorry, you didn't pass. :("

# This call runs the if block
did_I_pass_the_paper(73)
#> [1] "You passed! :)"

NB: Pay careful attention to the placement of the curly brackets for both the if block and the else block. The first curly bracket must sit on the same line as the if statement. The else keyword must be on the same line as the closing curly bracket of the if block, and must be followed, on that same line, by the opening curly bracket of the else block. It is a known peculiarity of the R language that it is extremely fussy about this rule. No use fighting it; just follow the rule.

An extension of else is the else if construct that lets you link a series of conditions. The conditions are tested one at a time from the top and the first condition that evaluates to TRUE is the only code block that gets run. For example:

bmi_category <- function(bmi)
{
    if(bmi > 30){
      print("obese")
    } else if (bmi > 25){
      print("overweight")
    } else if (bmi > 20){
      print("healthy")
    } else {
      print("underweight")
  }
}


bmi_category(22)
#> [1] "healthy"

bmi_category(18)
#> [1] "underweight"

Conditional statements can be nested. That is, inside the if or else block, you can have more conditional statements, each of which can have if and else blocks, each of which can in turn have nested conditional statements, and so on. For complex computations, the conditional logic can become very elaborate, and needs to be approached carefully. I find it helpful in these cases to sketch out a flow chart showing all the different outcomes based on state, and use that as a pattern for writing and arranging the various code blocks.

Complex conditional expressions

In our previous examples, we wrote conditional expressions using the > (greater than) operator. The expression my_mark > 50 evaluates to either true or false (i.e. a boolean value). If variable my_mark is greater than 50, the expression returns true, otherwise it returns false. Greater than is a comparison operator. The R comparison operators are:

Operator Meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

The Boolean logic operators can be used in to modify or combine conditional expressions.

Boolean Operation Symbol in R
NOT !
OR |
AND &

For example, the following function might be used to check that a value entered as a penguin body mass was within the expected weight range for the species.

# Chinstrap penguins weight between 3 and 5 kg

check_chinstrap_weight <- function(weight_to_check)
{
  if ((weight_to_check >= 3000) & (weight_to_check <= 5000)) {
    print("Weight ok")
  } else {
    print("That's probably a typo")
  }
}


check_chinstrap_weight(4100)
#> [1] "Weight ok"


check_chinstrap_weight(410)
#> [1] "That's probably a typo"

The result of the NOT, AND, and OR are shown in the logic table:

Statement Becomes
!TRUE FALSE
!FALSE TRUE
TRUE & TRUE TRUE
TRUE & FALSE FALSE
FALSE & TRUE FALSE
FALSE & FALSE FALSE
TRUE | TRUE TRUE
TRUE | FALSE TRUE
FALSE | TRUE TRUE
FALSE | FALSE FALSE

Conditional exercise

Write and test a function that determines whether a student receives a passing grade on an assessment. The function should accept two input args: the number of marks earned, and the total number of marks possible for the assessment. The student must earn 50% of the available marks in order to pass. For example, if a student earns 18 marks on a 20 mark assessment they pass, but if they earn only 8 marks, they fail. Your function should print “Pass” or “Fail” as appropriate based on the input data.


pass_check <- function(earned, possible)
{
  mark <- earned/possible
  
  if (mark > 0.5){
    print("Pass")
  } else {
    print("Fail")
  }
  
}


pass_check(18, 20)
#> [1] "Pass"
pass_check(8,20)
#> [1] "Fail"
pass_check(18,100)
#> [1] "Fail"
pass_check(8,10)
#> [1] "Pass"

Function Error Checking

We can use conditional control statements to provide error checking in user-defined functions.

Failing

One of the sayings in programming is “if it’s going to fail, it’s best to fail early”. That is, if we know that our function requires a specific input data type, we want to program defensively so that our function “fails” before it encounters the error. As part of our defensive programming we can provide informative error messages, rather than rely on R’s generic ones.

In the following example, we check that the data coming into our function is numeric. If it is not, we use function stop, to exit the function, displaying our informative message.

# Returns the provided number doubled
double_number <- function(x) {
  if( !is.numeric(x) ){
    stop("x needs to be a number.")
  }
  x * 2
}


double_number(4)
#> [1] 8

double_number("a")
#> Error in double_number("a"): x needs to be a number.

N.B. Check the appendix for more on data types.

Conclusion

This document has presented an introduction to creating your own functions and implementing conditional flow of control. For more detail, see the free online books Advanced R and R packages.

What’s Next

Fill in the module feedback form https://tinyurl.com/r4ssp-module-fb.

Next move onto R for Data Science Chapter 21 - Iteration (https://r4ds.had.co.nz/iteration.html) to learn about how we can reduce code duplication, by repeating code efficiently through iteration or ‘loops’. As always, if you run into trouble, let us know.

Appendix

Data types

Not all data are created equal, in R this concept is captured by data types. For a vector, all values must be of a single data type.

The main data types that you will encounter in R are:

  • Logical ( c(TRUE, FALSE))
  • Numeric - also called Real or Double (Numbers that have a floating point (decimal) representation e.g. c(1, 3.6, 1e3))
  • Character - also called String (anything inside matching opening and closing quotes (single or double) e.g. c("a", "some words", "animal"))

There are 3 other less common:

  • Integer (integers c(1L, 4L, -3L))
  • Complex (Complex numbers e.g. c(0+3i, 4i, -2-5i))
  • Raw (the bytes of a file)

Each data type is known in R as an atomic vector. R has built in functions to be able to determine the data type of a vector, typeof() is the best one to use, but others such as str() and class() can be used.

There is also a series of functions that let us do explicit checking for a data type which will return TRUE or FALSE:

  • is.logical()
  • is.numeric() or is.double()
  • is.character()
  • is.integer()
  • is.complex()
  • is.raw()

Data Type Coercion

In R, when doing operations on multiple vectors, they all need to be the same data type - but how can this work if we have for example a numeric vector and a character vector? Coercion is how R deals with trying to operate on two vectors of different data types. What this means in practice is that R will convert the data type of a vector in a defined manner such that we end up will all of the same type and follows a “lowest common data type” approach. Using the 3 main data types from above, the following is the order in which they will be coerced into the next data type: logical -> numeric -> character.

This principle applies when you try to create a vector of mixed data types too, R will coerce everything until it is a single data type.

See if you can predict what data type the result will be (you can check by using typeof():

# logical and numeric
c(4, TRUE, 5)

# numeric and character
c(1, 3, "A")

# logical and character
c(FALSE, "cat","frog")

# mixed
c("see", TRUE, 4.8)

# tricky
c("1.3", "4", TRUE)

We can also explicitly force coercion into a particular data type by using the following:

  • as.logical()
  • as.numeric()
  • as.character()

The other data types also have similarly named functions. When going against the normal direction of coercion, it is important to realise that if your data doesn’t have a representation in that data type, it will become NA (missing).

---
title: "Functions and Choices"
date: "Semester 2, 2022"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    code_download: true
    code_folding: show
---

```{r setup, include=FALSE}
library(knitr)

knitr::opts_chunk$set(
  comment = "#>",
  fig.path = "figures/07/", # use only for single Rmd files
  collapse = TRUE,
  echo = TRUE
)


```

> #### Associated Material
>
> Zoom notes: [Zoom notes 07 - Functions and Choices](zoom_notes_07.html)
> 
> Readings:
>
> - [R for Data Science - Chapter 19](https://r4ds.had.co.nz/functions.html)

\

# Introduction

As the work you do in R becomes more complex, your R scripts will get longer, and may become difficult to manage. You may find that it is difficult to locate particular bits of code, or that you seem to be writing the same, or very similar bits of code, over and over again. If you want to modify those bits, there are multiple place in the script you have to edit, typos start to creep in, and the whole process becomes unpleasant.

To prevent this, large programs need to be organised into **logical modules**. Sections of code that perform a clearly delineated task can be enacapsulated and named. The encapsulated code can be invoked simply by typing the name -- no need to cut and paste, no need to modify multiple copies. Scripts are shorter and easier to manage. Code that is organised this way is said to be **modular**. Modular code is better code.

In R, the main code module is the **function**. We have already used many functions, like `read.csv` and `sqrt` and `ggplot`. We invoke them in our code as single commands, but each of these commands actually ecapsulates many, many (add another "many" for ggplot) lines of code. We call the function by name, and all the associated code is executed.

Many functions in R will display their code in the console. Type the function name **without the ()** into the console, and the encapsulated commands will be shown. (Look at the code for function `filter`.)

To make our own code modular, we can define our own functions. We write the code for a function somewhere in our script file. We can then **call** the function by name anywhere else in our script file, and all the code is executed. Just as with built-in functions, we can pass data into our function, and it can operate on those data. We can also arrange for our function to **return a result** that we can store in a variable. 


# Function Declaration Syntax

A function in R is comprised of four parts:

1. a name
2. the body (the code that does something)
3. (optionally) input data
4. (optionally) an output that can be stored


## Creating a function

We will begin with the simplest function, one that accepts no input data, and returns no result. We **declare** the function using keyword `function`. Schematically, a user-defined function is created as:

```
name_of_function <- function()
{
   Code Body
}
``` 

The curly braces enclose the code body.


Subesquently we **call** the function as:

*name_of_function*()

When the call is executed, all the commands in the code body are run.

For example, in the current public health situation, we often see discussions of what body temperature constitutes a fever. We might want to be able to translate that temperature from Celsius to Fahrenheit (as used in the North American literature). This is a logically well-delineated computation, so we would encapsulate it in a function.

```{r declare a function}
# Declare/define the function
fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}
```

```{r call}
# Call a user-defined function by name
fever_in_fahrenheit()
```

Some things to note:

-  Typing out the function declaration is not enough to make the function available for calling. The declaration code (from the start of the name to the closing curly bracket, inclusive) **must first be executed**. Execute this code in a script as we always do, by selecting all the code (with the mouse) and typing ctrl-Enter (Windows) or cmd-Enter (Mac). This is called **sourcing** the function.
- When you source a function, **nothing happens**. In our example, you will not see the output of the print statement when you execute the declaration code. Sourcing a function declaration **does not run the code body**. It simply *parses* the code body and, if there are no errors, stores it in the environment. Effectively, it informs RStudio that a function with this name exists, and defines the code which it encapsulates, so it can be called later.
- To execute the code body, state the name of the function followed by (), with no intervening spaces. This **calls** the function; we think of () as the **call operator**. Whether the code body contains 1 line or 1,000 lines, or more, calling the function by name runs all the encapsulated code.


In R a function cannot be called until *after* the function has been sourced. Consider the following example:

```{r clear, echo=FALSE}
rm(list = ls())
```

```{r wrong order, error=TRUE}

# If you try to call the function before it is declared and sourced
fever_in_fahrenheit()

fever_in_fahrenheit <- function()
{
  fever_in_celsius <- 37.5
  converted_to_fahrenheit <-(fever_in_celsius*9/5) + 32
  print(converted_to_fahrenheit)
}

```

There are more advanced techniques that allow us to call functions contained in other script files where we cannot directly select and execute the declaration code. See the reading for discussion.


## Providing data to a function

Consider the following function, which demonstrates how to compute BMI (body mass index) in R, using the given values for height (in metres) and weight (in kg):

```{r calc BMI}

# Declare function calc_BMI
calc_BMI <- function()
{
  weight <- 73
  height <- 1.68
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI()

```
This function contains nice tidy code and is mathematically correct, but unless we happen to want the BMI of a person with exactly this height and weight, it is of no use to us. A useful function defines the computation (in this case taking the ratio of weight to height squared), and **accepts the data values** when it is called.

We have seen this many times when writing R code, when we call a function repeatedly on different input values:

```{r built in inputs}

sqrt(14)

sqrt(820)

sqrt(0.65)

```

To declare a function that can accept input arguments, perform these steps:

1. Between the round brackets that follow the keyword `function`, place a **variable name** for each piece of data you wish to input when the function is called.
2. In the function body refer to the input variables **by the name you placed between the round brackets**. You DO NOT need to initialise the variables inside the code body -- in fact you MUST NOT do so. Variables with those names are automatically created for you when the function is called. 
3. When calling the function, provide **a value** for each input variable. You do not need to repeat the variable name, just provide values, separated by commas, in the same order as the variables are listed in the declatration.
4. When the function is called, the system creates the input variables, assigns them the corresponding values from the call statement, and executes the code body, using those variables.

We can modify our `calc_BMI` function to accept weight and height as input variables, as shown. Compare this version to the earlier version. Note that we *do not* declare and initialise weight and height inside the code body.

```{r inputs}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  # Use input arguments weight and height. DON'T initialise them.
  bmi <- weight/height^2
  
  print(bmi)
}

# Call function calc_BMI
calc_BMI(73, 1.68)

```

### Following the rules

Declaring a function with inputs defines the required **interface** of the function. That is, it defines what you have to provide if you want to call the function. If the caller violates the interface, the code is not guaranteed to work. For example, function `calc_bmi` is declared with two input arguments. Therefore, all of these calls violate the interface.

```{r bad args 1, error=TRUE}
calc_BMI()
```

```{r bad_args 2, error=TRUE}
calc_BMI(16)
```

```{r bad args 3, error=TRUE}
calc_BMI(73, 168, 42)
```

In some programming languages, one is required to specify the data type (e.g. number, string, data frame, etc.) of each input argument. Code that tries to pass in the wrong type of data will not compile. In R this is not required. R will try to run your code no matter what sort of data it gets. However, if it tries to operate on the wrong type of data, it will throw an error:


```{r bad args 4, error=TRUE}
calc_BMI(42, "fred")
```

Naturally, R doesn't understand how to apply the exponentiation operator to "fred", and it is telling you so.

### More on the rules

Consider the following code:

```{r args 2a, eval = FALSE}
# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)
```


The function `compute_area` is defined with one argument. Therefore, when it is called, we must provide one value between the round brackets. In the call, we have correctly provided one argument, of the correct data type. Will the call `compute_area(35)` work correctly, or will it throw an error? What error will it throw?

```{r args 2b, error = TRUE}
# Declare the function
compute_area <- function(width)
{
  area <- width * height
  print(area)
}

# Call the function
compute_area(35)
```

When `compute_area(35)` is executed, it produces an error message: `...object 'height' not found`. In the code body of `compute_area`, we refer to an entity `height`. Since that entity is not surrounded by double-quotes, R expects to find a variable named **height** existing in the environment, and no such variable exists, because we have neither created one directly (with an assignment statement) nor passed one in as an argument to the function. 

In the same line of code, we also refer to an entity `width`. Note that R does not complain about being unable to find `width`. That is because we defined an input argument called **width**. 

1. How would you modify function `compute_area` to eliminate the error?
2. How would this change the form of the call to `compute_area`?


### Taking your time

Traditionally, new programmers find the syntax of argument passing *extremely* confusing. There are too many interacting parts: we have variable names in the declaration, variables used in the code body, and values passed in the call. At first exposure, it can be unclear how all these parts work together. If this is the first time you have seen this syntax, take some time to experiment with it to solidify your understanding. Make up some simple user defined functions of your own to practice managing input data.


## Getting output from a function

In all of our examples so far, we have used a print statement to show the result of the function's computation. This is, of course, not adequate in real programming, as a print statement simply writes to the console. The computed value is not available for use later in our script. We want to be able to save the result of a function into a variable, as we have done with the built-in functions we have used earlier in the course. For example, we have said `x <- sqrt(22)`, creating a variable called *x* that stores the square root of 22. We could then use `x` in subsequent computations, as needed.

When a function makes its result available for storage, we say it is **returning** its result. The command `sqrt(22)` **returns** the square root of 22, and we can store it in a variable using the assignment statement.

We can make our user-defined function behave in the same way, by using `return` instead of `print`. For example, we can modify `calc_bmi`:

```{r return}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  return(bmi)
}
```

With this change to the function, we can store the result of the BMI computation in a variable for later use.

```{r saving a return}

# Save the return value in a variable.
bmi_result <- calc_BMI(75, 1.75)

# Display by name, as we always can do in R with variables
bmi_result
```

### CAUTION

In R, the `return` keyword is actually optional. By default, R functions just return the value of their last line of code. So technically, you can write the function above like this without changing its behaviour:

```{r blank return}
# Declare function calc_BMI
calc_BMI <- function(weight, height)
{
  bmi <- weight/height^2 # ^ is the exponentiation operator
  
  # Omitting the explicit call to return. Don't do this.
  bmi
}
```

You will see this shortcut used in R code in the wild. However, it is an old-fashioned syntax, and it can lead to subtle errors in complex R programs. We suggest that you avoid it. Make the behaviour of your functions clear by explicitly identifying the return value with function `return`.


## Function exercise

Returning to the Palmers Penguins data, and using the techniques we have seen for performing descriptive statistics, consider the following code, which computes summaries of the body mass measure (i.e. the dependent variable in this summary is  column *body_mass_g*).

```{r summarising, warning=FALSE, message=FALSE}
library(palmerpenguins)

dv_vector <- penguins$body_mass_g

mean_dv <- mean(dv_vector, na.rm = TRUE)
sd_dv <- sd(dv_vector, na.rm = TRUE)
min_dv <- min(dv_vector, na.rm = TRUE)
max_dv <- max(dv_vector, na.rm = TRUE)

```


This code fragment creates four new variables that we could use in later computations, or in generating reports. 

Imagine that we wish to do the same summary for flipper length. We could copy and paste the code, changing the line where we assign variable `dv_vector`. Then if we wanted to do the same summary on bill length, we could copy and paste the code again, and again change the assignment to `dv_vector`. By now, we are bored of this, and our script is getting needlessly big and messy. If we later decide we need to include the median, we will need to go back and add the `median` command multiple times, greatly increasing the chance of errors (and also being boring).

Computing this set of descriptive stats is a logically delineated task, and we should therefore encapsulate it into a user defined function. Each time we do this summary, the computation (the four calls to `mean`, `sd`, `min` and `max`) remains the same, but the data we wish to process changes. Each time we call the function, we want to be able to provide it with the data to process. We must therefore pass the data in as an input argument.

### Make a function with input arguments

Convert this code into a function that accepts a single input argument, which will be the vector of data to be analysed. For now, use print statements to display the results, not return statements, so we can concentrate on getting the data in. Call your function `desc_stats`.

```{r summarising function}
library(palmerpenguins)

desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  print(mean_dv)
  print(sd_dv)
  print(min_dv)
  print(max_dv)
}  
```

Source the function by selecting all the code (from before `desc_stats` down to and including the closing curly bracket) and typing ctrl-Enter or cmd-Enter. Test your function by calling it on the body mass data, the flipper length data, and the bill length data.

```{r calling}
desc_stats(penguins$body_mass_g)
desc_stats(penguins$flipper_length_mm)
desc_stats(penguins$bill_length_mm)
```

Stop to admire how tidy and succinct your code is. Note that if you now decide to add the median, you only need to make the change in one place -- in the function declaration itself -- regardless of how many different data sets you have processed.

**NB: If you do change the function, you must source the function again (i.e. you must repeat the "select all the function code and type ctrl-enter" step) to update R's stored copy of the function. You will not see any change in behaviour until the modified function code is sourced.**

### Get output from the function

As discussed earlier, displaying function output via print statements is of limited utility; it is preferable to return the results of a function, so it can be stored in a variable. We would like, therefore, to return the output of function `desc_stats`. Unfortunately, a function in R **can only return a single data object** and our function computes *four* values.

To resolve this, we can bundle up our four computed values into a single data object, called a **list**. An R **list** is like a vector, except that each element has a name as well as an ordinal position, and elements can be retrieved (using the [] operator) either by name or position. Consider this example:

```{r list}
# Create a list
character_list <- list(Name = "Snoopy", Breed = "Beagle", Owner = "Charlie Brown")

# Display the entire list
character_list

# Some examples of selection from the list
character_list["Name"]
character_list["Breed"]

character_list[2]
character_list[3]

character_list$Name
character_list$Owner
```

A list is considered a single data object, so it can be returned from a function via a `return` statement.
Modify function `desc_stats` to **return** its four outputs, rather than printing them. Call your function and use any of the syntactic options above to explore the contents of the returned object. 

My solution is:

```{r output multiple items}
# Modify the function to return a list
desc_stats <- function(dv_vector)
{
  mean_dv <- mean(dv_vector, na.rm = TRUE)
  sd_dv <- sd(dv_vector, na.rm = TRUE)
  min_dv <- min(dv_vector, na.rm = TRUE)
  max_dv <- max(dv_vector, na.rm = TRUE)

  result_list <- list(Mean = mean_dv,
                      Sd = sd_dv,
                      Min = min_dv,
                      Max = max_dv)
  
  return(result_list)
}


flipper_desc_stats <- desc_stats(penguins$flipper_length_mm)

flipper_desc_stats$Mean

flipper_desc_stats[2]

min_max <- c(flipper_desc_stats["Min"], flipper_desc_stats["Max"])

print(min_max)


```


# Scope

When you specify a variable in R it will start trying to find something with that name within the global environment (displayed in the _Environment_ tab in RStudio). In the case of functions, any variable defined in the function (including through its arguments) stays within the function (a separate local environment specific to the function). If however in the body of the function you refer to a variable that hasn't been defined in the function, R will start looking at the global environment and if it finds a variable of the same name you've created outside of the function, it will use the value that is stored within it. This behaviour can cause issues.


Here is an example, where the function needs a value for `n` but it hasn't been supplied as an argument and there is no default value.

```{r, error=TRUE}
# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

multiply_by_n(x = 3)
```

In the function body we referred to `n` which wasn't defined anywhere so we got an error.


Let's use the same function again but define `n` outside the function:
```{r}
# multiplies the number x by the number n
multiply_by_n <- function(x){
  x * n
}

# define n in the global environment 
n <- 10

multiply_by_n(x = 3)
```
This time R looks for `n` inside the body but doesn't find it and when it looks into the global environment it finds a variable named `n` and so uses that value.
 
 This time we're going to modify the function to take a second argument called `n`, and also have `n` defined in the global environment:
```{r}
# multiplies the number x by the number n
multiply_by_n <- function(x, n){
  x * n
}

# define n in the global environment
n <- 10

multiply_by_n(x = 3, n = 2)
```
 
The value of `n` that was used was the value supplied as the argument, rather than the version that was defined in the global environment. Thus a function will use a locally defined variable when one exists, but if it can't find one, it will look in the global environment.

This behaviour can lead to subtle errors in your code. For example, earlier in this module we wrote a version of user-defined function `calc_area` that tried to reference variable `height` without having passed it in as an argument. When R was unable to find `height`, it alerted us with an error message, and we were able to identify and correct the error in the code (the missing input argument `height`). If, at that point, we had already created a variable `height` outside the function (for any reason), R would have happily **used that existing variable when executing the function code body** and we would not have realised there was an error. Of course, if we ever called the bad function when we hadn't previously created `height`, our program would crash. Some programmers use special variable coding styles -- for example prefixing all input arguents with underscore -- to prevent confusion between local and global variables. It is very important, when working in R to be constantly aware of which variables exist in the global environment. Keep your eye on the Environment tab as you write functions.



# Complex Program Behaviour

In very simple programs, we write a set of commands which are executed in sequence -- first statement, second statement, third statement, etc. Each time we run the code, the exact same set of commands are executed, in the exact same sequence. 

However, as computational problems become more complex, the behaviour of our code solutions must also become more complex. We may have sections of code that we want to repeat some varying number times depending on our data, or sections of code that we only want to execute under certain conditions.

The specific time course of program execution is called **flow of control**, and R provides many syntactic features that allow us to manage it. Flow of control generally depends on the program's **state** -- the specific set of variables and their values at each point during program execution. 

For example, when we import a csv file into a data frame, that data frame is part of the state. We may want to execute a function once for each row in the data frame. If we import a data frame with 53 rows, we want the function to be called 53 times. Thus the number of times the function is called (the flow of control) depends on the state. Similarly, when we write a function that accepts data input arguments, the value of those arguments passed in at function call are part of the state, and can be different each time the function is called. Frequently we will want to take different actions in the function depending on that state. 

We have described the three primary flow of control constructs:

-  Sequential execution: Statements are executed in order
-  Iteration: A set of statements is repeated a number of times
-  Conditional execution: A set of statements may or may not be executed, depending on the state.

We have already seen, and written, code that contains only sequential exection. We will now look at the syntactic tools for conditional execution. In our next module, we will cover iteration, which is syntactically more complex.

## If statements

All modern programming languages allow you to wrap a block of code in an **if statement**. If statements contain a **condition**, which is an expression which evaluates to either true or false. At runtime, the condition is evaluated. If it is true, the block of code is executed; if it is false, the block of code is not executed. Conditional code blocks in R are delineated with curly brackets. The conditional expression is delineated with round brackets.

Schematically:

```{r conditional pattern, eval = FALSE}
if (condition) {
  # code here is only run if condition was TRUE
}
```

Consider this toy example:

```{r simple if}
did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed!")
  }
}

# This call generates no output
did_I_pass_the_paper(14)

# This call generates output
did_I_pass_the_paper(73)
```


## Adding an alternative with `else`

Frequently we wish to define two behaviours for a conditional expression -- one for when it is true and another for when it is false. In R we do this with the keyword `else`

```{r if else pattern, eval = FALSE}
if (condition) {
  # code here is only run if condition was TRUE
} else {
  # code here is only run if condition was FALSE
}


```

For example:
```{r if else}
did_I_pass_the_paper <- function(my_mark)
{
  if (my_mark > 50) {
    print("You passed! :)")
  } else {
    print("Sorry, you didn't pass. :(")
  }
}

# This call runs the else block
did_I_pass_the_paper(14)

# This call runs the if block
did_I_pass_the_paper(73)
```


NB: Pay careful attention to the placement of the curly brackets for both the if block and the else block. The first curly bracket *must sit on the same line as the if statement*. The `else` keyword must be on the same line as the closing curly bracket of the `if` block, and must be followed, on that same line, by the opening curly bracket of the `else` block. It is a known peculiarity of the R language that it is **extremely fussy** about this rule. No use fighting it; just follow the rule.

An extension of `else` is the `else if` construct that lets you link a series of conditions. The conditions are tested one at a time from the top and the first condition that evaluates to `TRUE` is the only code block that gets run. For example:


```{r else if}
bmi_category <- function(bmi)
{
    if(bmi > 30){
      print("obese")
    } else if (bmi > 25){
      print("overweight")
    } else if (bmi > 20){
      print("healthy")
    } else {
      print("underweight")
  }
}


bmi_category(22)

bmi_category(18)
```

Conditional statements can be nested. That is, inside the `if` or `else` block, you can have more conditional statements, each of which can have `if` and `else` blocks, each of which can in turn have nested conditional statements, and so on. For complex computations, the conditional logic can become very elaborate, and needs to be approached carefully. I find it helpful in these cases to sketch out a flow chart showing all the different outcomes based on state, and use that as a pattern for writing and arranging the various code blocks.


## Complex conditional expressions

In our previous examples, we wrote conditional expressions using the > (greater than) operator. The expression `my_mark > 50` evaluates to either true or false (i.e. a **boolean value**). If variable `my_mark` is greater than 50, the expression returns true, otherwise it returns false. Greater than is a **comparison operator**. The R comparison operators are:

Operator | Meaning
---------|---------
`==` | equal to
`!=` | not equal to
`<` | less than
`<=` | less than or equal to
`>` | greater than
`>=` | greater than or equal to


The **Boolean logic operators** can be used in to modify or combine conditional expressions.


Boolean Operation | Symbol in R
---|---
NOT | !
OR | \|
AND | &


For example, the following function might be used to check that a value entered as a penguin body mass was within the expected weight range for the species.

```{r compound conditional}
# Chinstrap penguins weight between 3 and 5 kg

check_chinstrap_weight <- function(weight_to_check)
{
  if ((weight_to_check >= 3000) & (weight_to_check <= 5000)) {
    print("Weight ok")
  } else {
    print("That's probably a typo")
  }
}


check_chinstrap_weight(4100)


check_chinstrap_weight(410)

```

The result of the NOT, AND, and OR are shown in the logic table:

Statement | Becomes
---|---|---|---
  !TRUE | `r !TRUE`
 !FALSE | `r !FALSE` 
TRUE & TRUE | `r TRUE & TRUE`
TRUE & FALSE | `r TRUE & FALSE`
FALSE & TRUE | `r FALSE & TRUE`
FALSE & FALSE | `r FALSE & FALSE`
TRUE \| TRUE | `r TRUE | TRUE`
TRUE \| FALSE | `r TRUE | FALSE`
FALSE \| TRUE | `r FALSE | TRUE`
FALSE \| FALSE | `r FALSE | FALSE`



## Conditional exercise
Write and test a function that determines whether a student receives a passing grade on an assessment. The function should accept two input args: the number of marks earned, and the total number of marks possible for the assessment. The student must earn 50% of the available marks in order to pass. For example, if a student earns 18 marks on a 20 mark assessment they pass, but if they earn only 8 marks, they fail. Your function should print "Pass" or "Fail" as appropriate based on the input data. 

```{r exercise solution}

pass_check <- function(earned, possible)
{
  mark <- earned/possible
  
  if (mark > 0.5){
    print("Pass")
  } else {
    print("Fail")
  }
  
}


pass_check(18, 20)
pass_check(8,20)
pass_check(18,100)
pass_check(8,10)

```


# Function Error Checking

We can use conditional control statements to provide error checking in user-defined functions.

## Failing

One of the sayings in programming is "if it's going to fail, it's best to fail early". That is, if we know that our function requires a specific input data type, we want to program **defensively** so that our function "fails" before it encounters the error. As part of our defensive programming we can provide informative error messages, rather than rely on R's generic ones.


In the following example, we check that the data coming into our function is numeric. If it is not, we use function `stop`, to exit the function, displaying our informative message.


```{r, error=TRUE}
# Returns the provided number doubled
double_number <- function(x) {
  if( !is.numeric(x) ){
    stop("x needs to be a number.")
  }
  x * 2
}


double_number(4)

double_number("a")
```

N.B. Check the appendix for more on data types.

# Conclusion
This document has presented an introduction to creating your own functions and implementing conditional flow of control. For more detail, see the free online books [Advanced R](https://adv-r.hadley.nz) and [R packages](https://r-pkgs.org).


# What's Next

Fill in the module feedback form [https://tinyurl.com/r4ssp-module-fb](https://tinyurl.com/r4ssp-module-fb).

Next move onto [R for Data Science Chapter 21 - Iteration (https://r4ds.had.co.nz/iteration.html)](https://r4ds.had.co.nz/iteration.html) to learn about how we can reduce code duplication, by repeating code efficiently through iteration or 'loops'. As always, if you run into trouble, let us know.

# Appendix

## Data types

Not all data are created equal, in R this concept is captured by _data types_. For a vector, all values must be of a single data type.

The main data types that you will encounter in R are:

- _Logical_ ( `c(TRUE, FALSE)`)
- _Numeric_ - also called _Real_ or _Double_ (Numbers that have a floating point (decimal) representation e.g. `c(1, 3.6, 1e3)`)
- _Character_ - also called _String_ (anything inside matching opening and closing quotes (single or double) e.g. `c("a", "some words", "animal"`))

There are 3 other less common:

- _Integer_ (integers `c(1L, 4L, -3L)`)
- _Complex_ (Complex numbers e.g. `c(0+3i, 4i, -2-5i)`)
- _Raw_ (the bytes of a file)


Each data type is known in R as an _atomic_ vector. R has built in functions to be able to determine the data type of a vector, `typeof()` is the best one to use, but others such as `str()` and `class()` can be used.

There is also a series of functions that let us do explicit checking for a data type which will return `TRUE` or `FALSE`:

- `is.logical()`
- `is.numeric()` or `is.double()`
- `is.character()`
- `is.integer()`
- `is.complex()`
- `is.raw()`



### Data Type Coercion

In R, when doing operations on multiple vectors, they all need to be the same data type - but how can this work if we have for example a numeric vector and a character vector? _Coercion_ is how R deals with trying to operate on two vectors of different data types. What this means in practice is that R will convert the data type of a vector in a defined manner such that we end up will all of the same type and follows a "lowest common data type"
 approach. Using the 3 main data types from above, the following is the order in which they will be coerced into the next data type: _logical_ -> _numeric_ -> _character_.
 
This principle applies when you try to create a vector of mixed data types too, R will coerce everything until it is a single data type.
 

See if you can predict what data type the result will be (you can check by using `typeof()`: 
```{r, eval = FALSE}
# logical and numeric
c(4, TRUE, 5)

# numeric and character
c(1, 3, "A")

# logical and character
c(FALSE, "cat","frog")

# mixed
c("see", TRUE, 4.8)

# tricky
c("1.3", "4", TRUE)
```

We can also explicitly force coercion into a particular data type by using the following:

- `as.logical()`
- `as.numeric()`
- `as.character()`

The other data types also have similarly named functions. When going against the normal direction of coercion, it is important to realise that if your data doesn't have a representation in that data type, it will become _NA_ (missing).
