Associated Material

Zoom Notes: Zoom Notes 08 - Repeating Code

Readings:

  • R for Data Science - Chapter 21 (https://r4ds.had.co.nz/iteration.html)


Introduction

In the previous module we learned to make our code more modular by organising it into functions, each of which encapsulates some logically distinct task. By repeatedly calling a function, we can execute the same code many times, while only having to type it out once. This is code reuse and is an important goal in efficient software development.

We can also achieve code reuse through iteration. Iterative constructs (more casually called loops) allow us to write a code block once, then instruct R to execute it a specified number of times. The number of iterations can be made to vary depending on the state. (Recall that state is the set of values of all the variables in the environment when a command is executed.) For example, we can write a loop that iterates as many times as there are rows in a data frame that we read from a file before the loop starts.

In R, as in most modern programming languages, there are different types of loops, with subtle differences in behaviour. We begin with the most general kind of loop – the for loop.

Syntax

In R a for loop has the following structure:

for (variable_name in some_kind_of_sequence)
{
    Code block to be repeated. Can be as long as required.
}


The keywords for and in are required, as are the round brackets in the loop header, and the curly brackets that delineate the code body.

The some_kind_of_sequence is usually a vector. The code body is executed as many times as there are elements in the vector.

We will discuss the role of the variable_name element later in the module.

(It is also syntactically acceptable to place the opening curly bracket on the same line as the for loop header, after the closing round bracket, separated by a space.)
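
As a small sketch of the same-line bracket style just mentioned (the loop driver name i and the vector c(1, 2, 3) are chosen here purely for illustration):

```r
# Opening curly bracket on the same line as the loop header
for (i in c(1, 2, 3)) {
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3
```

Both layouts behave identically; which one to use is a matter of style.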

Basic for loop

A popular cheer in Australian sports is “Aussie!, Aussie!, Aussie!”. Assume that (for some inexplicable reason) you wished to print this cheer. That is, you want to print “Aussie!” to the console three times. Using only sequential code you would write:

print("Aussie!")
print("Aussie!")
print("Aussie!")

You are executing the identical line of code three times. Having to type the same line of code multiple times is boring, and increases the number of opportunities for typos and bugs to sneak in. (In this toy example, we are only repeating one line of code three times; in a real computational situation we might need to repeat dozens of lines of code hundreds of times.) Using a for loop, we can achieve the same output while only typing the command once.

for (cheer in c(1,2,3))
{
  print("Aussie")
}
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"

Match the parts of this loop to the schematic, noting the position of the keywords for and in, the round brackets, and the curly brackets. Note the some_kind_of_sequence element which, in this case, is the vector created by c(1,2,3). Since there are three elements in the vector, the code body is executed three times.

Vector sequences

If we wanted to print “Aussie!” 5 times, we could increase the length of the vector to 5. If we wanted to print it 100 times (could happen) we could increase the length of the vector to 100, but using function c() for this is too cumbersome. In R, we have an alternative for generating a sequence of numbers – the : operator.

small_seq <- 1:5
small_seq
#> [1] 1 2 3 4 5

big_seq <- 1:50
big_seq
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

There are more powerful R functions for generating sequences (see, for example, seq and rep) but for basic for loops, the : operator is sufficient. We can extend our cheer-printing for loop:

for (cheer in 1:10)
{
  print("Aussie")
}
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
#> [1] "Aussie"
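
The functions seq and rep mentioned above can also produce the sequence for a loop header. The following is a minimal sketch; the variable names evens, threes, and value are illustrative and not part of the original example.

```r
# seq() generates a sequence with a chosen step size
evens <- seq(from = 2, to = 10, by = 2)
evens
#> [1]  2  4  6  8 10

# rep() repeats a value a given number of times
threes <- rep(3, times = 4)
threes
#> [1] 3 3 3 3

# Either result can drive a for loop, just like 1:10
for (value in evens)
{
  print(value)
}
#> [1] 2
#> [1] 4
#> [1] 6
#> [1] 8
#> [1] 10
```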


The Loop Driver

Recall the schematic for a for loop in R:

for (variable_name in some_kind_of_sequence)
{
    Code to be repeated. Can be as long as required.
}


The variable_name element can be any legal R variable name. This element is called the loop driver. Inside the code body of the for loop, the loop driver variable is always available. For example, in our “Aussie!, Aussie!, Aussie!” example, our loop driver was called cheer. In our code body we could have referred to variable cheer.

As discussed above, the for loop code body is executed as many times as there are elements in the sequence in the header. At each pass through the loop, the loop driver variable automatically takes on the value of the corresponding element of the sequence. That is, in the first pass of the loop, the loop driver holds the value of the first element of the sequence; in the second pass of the loop, the loop driver holds the value of the second element of the sequence, and so on.

We illustrate this by extending our previous example to print the value of loop driver cheer inside the code body:

for (cheer in 1:3)
{
  print("Aussie")
  print(cheer)
}
#> [1] "Aussie"
#> [1] 1
#> [1] "Aussie"
#> [1] 2
#> [1] "Aussie"
#> [1] 3

Note that you never assign a value to the loop driver – it takes on its values automatically as the for loop runs.

In all our examples so far, the sequence in the for loop header has been numbers from 1 to n. In R, the sequence can be a vector of any type, and the loop driver will take on whatever values the sequence contains. For example, we can drive a for loop with a vector of strings:

sports <- c("Rugby", "Cycling", "Ice Skating")
for (current_sport in sports)
{
  # Print the sport name
  print(current_sport)
  
  # Compute the number of characters in the sport name
  print(nchar(current_sport))
}
#> [1] "Rugby"
#> [1] 5
#> [1] "Cycling"
#> [1] 7
#> [1] "Ice Skating"
#> [1] 11


Loop driver as index

Throughout the course we have seen that R relies heavily on managing data in ordered collections (vectors, lists, and data frames). We select elements from those collections using the selection operator [], providing the ordinal position of the element of interest. Since loop driver variables take on each value in their sequence, we can use a for loop to step through each element of a collection in turn.
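
A minimal sketch of this idea with a plain vector, before moving on to data frames. It assumes the base R function seq_along, which returns the positions 1, 2, ..., up to the length of its argument; the names position and label are illustrative.

```r
sports <- c("Rugby", "Cycling", "Ice Skating")

# seq_along(sports) produces 1, 2, 3 -- one index per element
for (position in seq_along(sports))
{
  # Use the loop driver as an index to fetch the element
  label <- paste(position, sports[position])
  print(label)
}
#> [1] "1 Rugby"
#> [1] "2 Cycling"
#> [1] "3 Ice Skating"
```

Driving the loop with positions rather than the values themselves gives the code body access to both the position and the element at that position.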

For example, assume you have a data frame containing one student’s internal and exam marks for the papers taken in one semester:

paper_codes <- c("HUBS192", "ECOL212", "STAT260")
internal_marks <- c(87, 85, 62)
exam_marks <- c(93, 84, 85)

marks_df <- data.frame(PaperCode = paper_codes,
                       InternalMark = as.numeric(internal_marks),
                       ExamMark = as.numeric(exam_marks))

marks_df
#>   PaperCode InternalMark ExamMark
#> 1   HUBS192           87       93
#> 2   ECOL212           85       84
#> 3   STAT260           62       85

We know that we can select all columns from a single row in that data frame using the syntax [row number, ]

# Select and print row 2
selected_row <- marks_df[2, ]
print(selected_row)
#>   PaperCode InternalMark ExamMark
#> 2   ECOL212           85       84

With a for loop, we can use the loop driver to iterate over all the rows, processing each in turn:

for (index in 1:3)
{
  # Use the loop driver to select a row from the data frame
  selected_row <- marks_df[index, ]
  
  # Process the row (just printing in this example)
  print(selected_row)
}
#>   PaperCode InternalMark ExamMark
#> 1   HUBS192           87       93
#>   PaperCode InternalMark ExamMark
#> 2   ECOL212           85       84
#>   PaperCode InternalMark ExamMark
#> 3   STAT260           62       85

Note that if we made our data frame longer than three rows by adding more papers to it, the preceding for loop would not print all the rows; the sequence in that example is always 1, 2, 3 so we see only rows 1, 2, and 3. We can make the loop more general by determining the for loop sequence dynamically using function nrow, which accepts a data frame as an argument, and returns the number of rows in the data frame:

# Display all rows
for (index in 1:nrow(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_row <- marks_df[index, ]
  print(selected_row)
}
#>   PaperCode InternalMark ExamMark
#> 1   HUBS192           87       93
#>   PaperCode InternalMark ExamMark
#> 2   ECOL212           85       84
#>   PaperCode InternalMark ExamMark
#> 3   STAT260           62       85

# Add a row to the data frame
marks_df <- rbind(marks_df, data.frame(PaperCode = "ZOOL316", 
                                       InternalMark = 83, 
                                       ExamMark = 90))


# Repeat the loop -- see four rows
for (index in 1:nrow(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_row <- marks_df[index, ]
  print(selected_row)
}
#>   PaperCode InternalMark ExamMark
#> 1   HUBS192           87       93
#>   PaperCode InternalMark ExamMark
#> 2   ECOL212           85       84
#>   PaperCode InternalMark ExamMark
#> 3   STAT260           62       85
#>   PaperCode InternalMark ExamMark
#> 4   ZOOL316           83       90
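
One caveat about the 1:nrow() pattern is worth sketching. If a data frame has no rows, 1:0 counts down and yields two elements, so the loop body would run twice. The base R function seq_len avoids this. The data frame empty_df below is a hypothetical example, not part of the original material.

```r
# A data frame with the same columns as marks_df but no rows (hypothetical)
empty_df <- data.frame(PaperCode = character(0),
                       InternalMark = numeric(0),
                       ExamMark = numeric(0))

# 1:nrow() counts DOWN from 1 to 0, giving two elements
1:nrow(empty_df)
#> [1] 1 0

# seq_len() gives an empty sequence, so the loop body never runs
seq_len(nrow(empty_df))
#> integer(0)

passes <- 0
for (index in seq_len(nrow(empty_df)))
{
  passes <- passes + 1
}
passes
#> [1] 0
```

For data frames that always contain at least one row, 1:nrow() and seq_len(nrow()) produce the same sequence.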


Exercise

  1. Write a for loop that iterates over marks_df printing each column in turn.
  2. Your final mark in each paper is computed as 40% of your internal mark plus 60% of your exam mark. Using the technique of your choice, add a new column to data frame marks_df which contains the computed final mark for each paper.
  3. Run your for loop again and confirm that it displays all the columns, including the new one.
# Display all columns
for (index in 1:ncol(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_col <- marks_df[ , index]
  print(selected_col)
}
#> [1] "HUBS192" "ECOL212" "STAT260" "ZOOL316"
#> [1] 87 85 62 83
#> [1] 93 84 85 90


# Add a new column
marks_df$TotalMark <- (0.40 * marks_df$InternalMark) + (0.60 * marks_df$ExamMark) 


# Confirm that we display all four columns
for (index in 1:ncol(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_col <- marks_df[ , index]
  print(selected_col)
}
#> [1] "HUBS192" "ECOL212" "STAT260" "ZOOL316"
#> [1] 87 85 62 83
#> [1] 93 84 85 90
#> [1] 90.6 84.4 75.8 87.2
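
The solution above computes the new column in a single vectorised expression. As one possible alternative, the same values can be built up with a for loop. This sketch uses standalone copies of the marks (the names internal, exam, and final_marks are illustrative) so that it runs on its own without changing marks_df.

```r
# Standalone copies of the four papers' marks
internal <- c(87, 85, 62, 83)
exam <- c(93, 84, 85, 90)

# Start with a vector of the right length to hold the results
final_marks <- numeric(length(internal))

for (index in seq_along(internal))
{
  # 40% of the internal mark plus 60% of the exam mark
  final_marks[index] <- (0.40 * internal[index]) + (0.60 * exam[index])
}

final_marks
#> [1] 90.6 84.4 75.8 87.2
```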


Nested for loops

In the preceding examples, we have iterated over a data frame processing complete rows, or complete columns. Frequently, it is necessary to iterate over a data frame (or matrix) processing each individual cell in turn. That is, instead of using the [, col] or [row, ] forms of selection, we need to specify both a row index and a column index (i.e. [row, col]). This is a very common processing technique in simulation, computer graphics, artificial intelligence, and computational problems requiring matrix algebra.

We iterate over tabular data in an orderly fashion. For example, if we have a 3x3 matrix or data frame we would process the cells in the first row from left to right ([1,1], [1,2], [1,3]), then the cells in the second row ([2,1], [2,2], [2,3]), and finally the cells in the third row ([3,1], [3,2], [3,3]). You can view this pattern as using two loop drivers, one for the row index and one for the column index. While the row driver is 1, we want the column driver to loop through values 1, 2, and 3. Then, we want the row driver to take on 2, and again want the column driver to loop through 1, 2, and 3. Finally we want the row driver to be 3, and the column driver to loop again. We can achieve this by nesting a for loop for columns inside a for loop for rows, as shown below. Note that each for loop has its own loop driver, its own sequence and its own round and curly brackets. The style of indenting the inner for loop is important for maintaining code readability.

# Outer loop.
for (row_index in 1:nrow(marks_df))
{
  # Inner loop -- makes all iterations for each pass through outer loop
  for (col_index in 1:ncol(marks_df))
  {
    # Use both loop drivers to select
    cell_value <- marks_df[row_index, col_index]
    print(cell_value)
  }
}
#> [1] "HUBS192"
#> [1] 87
#> [1] 93
#> [1] 90.6
#> [1] "ECOL212"
#> [1] 85
#> [1] 84
#> [1] 84.4
#> [1] "STAT260"
#> [1] 62
#> [1] 85
#> [1] 75.8
#> [1] "ZOOL316"
#> [1] 83
#> [1] 90
#> [1] 87.2
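
To make the visiting order explicit, the following sketch prints the index pair for each cell of a 3x3 table rather than the cell's contents. The variable name cell_label is illustrative.

```r
for (row_index in 1:3)
{
  for (col_index in 1:3)
  {
    # Build a label such as "[1,2]" from the two loop drivers
    cell_label <- paste0("[", row_index, ",", col_index, "]")
    print(cell_label)
  }
}
#> [1] "[1,1]"
#> [1] "[1,2]"
#> [1] "[1,3]"
#> [1] "[2,1]"
#> [1] "[2,2]"
#> [1] "[2,3]"
#> [1] "[3,1]"
#> [1] "[3,2]"
#> [1] "[3,3]"
```

The column driver completes a full cycle for every single value of the row driver.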


Exercise

The preceding example prints row-wise. That is, it prints all the values for each row (i.e. for a single paper) before moving to the next row (paper). Modify the code to print column-wise. That is, print down the columns: all the paper names, then all the internal marks, then all the exam marks, then all the final marks.

# Outer loop.
for (col_index in 1:ncol(marks_df))
{
  # Inner loop -- makes all iterations for each pass through outer loop
  for (row_index in 1:nrow(marks_df))
  {
    # Use both loop drivers to select
    cell_value <- marks_df[row_index, col_index]
    print(cell_value)
  }
}
#> [1] "HUBS192"
#> [1] "ECOL212"
#> [1] "STAT260"
#> [1] "ZOOL316"
#> [1] 87
#> [1] 85
#> [1] 62
#> [1] 83
#> [1] 93
#> [1] 84
#> [1] 85
#> [1] 90
#> [1] 90.6
#> [1] 84.4
#> [1] 75.8
#> [1] 87.2


CAUTION

Using a nested for loop to visit every cell in a table is a very common, and very powerful code pattern. However, be aware that it can be computationally expensive (i.e. it can take a long time to run). If your outer loop runs n passes and your inner loop runs m passes, you make a total of n * m passes. When processing a 1000 row x 1000 column table with a nested for loop, the code body of the inner loop is executed one million times, which may be unacceptably slow if the code body does a lot of work. If you have large tabular data sets that take a long time to process, you should consider leveraging R’s vectorised operations, which can be more efficient than an inner for loop. See, for example, https://rstudio-pubs-static.s3.amazonaws.com/72295_692737b667614d369bd87cb0f51c9a4b.html or http://www.john-ros.com/Rcourse/memory.html for discussions.
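
As a minimal sketch of the difference, the code below totals every cell of a small matrix twice: once with a nested for loop and once with a single call to the base R function sum. The matrix m and the variable names are hypothetical and chosen only for illustration.

```r
# A small illustrative matrix: 3 rows x 4 columns holding 1 to 12
m <- matrix(1:12, nrow = 3, ncol = 4)

# Nested for loop: visit every cell and accumulate a total
loop_total <- 0
for (row_index in 1:nrow(m))
{
  for (col_index in 1:ncol(m))
  {
    loop_total <- loop_total + m[row_index, col_index]
  }
}

# Vectorised: sum() processes the whole matrix in one call
vector_total <- sum(m)

loop_total
#> [1] 78
vector_total
#> [1] 78
```

On a table this small the difference in run time is negligible; the vectorised form is likely to matter more as the table grows.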

While loop

The for loop is used when you want to execute a code body a specific number of times, or iterate over a specific sequence of values. An alternative loop structure, the while loop, repeats as long as a given condition evaluates to true.

The schematic for a while loop is as follows:

while (condition) 
{
  # Loop body
}

In an earlier module, we used function rnorm to randomly select a value from a normal distribution with a known mean and standard deviation. Imagine that you want to explore the probability of randomly selecting a number from such a distribution that is more than two standard deviations above the mean (i.e. has a z-score greater than 2). You could estimate this by repeatedly selecting a number until you achieved the criterion, and counting the number of times you had to select. We wish to loop repeatedly over some logic (selecting and counting), but we don’t know exactly how many times the loop should run, so a for loop is not appropriate. In this situation, we use a while loop.

Before looking at the code sample below, try to work out what the while loop condition will be.

# Define the parameters
distribution_mean <- 100
distribution_sd <- 10

# Prepare a variable to count the passes
count <- 0

# Make your first selection so that the loop condition can be evaluated 
# on the first pass
rand_value <- rnorm(1, distribution_mean, distribution_sd)

# The loop. Note the condition. We continue running the loop
# as long as our selected value is less than mean + 2*sd.
# While loops run as long as the loop condition is true.
while (rand_value < distribution_mean + (2 * distribution_sd))
{
  # increment the count because the loop condition was true
  count <- count + 1
  
  # Select again
  rand_value <- rnorm(1, distribution_mean, distribution_sd)
  
} # end of while loop

# Display the result
output <- paste(count, "values were chosen before z-score > 2")
print(output)
#> [1] "8 values were chosen before z-score > 2"
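
A single run of the loop gives only one random count. As a rough cross-check, the exact probability can be obtained from the base R function pnorm, and the long-run average count follows from the mean of a geometric distribution. The names p_exceed and expected_count are illustrative, and this is a sketch rather than part of the original exercise.

```r
# Probability that a single draw has a z-score above 2
p_exceed <- pnorm(2, lower.tail = FALSE)
p_exceed
#> [1] 0.02275013

# Long-run average number of draws below the threshold
# before the first draw above it (roughly 43)
expected_count <- (1 - p_exceed) / p_exceed
```

Individual runs of the while loop can differ a great deal from this average, which is why a single result should not be read as an estimate of the probability.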

Here is an example of a while loop to count how many iterations it takes to obtain 3 heads in a row:

## Example from R for Data Science - 21.3.4 ##

# Function to simulate head or tail as the result of a coin flip
flip <- function(){
  sample(c("T", "H"), 1)
}

# variables to keep track of key results
flips <- 0
nheads <- 0

# flip a coin until there are three heads and count how many flips were performed.
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
#> [1] 15


Infinite while loops

NB: It is essential that your while loop condition will eventually evaluate to false. Consider the following code sample (which we will not run).

# Set a starting value so we can check the condition on the first pass
x <- 10

# The loop
while (x > 0)
{
  # The code body
  x <- x + 1
}

The variable x is initialised to 10, and incremented at each pass through the code body. Its value will therefore always be greater than or equal to 10. The loop condition is x > 0. Since x starts at 10 and increases at each pass, it will always be greater than 0, so the loop condition will always be true, and the loop will never stop. This is an infinite loop. When your code is in an infinite loop, the only way to stop it is to forcibly interrupt it (assuming you recognise what has occurred). In RStudio you can usually do this by clicking the red stop sign in the top right corner of the console window. An infinite loop that keeps creating or growing variables can consume memory to the point where the machine crashes. Always check that your loop condition is guaranteed to eventually evaluate to false before running code with a while loop.
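
One defensive pattern, sketched here under the assumption that an upper bound on the number of passes is acceptable, is to add a pass counter to the loop condition. The names passes and max_passes, and the limit of 1000, are hypothetical.

```r
x <- 10
passes <- 0
max_passes <- 1000   # hypothetical safety limit

# The second condition guarantees that the loop eventually stops
while (x > 0 && passes < max_passes)
{
  x <- x + 1
  passes <- passes + 1
}

passes
#> [1] 1000
x
#> [1] 1010
```

The loop still contains the original logic error, but it now ends after 1000 passes instead of running forever, which makes the mistake easier to notice and investigate.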

Map

In the package purrr (part of package tidyverse), there is a collection of map functions which iterate over a vector or list, applying a function to each element. This is a very succinct syntax, which achieves the same result as calling the function inside a for loop, without the overhead of writing out the loop structure.

library(purrr)

farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# map applies the function (2nd argument) to each element of the vector (1st argument) 
# and returns the results as a list. 
my_temps_c_list <- map(my_temps_f, farenheit_to_celcius)
my_temps_c_list
#> [[1]]
#> [1] 32.22222
#> 
#> [[2]]
#> [1] 25.55556
#> 
#> [[3]]
#> [1] 31.11111
#> 
#> [[4]]
#> [1] 31.66667
#> 
#> [[5]]
#> [1] 25

The map argument names are .x and .f, so the call to map above could also be written as my_temps_c_list <- map(.x = my_temps_f, .f = farenheit_to_celcius)

Note that when providing a function as an argument, give only the function name. Do not follow the function name with () as for a function call.
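
For comparison, here is a sketch of roughly what the call to map does, written as an explicit for loop in base R. The name my_temps_c_loop is illustrative; the function and input vector are repeated so that the block runs on its own.

```r
farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# Prepare an empty list with one slot per input value
my_temps_c_loop <- vector("list", length(my_temps_f))

for (index in seq_along(my_temps_f))
{
  # Apply the function to one element and store the result in the list
  my_temps_c_loop[[index]] <- farenheit_to_celcius(my_temps_f[index])
}

my_temps_c_loop[[5]]
#> [1] 25
```

The loop needs a pre-built results list and explicit indexing; map handles both behind the scenes.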

Map and friends

The basic form of map above returns the results in a list. There are suffix versions of map that return the results as a specific data type.

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.

These suffix versions will give an error if the data type of the results doesn’t match the intended return type. This is useful because you can write code to process the results further, confident that they are of a specific data type.
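
The following sketch illustrates both behaviours, assuming package purrr is installed. The exact wording of the error differs between purrr versions, so it is caught with tryCatch and replaced by a short illustrative string.

```r
library(purrr)

farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# map_dbl() returns a plain double vector instead of a list
my_temps_c <- map_dbl(my_temps_f, farenheit_to_celcius)
my_temps_c
#> [1] 32.22222 25.55556 31.11111 31.66667 25.00000

# map_int() expects whole-number results; these are not, so it stops with an error
result <- tryCatch(map_int(my_temps_f, farenheit_to_celcius),
                   error = function(e) "error")
result
#> [1] "error"
```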

Apply

Base R has a set of built-in functions that provide behaviour similar to purrr::map and its suffix functions. These are the apply family: apply, lapply, sapply, mapply, and tapply. The functions differ primarily in the structure of the data they return. See, for example, http://adv-r.had.co.nz/Functionals.html for more detail.

Function lapply is analogous to purrr::map():

farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f -32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# lapply example
lapply_my_temps_c <- lapply(X = my_temps_f, FUN = farenheit_to_celcius)
lapply_my_temps_c
#> [[1]]
#> [1] 32.22222
#> 
#> [[2]]
#> [1] 25.55556
#> 
#> [[3]]
#> [1] 31.11111
#> 
#> [[4]]
#> [1] 31.66667
#> 
#> [[5]]
#> [1] 25

Note that some of the apply family return different data structures depending on the type of the input data. This can make it challenging to know beforehand what the output is going to look like, unlike the suffix versions of purrr::map.
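
A small sketch of this behaviour, using sapply and the stricter base R function vapply (which is not in the list above, but belongs to the same family). The helper double_it is hypothetical.

```r
double_it <- function(x) x * 2   # hypothetical helper for illustration

# With non-empty input, sapply() simplifies the results to a numeric vector
sapply(c(1, 2, 3), double_it)
#> [1] 2 4 6

# With empty input, sapply() returns an empty list instead
sapply(numeric(0), double_it)
#> list()

# vapply() is told the expected type of each result, so the output type is fixed
vapply(c(1, 2, 3), double_it, FUN.VALUE = numeric(1))
#> [1] 2 4 6
vapply(numeric(0), double_it, FUN.VALUE = numeric(1))
#> numeric(0)
```

In this respect vapply plays a role similar to the suffix versions of purrr::map: the caller states the expected result type up front.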

Conclusion

In this module we covered using for and while loops to repeat code fragments efficiently. We saw that functions in the map and apply families provide an equivalent, yet more succinct syntax, when the repeated code is a single function. Combining loops with functions and conditional flow of control from last week’s session, you can create code that is modular, reusable, and maintainable.

What’s Next

Next week we conclude the formal content of R4SSP with a discussion of R workflows, including project structure, incremental development, and effective debugging techniques.

Please fill in the module feedback form https://tinyurl.com/r4ssp-module-fb.

---
title: "Repeating Code"
date: "Semester 2, 2022"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    code_download: true
    code_folding: show
---

```{r setup, include=FALSE}
library(knitr)

knitr::opts_chunk$set(
  comment = "#>",
  fig.path = "figures/08/", # use only for single Rmd files
  collapse = TRUE,
  echo = TRUE
)

# keep the resulting html the same since sampling is used later on in the lesson
set.seed(42)

```

> #### Associated Material
>
> Zoom Notes: [Zoom Notes 08 - Repeating Code](zoom_notes_08.html)
>
> Readings:
>
> - [R for Data Science - Chapter 21](https://r4ds.had.co.nz/iteration.html)

\

# Introduction

In the previous module we learned to make our code more **modular** by organising it into functions, each of which encapsulates some logically distinct task. By repeatedly calling a function, we can execute the same code many times, while only having to type it out once. This is **code reuse** and is an important goal in efficient software development.

We can also achieve code reuse through **iteration**. Iterative constructs (more casually called **loops**) allow us to write a code block once, then instruct R to execute it a specified number of times. The number of iterations can be made to vary depending on the **state**. (Recall that **state** is the set of values of all the variables in the environment when a command is executed.) For example, we can write a loop that iterates as many times as there are rows in a data frame that we read from a file before the loop starts.

In R, as in most modern programming languages, there are different types of loops, with subtle differences in behaviour. We begin with the most general kind of loop -- the **for loop**.


# Syntax

In R a **for loop** has the following structure:


| **for** (*variable_name* **in** *some_kind_of_sequence*)
| {
|     *Code block to be repeated. Can be as long as required.*
| }

\

The keywords **for** and **in** are required, as are the round brackets in the **loop header**, and the curly brackets that delineate the code body.

The *some_kind_of_sequence* is usually a vector. The code body is executed as many times as there are elements in the vector.

We will discuss the role of the *variable_name* element later in the module.

(It is also syntactically acceptable to place the opening curly bracket on the same line as the for loop header, after the closing round bracket, separated by a space.)


# Basic for loop

A popular cheer in Australian sports is "Aussie!, Aussie!, Aussie!". Assume that (for some inexplicable reason) you wished to print this cheer. That is, you want to print "Aussie!" to the console three times. Using only sequential code you would write:

```{r sequential, eval=FALSE}
print("Aussie!")
print("Aussie!")
print("Aussie!")
```

You are executing the identical line of code three times. Having to type the same line of code multiple times is boring, and increases the number of opportunities for typos and bugs to sneak in. (In this toy example, we are only repeating one line of code three times; in a real computational situation we might need to repeat dozens of lines of code hundreds of times.) Using a **for loop**, we can achieve the same output while only typing the command once.

```{r basic loop}
for (cheer in c(1,2,3))
{
  print("Aussie")
}
```

Match the parts of this loop to the schematic, noting the position of the keywords **for** and **in**, the round brackets, and the curly brackets. Note the *some_kind_of_sequence* element which, in this case, is the vector created by `c(1,2,3)`. Since there are three elements in the vector, the code body is executed three times. 

## Vector sequences

If we wanted to print "Aussie!" 5 times, we could increase the length of the vector to 5. If we wanted to print it 100 times (could happen) we could increase the length of the vector to 100, but using function `c()` for this is too cumbersome. In R, we have an alternative for generating a sequence of numbers -- the `:` operator.

```{r seq }
small_seq <- 1:5
small_seq

big_seq <- 1:50
big_seq
```

There are more powerful R functions for generating sequences (see, for example, `seq` and `rep`) but for basic for loops, the `:` operator is sufficient. We can extend our cheer-printing for loop:

```{r long cheer}
for (cheer in 1:10)
{
  print("Aussie")
}
```

\

# The Loop Driver

Recall the schematic for a for loop in R:

| **for** (*variable_name* **in** *some_kind_of_sequence*)
| {
|     *Code to be repeated. Can be as long as required.*
| }

\
The *variable_name* element can be any legal R variable name. This element is called the **loop driver**. Inside the code body of the for loop, the loop driver variable **is always available**. For example, in our "Aussie!, Aussie!, Aussie!" example, our loop driver was called *cheer*. In our code body we could have referred to variable `cheer`. 

As discussed above, the for loop code body is executed as many times as there are elements in the sequence in the header. At each pass through the loop, the loop driver variable automatically takes on **the value of the corresponding element of the sequence**. That is, in the first pass of the loop, the loop driver holds the value of the first element of the sequence; in the second pass of the loop, the loop driver holds the value of the second element of the sequence, and so on.

We illustrate this by extending our previous example to print the value of loop driver `cheer` inside the code body:

```{r long cheer 2}
for (cheer in 1:3)
{
  print("Aussie")
  print(cheer)
}
```

Note that you **never assign a value to the loop driver** -- it takes on its values automatically as the for loop runs.

In all our examples so far, the sequence in the for loop header has been numbers from 1 to *n*. In R, the sequence can be a vector of any type, and the loop driver will take on whatever values the sequence contains. For example, we can drive a for loop with a vector of strings:

```{r seq of string}
sports <- c("Rugby", "Cycling", "Ice Skating")
for (current_sport in sports)
{
  # Print the sport name
  print(current_sport)
  
  # Compute the number of characters in the sport name
  print(nchar(current_sport))
}
  
```

\

## Loop driver as index

Throughout the course we have seen that R relies heavily on managing data in ordered collections (vectors, lists, and data frames). We select elements from those collections using the selection operator `[]`, providing the ordinal position of the element of interest. Since loop driver variables take on each value in their sequence, we can use a for loop to step through each element of a collection in turn.

For example, assume you have a data frame containing one student's internal and exam marks for the papers taken in one semester: 

```{r make data frame}
paper_codes <- c("HUBS192", "ECOL212", "STAT260")
internal_marks <- c(87, 85, 62)
exam_marks <- c(93, 84, 85)

marks_df <- data.frame(PaperCode = paper_codes,
                       InternalMark = as.numeric(internal_marks),
                       ExamMark = as.numeric(exam_marks))

marks_df

```

We know that we can select all columns from a single row in that data frame using the syntax [*row number*,  ]

```{r select row}
# Select and print row 2
selected_row <- marks_df[2, ]
print(selected_row)

```

With a for loop, we can use the loop driver to iterate over all the rows, processing each in turn:

```{r simple driver as index}
for (index in 1:3)
{
  # Use the loop driver to select a row from the data frame
  selected_row <- marks_df[index, ]
  
  # Process the row (just printing in this example)
  print(selected_row)
}

```

Note that if we made our data frame longer than three rows by adding more papers to it, the preceding for loop **would not** print all the rows; the sequence in that example is always `1, 2, 3` so we see only rows 1, 2, and 3. We can make the loop more general by determining the for loop sequence **dynamically** using function `nrow`, which accepts a data frame as an argument, and returns the number of rows in the data frame:

```{r using nrow}
# Display all rows
for (index in 1:nrow(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_row <- marks_df[index, ]
  print(selected_row)
}

# Add a row to the data frame
marks_df <- rbind(marks_df, data.frame(PaperCode = "ZOOL316", 
                                       InternalMark = 83, 
                                       ExamMark = 90))


# Repeat the loop -- see four rows
for (index in 1:nrow(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_row <- marks_df[index, ]
  print(selected_row)
}
```

\

# Exercise

1. Write a for loop that iterates over marks_df printing **each column** in turn.
2. Your final mark in each paper is computed as 40% of your internal mark plus 60% of your exam mark. Using the technique of your choice, add a new column to data frame marks_df which contains the computed final mark for each paper.
3. Run your for loop again and confirm that it displays *all* the columns, including the new one.

```{r ex 1 solution}
# Display all columns
for (index in 1:ncol(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_col <- marks_df[ , index]
  print(selected_col)
}


# Add a new column
marks_df$TotalMark <- (0.40 * marks_df$InternalMark) + (0.60 * marks_df$ExamMark) 


# Confirm that we display all four columns
for (index in 1:ncol(marks_df))
{
  # Use the loop driver to index into the data frame
  selected_col <- marks_df[ , index]
  print(selected_col)
}
```

\

# Nested for loops

In the preceding examples, we have iterated over a data frame processing complete rows, or complete columns. Frequently, it is necessary to iterate over a data frame (or matrix) processing each *individual cell* in turn. That is, instead of using the `[, col]` or `[row, ]` forms of selection, we need to specify both a row index and a column index (i.e. `[row, col]`). This is a very common processing technique in simulation, computer graphics, artificial intelligence, and computational problems requiring matrix algebra.

We iterate over tabular data in an orderly fashion. For example, if we have a 3x3 matrix or data frame we would process the cells in the first row from left to right ([1,1], [1,2], [1,3]), then the cells in the second row ([2,1], [2,2], [2,3]), and finally the cells in the third row ([3,1], [3,2], [3,3]). You can view this pattern as using **two loop drivers**, one for the row index and one for the column index. While the row driver is 1, we want the column driver to loop through values 1, 2, and 3. Then, we want the row driver to take on 2, and again want the column driver to loop through 1, 2, and 3. Finally we want the row driver to be 3, and the column driver to loop again. We can achieve this by **nesting** a for loop for columns **inside** a for loop for rows, as shown below. Note that each for loop has its own loop driver, its own sequence and its own round and curly brackets. The style of indenting the **inner for loop** is important for maintaining code readability.

```{r nested for row-wise}
# Outer loop.
for (row_index in 1:nrow(marks_df))
{
  # Inner loop -- makes all iterations for each pass through outer loop
  for (col_index in 1:ncol(marks_df))
  {
    # Use both loop drivers to select
    cell_value <- marks_df[row_index, col_index]
    print(cell_value)
  }
}
```
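
To make the visiting order concrete, here is a minimal sketch that uses a small 3x3 matrix instead of `marks_df`. The names `demo_matrix` and `visit_order` are hypothetical and introduced only for this illustration. Rather than printing cell values, the loop records a `[row,col]` label for each cell as it is visited, so you can see that the column driver completes a full cycle for every value of the row driver.

```r
# Hypothetical 3x3 matrix, filled row by row with the values 1 to 9
demo_matrix <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Character vector to record the order in which cells are visited
visit_order <- character(0)

# Outer loop over rows
for (row_index in 1:nrow(demo_matrix))
{
  # Inner loop over columns
  for (col_index in 1:ncol(demo_matrix))
  {
    # Build a label such as "[1,2]" for the current cell
    cell_label <- paste0("[", row_index, ",", col_index, "]")
    visit_order <- c(visit_order, cell_label)
  }
}

print(visit_order)
```

The first three labels recorded are `[1,1]`, `[1,2]`, `[1,3]`, and nine labels are recorded in total (3 rows x 3 columns).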

\

## Exercise
The preceding example prints *row-wise*. That is, it prints all the values for each row (i.e. for a single paper) before moving to the next row (paper). Modify the code to print *column-wise*. That is, print down the columns: all the paper names, then all the internal marks, then all the exam marks, then all the total marks.

```{r nested for col-wise}
# Outer loop.
for (col_index in 1:ncol(marks_df))
{
  # Inner loop -- makes all iterations for each pass through outer loop
  for (row_index in 1:nrow(marks_df))
  {
    # Use both loop drivers to select
    cell_value <- marks_df[row_index, col_index]
    print(cell_value)
  }
}
```

\

### CAUTION

Using a nested for loop to visit every cell in a table is a very common and very powerful code pattern. However, be aware that it can be computationally expensive (i.e. it can take a long time to run). If your outer loop runs `n` passes and your inner loop runs `m` passes, the code body of the inner loop is executed a total of `n * m` times. When processing a 1000 row x 1000 column table with a nested for loop, that is one million executions, which may be noticeably slow, and the cost grows quickly as the table gets larger. If you have large tabular data sets that take a long time to process, you should consider leveraging R's vector processing techniques, which can be more efficient than an inner for loop. See, for example, https://rstudio-pubs-static.s3.amazonaws.com/72295_692737b667614d369bd87cb0f51c9a4b.html or http://www.john-ros.com/Rcourse/memory.html for discussions.
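
As a rough illustration of the vectorised alternative, the sketch below computes a weighted total mark twice: once with an explicit for loop over rows, and once with a single vectorised expression. The data frame `example_df` and its values are made up for this illustration; the 40/60 weights are the ones used for `TotalMark` earlier in the module.

```r
# Hypothetical data, invented for this illustration only
example_df <- data.frame(
  InternalMark = c(70, 55, 82),
  ExamMark     = c(60, 65, 90)
)

# Loop version: one pass per row, filling a pre-allocated numeric vector
loop_total <- numeric(nrow(example_df))
for (row_index in 1:nrow(example_df))
{
  loop_total[row_index] <- (0.40 * example_df$InternalMark[row_index]) +
    (0.60 * example_df$ExamMark[row_index])
}

# Vectorised version: the arithmetic is applied to whole columns at once
vector_total <- (0.40 * example_df$InternalMark) + (0.60 * example_df$ExamMark)

# Both approaches should give the same values
print(all.equal(loop_total, vector_total))
```

On a three-row table the difference in speed is negligible. The point of the sketch is only that the vectorised form expresses the same calculation without an explicit loop.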

# While loop

The **for loop** is used when you want to execute a code body a specific number of times, or iterate over a specific sequence of values. An alternative loop structure, the **while loop**, repeats as long as a given condition evaluates to true.

The schematic for a `while` loop is as follows:

```{r, eval = FALSE}
while (condition) 
{
  # Loop body
}
```
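
Before moving to a realistic example, here is a minimal sketch of the schematic filled in with a simple, deterministic condition. The variable names `value` and `halvings` are hypothetical. The loop keeps halving a number for as long as it is at least 1, and counts how many halvings were needed.

```r
# Starting value and a counter for the number of passes
value <- 100
halvings <- 0

# Keep looping while the condition is true
while (value >= 1)
{
  # Each pass moves value closer to making the condition false
  value <- value / 2
  halvings <- halvings + 1
}

print(halvings)
```

Starting from 100, the value drops below 1 after 7 halvings (100, 50, 25, 12.5, 6.25, 3.125, 1.5625, 0.78125), at which point the condition is false and the loop stops.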

In an earlier module, we used function `rnorm` to randomly select a value from a normal distribution with a known mean and standard deviation. Imagine that you want to explore the probability of randomly selecting a number from such a distribution that is more than two standard deviations above the mean (i.e. has a z-score of 2 or more). You could estimate this by repeatedly selecting a number until you achieved the criterion, and counting the number of times you had to select. We wish to loop repeatedly over some logic (selecting and counting), but we don't know exactly how many times the loop should run, so a for loop is not appropriate. In this situation, we use a **while loop**.

**Before looking at the code sample below, try to work out what the while loop condition will be.**


```{r while loop with rnorm}
# Define the parameters
distribution_mean <- 100
distribution_sd <- 10

# Prepare a variable to count the passes
count <- 0

# Make your first selection so that the loop condition can be evaluated
# on the first pass
rand_value <- rnorm(1, distribution_mean, distribution_sd)

# The loop. Note the condition. We continue running the loop
# as long as our selected value is less than mean + 2*sd.
# While loops run as long as the loop condition is true.
while (rand_value < distribution_mean + (2 * distribution_sd))
{
  # increment the count because the loop condition was true
  count <- count + 1
  
  # Select again
  rand_value <- rnorm(1, distribution_mean, distribution_sd)
  
} # end of while loop

# Display the result
output <- paste(count, "values were chosen before one had a z-score >= 2")
print(output)

```

Here is an example of a `while` loop to count how many iterations it takes to obtain 3 heads in a row:

```{r, echo = TRUE, eval = TRUE}
## Example from R for Data Science - 21.3.4 ##

# Function to simulate head or tail as the result of a coin flip
flip <- function(){
  sample(c("T", "H"), 1)
}

# variables to keep track of key results
flips <- 0
nheads <- 0

# flip a coin until there are three heads in a row and count how many flips were performed.
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips

```

\

## Infinite while loops

NB: It is essential that your while loop condition will **eventually evaluate to false**. Consider the following code sample (which we will *not* run).

```{r infinite loop, eval = FALSE}
# Set a starting value so we can check the condition on the first pass
x <- 10

# The loop
while (x > 0)
{
  # The code body
  x <- x + 1
}

```

The variable x is initialised to 10, and incremented at each pass through the code body. Its value will therefore always be greater than or equal to 10. The loop condition is `x > 0`. Since x starts at 10 and increases at each pass, it will always be greater than 0, so the loop condition will always be true, and the loop will never stop. This is an **infinite loop**. When your code is in an infinite loop, the only way to stop it is to forcibly interrupt or terminate the program (assuming you recognise what has occurred). In RStudio you can usually do this by clicking the red stop sign in the top right corner of the console window. An infinite loop that creates or grows objects at each pass can consume memory to the point where the machine will crash. This is bad. Always check that your loop condition is guaranteed to eventually evaluate to false before running code with a while loop.
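
One common defensive pattern (a suggestion, not something this module requires) is to add a safety limit on the number of passes to the loop condition. The sketch below reuses the faulty loop from the sample above; the names `passes` and `max_passes`, and the limit of 1000, are assumptions made for this illustration.

```r
# Same starting value as the infinite loop example
x <- 10

# Hypothetical safety limit on the number of passes
passes <- 0
max_passes <- 1000

# The loop now stops when EITHER part of the condition becomes false
while (x > 0 && passes < max_passes)
{
  x <- x + 1
  passes <- passes + 1
}

# Report whether the loop stopped only because of the safety limit
if (passes >= max_passes)
{
  print("Stopped: safety limit reached before x > 0 became false")
}
```

The safety limit does not fix the underlying logic error, but it turns a loop that would never end into one that stops and reports a problem.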

# Map

In the package `purrr` (part of the tidyverse), there is a collection of `map` functions which iterate over a vector or list, applying a function to each element. This is a very succinct syntax, which achieves the same result as calling the function inside a for loop, without the overhead of writing out the loop structure.

```{r simple map}
library(purrr)

farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# map applies the function (2nd argument) to each element of the vector (1st argument) 
# and returns the results as a list. 
my_temps_c_list <- map(my_temps_f, farenheit_to_celcius)
my_temps_c_list

```

The `map` argument names are `.x` and `.f`, so the call to `map` above could also be written as `my_temps_c_list <- map(.x = my_temps_f, .f = farenheit_to_celcius)`.

Note that when providing a function as an argument, give **only the function name**. Do not follow the function name with `()` as you would for a function call.
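
For comparison, here is a sketch of the for loop that the `map` call stands in for. It uses base R only and repeats the function and data from the example above so that it can be run on its own. The name `loop_temps_c_list` is hypothetical.

```r
farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# Pre-allocate an empty list with one slot per input element
loop_temps_c_list <- vector("list", length(my_temps_f))

# Call the function once per element and store each result in the list
for (index in 1:length(my_temps_f))
{
  loop_temps_c_list[[index]] <- farenheit_to_celcius(my_temps_f[index])
}

loop_temps_c_list
```

The loop needs a pre-allocated list, a loop driver, and an explicit assignment for each element. The `map` call expresses the same work in a single line.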


## Map and friends

The basic form of `map` above returns the results in a list. There are suffix versions of `map` that return the results as a specific data type.

- `map()` makes a list.
- `map_lgl()` makes a logical vector.
- `map_int()` makes an integer vector.
- `map_dbl()` makes a double vector.
- `map_chr()` makes a character vector.

These suffix versions will give an error if the data type of the results doesn't match the intended return type. This is useful because you can write code to process the results further, confident that they are of a specific data type. 
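
The sketch below shows one suffix version in use, assuming `purrr` is installed. It repeats the temperature function and data from the earlier example. The use of `tryCatch` to capture the error, and the names `my_temps_c` and `mismatch`, are additions made for this illustration.

```r
library(purrr)

farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# map_dbl returns a double (numeric) vector rather than a list
my_temps_c <- map_dbl(my_temps_f, farenheit_to_celcius)
print(is.double(my_temps_c))

# The function returns doubles, so asking for a logical vector is a
# type mismatch. tryCatch captures the error so the script can continue.
mismatch <- tryCatch(
  map_lgl(my_temps_f, farenheit_to_celcius),
  error = function(e) "error"
)
print(mismatch)
```

Because `my_temps_c` is a plain numeric vector, it can be passed directly to functions such as `mean()` without first unpacking a list.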

# Apply

Base R has a set of built-in functions that provide behaviour similar to `purrr::map` and its suffix functions. These are the **apply family**: `apply`, `lapply`, `sapply`, `vapply`, `mapply`, and `tapply`. The functions differ primarily in the structure of the data they return. See, for example, http://adv-r.had.co.nz/Functionals.html for more detail.

Function `lapply` is analogous to `purrr::map()`:

```{r}
farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# lapply example
lapply_my_temps_c <- lapply(X = my_temps_f, FUN = farenheit_to_celcius)
lapply_my_temps_c

```

Note that some of the apply family return different data structures depending on the type of the input data. This can make it challenging to know beforehand what the output is going to look like, unlike the suffix versions of `purrr::map`.
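
A small sketch of this behaviour, using base R only: `sapply` simplifies its result to a numeric vector for ordinary input, but returns an empty list when the input is empty. Base R's `vapply` takes an extra `FUN.VALUE` argument describing the expected type of each result, so its return type is predictable. The variable names below are hypothetical.

```r
farenheit_to_celcius <- function(temp_f){
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

my_temps_f <- c(90, 78, 88, 89, 77)

# With ordinary input, sapply simplifies the results to a numeric vector
sapply_result <- sapply(my_temps_f, farenheit_to_celcius)
print(class(sapply_result))

# With empty input, sapply returns an empty list instead
sapply_empty <- sapply(numeric(0), farenheit_to_celcius)
print(class(sapply_empty))

# vapply is told to expect one numeric value per element,
# so it returns a numeric vector even when the input is empty
vapply_empty <- vapply(numeric(0), farenheit_to_celcius, FUN.VALUE = numeric(1))
print(class(vapply_empty))
```

In this respect `vapply` plays a role similar to `purrr::map_dbl`: the expected type of the result is stated up front.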

# Conclusion

In this module we covered using for and while loops to repeat code fragments efficiently. We saw that functions in the map and apply families provide an equivalent, yet more succinct syntax, when the repeated code is a single function. Combining loops with functions and conditional flow of control from last week's session, you can create code that is modular, reusable, and maintainable. 

# What's Next
Next week we conclude the formal content of R4SSP with a discussion of R workflows, including project structure, incremental development, and effective debugging techniques.


Please fill in the module feedback form [https://tinyurl.com/r4ssp-module-fb](https://tinyurl.com/r4ssp-module-fb).

 
