Functional programming in R

Iterating with functions

files <- c(
  "data/weird_data1.xlsx", 
  "data/weird_data2.xlsx", 
  "data/weird_data3.xlsx"
)

weird_data <- map(files, read_weird_excel) |> 
  bind_rows()

Your Turn 5

Write a function that returns the mean and standard deviation of a numeric vector.

Give the function a name

Find the mean and SD of `x`

Map your function to `measurements`

Your Turn 5

mean_sd <- function(x) {
  x_mean <- mean(x)
  x_sd <- sd(x)
  tibble(mean = x_mean, sd = x_sd)
}
  
map(measurements, mean_sd)

Your Turn 5

$blood_glucose
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  137.  6.84

$age
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  39.5  3.84

$heartrate
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  83.0  12.6

Three ways to pass functions to `map()`

pass directly to map()
use an anonymous function
use a lambda (\() or ~)

map(
  .x,
  mean,
  na.rm = TRUE
)

map(
  .x,
  function(.x) mean(.x, na.rm = TRUE)
)

map(
  .x,
  \(.x) mean(.x, na.rm = TRUE)
)

map(
  .x,
  ~ mean(.x, na.rm = TRUE)
)

map(
  gapminder, 
  \(.x) length(unique(.x))
)

$country
[1] 142

$continent
[1] 5

$year
[1] 12

$lifeExp
[1] 1626

$pop
[1] 1704

$gdpPercap
[1] 1704

Returning types

function	returns
`map()`	list
`map_chr()`	character vector
`map_dbl()`	double vector (numeric)
`map_int()`	integer vector
`map_lgl()`	logical vector
`map_dfc()`	data frame (by column)
`map_dfr()`	data frame (by row)
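
The suffix fixes the output type, and it also limits what the mapped function is allowed to return. A minimal sketch, assuming purrr is installed; `scores` is a made-up list for illustration, not one of the workshop datasets:

```r
library(purrr)

# Hypothetical data for illustration only
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# map() always returns a list
means_list <- map(scores, mean)

# map_dbl() returns a named double vector
means_dbl <- map_dbl(scores, mean)

# map_int() returns an integer vector; length() already returns integers
sizes <- map_int(scores, length)

# A typed variant errors instead of silently coercing
# when the function returns the wrong type
bad <- tryCatch(
  map_int(scores, \(.x) "not a number"),
  error = function(e) "error"
)
```

Here `bad` ends up as `"error"` because the lambda returns a string where `map_int()` expects an integer.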

Iterating with functions: revisited

files <- c(
  "data/weird_data1.xlsx", 
  "data/weird_data2.xlsx", 
  "data/weird_data3.xlsx"
)

weird_data <- map(files, read_weird_excel) |> 
  bind_rows()

Iterating with functions: revisited

files <- c("data/weird_data1.xlsx", "data/weird_data2.xlsx", "data/weird_data3.xlsx")
weird_data <- map_dfr(files, read_weird_excel)

Returning types

map_int(gapminder, \(.x) length(unique(.x)))

  country continent      year   lifeExp       pop gdpPercap 
      142         5        12      1626      1704      1704

Your Turn 6

Do the same as #4 above but return a vector instead of a list.

Your Turn 6

map_chr(diabetes, class)

         id        chol    stab.glu         hdl       ratio       glyhb 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
   location         age      gender      height      weight       frame 
"character"   "numeric" "character"   "numeric"   "numeric" "character" 
      bp.1s       bp.1d       bp.2s       bp.2d       waist         hip 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
   time.ppn 
  "numeric"

Your Turn 7

Check `diabetes` for any missing data.

Using the `\(.x) .f(.x)` shorthand, check each column for any missing values using `is.na()` and `any()`

Return a logical vector. Are any columns missing data? What happens if you don’t include `any()`? Why?

Try counting the number of missing values in each column, returning an integer vector

Your Turn 7

map_lgl(diabetes, \(.x) any(is.na(.x)))

      id     chol stab.glu      hdl    ratio    glyhb location      age 
   FALSE     TRUE    FALSE     TRUE     TRUE     TRUE    FALSE    FALSE 
  gender   height   weight    frame    bp.1s    bp.1d    bp.2s    bp.2d 
   FALSE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE 
   waist      hip time.ppn 
    TRUE     TRUE     TRUE

Your Turn 7

map_int(diabetes, \(.x) sum(is.na(.x)))

      id     chol stab.glu      hdl    ratio    glyhb location      age 
       0        1        0        1        1       13        0        0 
  gender   height   weight    frame    bp.1s    bp.1d    bp.2s    bp.2d 
       0        5        1       12        5        5      262      262 
   waist      hip time.ppn 
       2        2        3

`group_map()`

Apply a function to each group of a grouped tibble and return a list with one result per group

library(broom)
diabetes |> 
  group_by(gender) |>
  group_map(\(.x, ...) tidy(lm(weight ~ height, data = .x)))

`group_map()`

[[1]]
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -73.8     59.2       -1.25 0.214    
2 height          3.90     0.928      4.20 0.0000383

[[2]]
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -49.7     68.9      -0.722 0.471   
2 height          3.35     0.995     3.37  0.000945

`group_modify()`

Apply a function to each group of a grouped tibble and return a single grouped tibble

diabetes |> 
  group_by(gender) |> 
  group_modify(\(.x, ...) tidy(lm(weight ~ height, data = .x)))

# A tibble: 4 × 6
# Groups:   gender [2]
  gender term        estimate std.error statistic   p.value
  <chr>  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 female (Intercept)   -73.8     59.2      -1.25  0.214    
2 female height          3.90     0.928     4.20  0.0000383
3 male   (Intercept)   -49.7     68.9      -0.722 0.471    
4 male   height          3.35     0.995     3.37  0.000945
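
The difference between the two is easiest to see side by side. A minimal sketch, assuming dplyr is installed and using the built-in `mtcars` data rather than `diabetes`; the summary function is made up for illustration:

```r
library(dplyr)

by_cyl <- mtcars |> group_by(cyl)

# group_map(): returns a plain list with one element per group
as_list <- by_cyl |>
  group_map(\(.x, .key) tibble(mean_mpg = mean(.x$mpg)))

# group_modify(): binds the per-group results into one grouped tibble
# and puts the grouping column back in front
as_tbl <- by_cyl |>
  group_modify(\(.x, .key) tibble(mean_mpg = mean(.x$mpg)))
```

`as_list` has three elements (one per value of `cyl`), while `as_tbl` is a three-row grouped tibble with columns `cyl` and `mean_mpg`.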

Your Turn 8

Fill in the `model_lm` function to model `chol` (the outcome) with `ratio`, and pass the `.data` argument to `lm()`

Group `diabetes` by `location`

Use `group_modify()` with `model_lm`

Your Turn 8

model_lm <- function(.data, ...) {
  mdl <- lm(chol ~ ratio, data = .data) 
  # get model statistics
  glance(mdl)
}

diabetes |> 
  group_by(location) |> 
  group_modify(model_lm)

Your Turn 8

# A tibble: 2 × 13
# Groups:   location [2]
  location  r.squared adj.r.squared sigma statistic  p.value
  <chr>         <dbl>         <dbl> <dbl>     <dbl>    <dbl>
1 Buckingh…     0.252         0.248  38.8      66.4 4.11e-14
2 Louisa        0.204         0.201  39.4      51.7 1.26e-11
# ℹ 7 more variables: df <dbl>, logLik <dbl>, AIC <dbl>,
#   BIC <dbl>, deviance <dbl>, df.residual <int>, …

`map2(.x, .y, .f)`

.x, .y: a vector, list, or data frame

.f: a function that takes two arguments

Returns a list

map2()

means <- c(-3, 4, 2, 2.3)
sds <- c(.3, 4, 2, 1)
  
map2_dbl(means, sds, rnorm, n = 1)

[1] -3.0587804  1.4037210 -0.2195345  3.1492742
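
Because `rnorm()` is random, the pairing is easier to check with a deterministic function. A small sketch, assuming purrr is installed; `widths` and `heights` are made-up inputs:

```r
library(purrr)

# Hypothetical inputs for illustration only
widths <- c(2, 3, 4)
heights <- c(10, 20, 30)

# Elements are paired by position: (2, 10), (3, 20), (4, 30)
areas <- map2_dbl(widths, heights, \(w, h) w * h)
areas
# 20 60 120

# A length-1 input is recycled to match the other input
scaled <- map2_dbl(widths, 10, \(w, k) w * k)
scaled
# 20 30 40
```

For simple arithmetic like this, plain `widths * heights` does the same job; `map2()` is useful when the function is not vectorized, as with the models in the next exercise.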

Your Turn 9

Split the gapminder dataset into a list by country using the `split()` function

Create a list of models using `map()`. For the first argument, pass `gapminder_countries`. For the second, use the `\()` notation to write a model with `lm()`. Use `lifeExp` on the left-hand side of the formula and `year` on the right-hand side. Pass `.x` to the `data` argument.

Use `map2()` to take the models list and the data set list and map them to `predict()`. Since we’re not adding new arguments, you don’t need to use `\()`.

Your Turn 9

gapminder_countries <- split(gapminder, gapminder$country) 
models <- map(
  gapminder_countries, 
  \(.x) lm(lifeExp ~ year, data = .x)
)
preds <- map2(models, gapminder_countries, predict)
head(preds, 3)

Your Turn 9

$Afghanistan
       1        2        3        4        5        6        7        8 
29.90729 31.28394 32.66058 34.03722 35.41387 36.79051 38.16716 39.54380 
       9       10       11       12 
40.92044 42.29709 43.67373 45.05037 

$Albania
       1        2        3        4        5        6        7        8 
59.22913 60.90254 62.57596 64.24938 65.92279 67.59621 69.26962 70.94304 
       9       10       11       12 
72.61646 74.28987 75.96329 77.63671 

$Algeria
       1        2        3        4        5        6        7        8 
43.37497 46.22137 49.06777 51.91417 54.76057 57.60697 60.45337 63.29976 
       9       10       11       12 
66.14616 68.99256 71.83896 74.68536

input 1	input 2	returns
`map()`	`map2()`	list
`map_chr()`	`map2_chr()`	character vector
`map_dbl()`	`map2_dbl()`	double vector (numeric)
`map_int()`	`map2_int()`	integer vector
`map_lgl()`	`map2_lgl()`	logical vector
`map_dfc()`	`map2_dfc()`	data frame (by column)
`map_dfr()`	`map2_dfr()`	data frame (by row)

Other mapping functions

pmap() and friends: take a list of n inputs, or a data frame whose column names match the function's argument names

walk() and friends: for side effects like plotting; returns input invisibly

imap() and friends: also pass each element's index (its name, or its position if unnamed) to the function

map_if(), map_at(): Apply only to certain elements
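
A quick tour of these variants, as a sketch that assumes purrr is installed; all inputs are made up for illustration:

```r
library(purrr)

# pmap(): list names are matched to the arguments of rnorm()
params <- list(
  n = c(1, 2, 3),
  mean = c(0, 10, 100),
  sd = c(1, 1, 1)
)
draws <- pmap(params, rnorm)
lengths(draws)
# 1 2 3

# imap(): the second argument is the element's name
x <- list(a = 1, b = 2)
labels <- imap_chr(x, \(value, name) paste0(name, " = ", value))

# map_if(): only elements where the predicate is TRUE are transformed
mixed <- list(1, "a", 3)
doubled <- map_if(mixed, is.numeric, \(.x) .x * 2)

# walk(): called for its side effect (printing here);
# it returns its input invisibly, so `out` is the same as `x`
out <- walk(x, print)
```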

input 1	input 2	input n	returns
`map()`	`map2()`	`pmap()`	list
`map_chr()`	`map2_chr()`	`pmap_chr()`	character vector
`map_dbl()`	`map2_dbl()`	`pmap_dbl()`	double vector (numeric)
`map_int()`	`map2_int()`	`pmap_int()`	integer vector
`map_lgl()`	`map2_lgl()`	`pmap_lgl()`	logical vector
`map_dfc()`	`map2_dfc()`	`pmap_dfc()`	data frame (by column)
`map_dfr()`	`map2_dfr()`	`pmap_dfr()`	data frame (by row)
`walk()`	`walk2()`	`pwalk()`	input (side effects!)

`group_walk()`

Use `group_walk()` for side effects across groups

# fs helps us work with files
library(fs)
temp <- "temporary_folder"
dir_create(temp)
gapminder |>
  group_by(continent) |>
  group_walk( 
    \(.x, .key) write_csv( 
      .x,
      file = path(temp, paste0(.key$continent, ".csv"))
    )
  )

`group_walk()`

temporary_folder
├── Africa.csv
├── Americas.csv
├── Asia.csv
├── Europe.csv
└── Oceania.csv

Your turn 10

Create a new directory using the fs package. Call it “figures”.

Write a function to plot a line plot of a given variable in gapminder over time, faceted by continent. Then, save the plot (how do you save a ggplot?). For the file name, paste together the folder, name of the variable, and extension so it follows the pattern `"folder/variable_name.png"`

Create a character vector that has the three variables we’ll plot: “lifeExp”, “pop”, and “gdpPercap”.

Use `walk()` to save a plot for each of the variables

Your turn 10

dir_create("figures")

ggsave_gapminder <- function(variable) {
  # `variable` arrives as a string from walk(), so convert it to a symbol
  variable <- rlang::sym(variable)
  p <- ggplot(
    gapminder, 
    aes(x = year, y = !!variable, color = country)
  ) + 
    geom_line() + 
    scale_color_manual(values = country_colors) + 
    facet_wrap(~ continent) + 
    theme(legend.position = "none")
    
  ggsave(
    filename = paste0("figures/", variable, ".png"), 
    plot = p, 
    dpi = 320
  )
}

Your turn 10

vars <- c("lifeExp", "pop", "gdpPercap")
walk(vars, ggsave_gapminder)

Base R

base R	purrr
`lapply()`	`map()`
`vapply()`	`map_*()`
`sapply()`	?
`x[] <- lapply()`	`map_dfc()`
`mapply()`	`map2()`, `pmap()`

Benefits of purrr

Consistent
Type-safe
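
One way to see what "type-safe" means is to compare `sapply()`, whose output shape depends on the results, with purrr's fixed return types. A minimal sketch, assuming purrr is installed; `above()` is a hypothetical helper written for this example:

```r
library(purrr)

# Hypothetical helper: keep the values of x above a threshold
above <- function(x, threshold) x[x > threshold]

x <- list(c(1, 5, 9), c(2, 6, 10))

# sapply() returns a matrix when every result has the same length...
is.matrix(sapply(x, above, threshold = 4))
# TRUE

# ...but a list when the lengths differ
is.list(sapply(x, above, threshold = 5.5))
# TRUE

# map() returns a list either way
is.list(map(x, above, threshold = 4))
# TRUE

# map_dbl() insists on exactly one double per element and errors otherwise
res <- tryCatch(
  map_dbl(x, above, threshold = 4),
  error = function(e) "error"
)
```

With purrr, the output type is known from the function name alone, so unexpected results fail loudly instead of changing shape.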

Loops vs functional programming

x <- rnorm(10)
y <- map(x, mean)

Loops vs functional programming

x <- rnorm(10)
y <- vector("list", length(x))
for (i in seq_along(x)) {
  y[[i]] <- mean(x[[i]])
}

Of course someone has to write loops. It doesn’t have to be you. —Jenny Bryan