Parallel Processing – R Coding Club

This session was recorded and uploaded on YouTube here:

In this session, we covered how to do parallel processing using the {furrr} package. We also included the functions from the {tidyverse} package for convenience. We use the penguins dataset provided in the{palmerpenguins} package as example. So first, let’s load up our packages!

library(tidyverse)
library(furrr)
library(palmerpenguins)

Let’s first take a look at what are the variable names in the data:

penguins %>%
  names()

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

We then check how many penguins are in each specie in each sex.

penguins %>%
  count(species, sex)

# A tibble: 8 × 3
  species   sex        n
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      73
3 Adelie    <NA>       6
4 Chinstrap female    34
5 Chinstrap male      34
6 Gentoo    female    58
7 Gentoo    male      61
8 Gentoo    <NA>       5

Since we want to play around with parallel processing, possibly applying one function to different scenarios/settings/groups at the same time. We can build a function for linear regression with 2 variables, bill_length_mm, body_mass_g in the penguins dataset.

tidy_lm <- function(data) {
  results <- lm(bill_length_mm ~ body_mass_g,
    data = data
  )
  broom::tidy(results) %>%
    # This will make the result shown in tibble format
    mutate(
      sex = unique(data$sex),
      # A new column sex is added in front of the result tibble
      .before = everything()
    )
}

Then we use plan(multisession) to start a parallel processing.

plan(multisession)

We then apply this function to different sex groups using the future_map() function, which is specifically designed for parallel processing. In this case, what is being paralleled is male and female penguins.

penguins %>%
  # Remove the penguins that have NA values in "sex".
  drop_na(sex) %>%
  # Split the data to 2 groups according to "sex" and put the data in a list
  # object
  group_split(sex) %>%
  # Using the future_map() function to apply tidy_lm() to the 2 groups and
  # return the result as a list.
  future_map(tidy_lm) %>%
  # The result in the list are rbinded as one tibble.
  list_rbind()

# A tibble: 4 × 6
  sex    term        estimate std.error statistic  p.value
  <fct>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 female (Intercept) 25.6      1.84         13.9  1.70e-29
2 female body_mass_g  0.00428  0.000469      9.12 2.70e-16
3 male   (Intercept) 31.1      2.14         14.5  2.70e-31
4 male   body_mass_g  0.00325  0.000465      6.99 6.45e-11

To end the parallel processing, we use plan(sequential).

plan(sequential)

To do a sequential (non-parallel) processing, we use map() instead of future_map().

penguins %>%
  drop_na(sex) %>%
  group_split(sex) %>%
  map(tidy_lm) %>%
  list_rbind()

# A tibble: 4 × 6
  sex    term        estimate std.error statistic  p.value
  <fct>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 female (Intercept) 25.6      1.84         13.9  1.70e-29
2 female body_mass_g  0.00428  0.000469      9.12 2.70e-16
3 male   (Intercept) 31.1      2.14         14.5  2.70e-31
4 male   body_mass_g  0.00325  0.000465      6.99 6.45e-11

Warning

Note: Creating a multisession itself will take time. For simple functions, this time may exceed the time needed for running all the code sequiential. So it is better you only use parallel processing for time-comsuming tasks

Now let us try a more complex scenario with a new function using reformulate(), which is specifically designed for writing formulas

# Take argument of Y variable and confounders
tidy_lm_yvar <- function(yvar, confounders) {
  # The X variables in the linear regression
  model_formula <- reformulate(c("body_mass_g", confounders),
    # The Y variable in the linear regression
    response = yvar
  )
  results <- lm(model_formula,
    data = penguins
  )
  broom::tidy(results) %>%
    mutate(
      yvar = yvar,
      .before = everything()
    )
}

Try run the function with parallel processing, where 3 response variables (bill_length_mm, bill_depth_mm, flipper_length_mm) will be parallelized. Note that what is piped in the future_map() (separately/in parallel) will be used as the first argument of the function specified in it. If there is a second argument, in our case, confounder, we directly write it in the function.

# Start parallel processing
plan(multisession)

c("bill_length_mm", "bill_depth_mm", "flipper_length_mm") %>%
    # This is how it is written for the first argument if there is more than one
    # argument
  future_map(~ tidy_lm_yvar(.x,
    confounders = "sex"
  )) %>%
  # The "names_to" argument in the list_rbind() function add a new variable
  # called "model_ID", which establish the names of items in the original list
  # object
  list_rbind(names_to = "model_ID") %>%
  # Change the value in the variable model_ID
  mutate(model_ID = case_when(
    model_ID == 1 ~ "bill1",
    model_ID == 2 ~ "bill2",
    model_ID == 3 ~ "bill3"
  ))

# A tibble: 9 × 7
  model_ID yvar              term         estimate std.error statistic   p.value
  <chr>    <chr>             <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 bill1    bill_length_mm    (Intercept)  27.9     1.32          21.1  3.83e- 63
2 bill1    bill_length_mm    body_mass_g   0.00367 0.000331      11.1  1.51e- 24
3 bill1    bill_length_mm    sexmale       1.25    0.532          2.34 1.97e-  2
4 bill2    bill_depth_mm     (Intercept)  23.7     0.365         65.0  4.27e-190
5 bill2    bill_depth_mm     body_mass_g  -0.00188 0.0000912    -20.6  2.57e- 61
6 bill2    bill_depth_mm     sexmale       2.75    0.147         18.8  5.96e- 54
7 bill3    flipper_length_mm (Intercept) 135.      1.99          67.6  2.28e-195
8 bill3    flipper_length_mm body_mass_g   0.0162  0.000498      32.6  3.33e-105
9 bill3    flipper_length_mm sexmale      -3.96    0.801         -4.94 1.25e-  6

# Close parallel processing
plan(sequential)

Another way to change the name of "Model_ID".

c("bill_length_mm", "bill_depth_mm", "flipper_length_mm") %>%
  future_map(~ tidy_lm_yvar(.x,
    confounders = "sex"
  )) %>%
  list_rbind(names_to = "model_ID") %>%
  mutate(model_ID = str_replace(model_ID, "^", "bill"))

# A tibble: 9 × 7
  model_ID yvar              term         estimate std.error statistic   p.value
  <chr>    <chr>             <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 bill1    bill_length_mm    (Intercept)  27.9     1.32          21.1  3.83e- 63
2 bill1    bill_length_mm    body_mass_g   0.00367 0.000331      11.1  1.51e- 24
3 bill1    bill_length_mm    sexmale       1.25    0.532          2.34 1.97e-  2
4 bill2    bill_depth_mm     (Intercept)  23.7     0.365         65.0  4.27e-190
5 bill2    bill_depth_mm     body_mass_g  -0.00188 0.0000912    -20.6  2.57e- 61
6 bill2    bill_depth_mm     sexmale       2.75    0.147         18.8  5.96e- 54
7 bill3    flipper_length_mm (Intercept) 135.      1.99          67.6  2.28e-195
8 bill3    flipper_length_mm body_mass_g   0.0162  0.000498      32.6  3.33e-105
9 bill3    flipper_length_mm sexmale      -3.96    0.801         -4.94 1.25e-  6

Done!