This session was recorded and uploaded on YouTube here:
In this session, we covered how to do parallel processing using the {furrr}
package. We also included the functions from the {tidyverse}
package for convenience. We use the penguins dataset provided in the{palmerpenguins}
package as example. So first, let’s load up our packages!
Let’s first take a look at what are the variable names in the data:
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
We then check how many penguins are in each specie in each sex.
# A tibble: 8 × 3
species sex n
<fct> <fct> <int>
1 Adelie female 73
2 Adelie male 73
3 Adelie <NA> 6
4 Chinstrap female 34
5 Chinstrap male 34
6 Gentoo female 58
7 Gentoo male 61
8 Gentoo <NA> 5
Since we want to play around with parallel processing, possibly applying one function to different scenarios/settings/groups at the same time. We can build a function for linear regression with 2 variables, bill_length_mm, body_mass_g in the penguins dataset.
Then we use plan(multisession)
to start a parallel processing.
We then apply this function to different sex groups using the future_map()
function, which is specifically designed for parallel processing. In this case, what is being paralleled is male and female penguins.
penguins %>%
# Remove the penguins that have NA values in "sex".
drop_na(sex) %>%
# Split the data to 2 groups according to "sex" and put the data in a list
# object
group_split(sex) %>%
# Using the future_map() function to apply tidy_lm() to the 2 groups and
# return the result as a list.
future_map(tidy_lm) %>%
# The result in the list are rbinded as one tibble.
list_rbind()
# A tibble: 4 × 6
sex term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 female (Intercept) 25.6 1.84 13.9 1.70e-29
2 female body_mass_g 0.00428 0.000469 9.12 2.70e-16
3 male (Intercept) 31.1 2.14 14.5 2.70e-31
4 male body_mass_g 0.00325 0.000465 6.99 6.45e-11
To end the parallel processing, we use plan(sequential)
.
To do a sequential (non-parallel) processing, we use map()
instead of future_map()
.
# A tibble: 4 × 6
sex term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 female (Intercept) 25.6 1.84 13.9 1.70e-29
2 female body_mass_g 0.00428 0.000469 9.12 2.70e-16
3 male (Intercept) 31.1 2.14 14.5 2.70e-31
4 male body_mass_g 0.00325 0.000465 6.99 6.45e-11
Note: Creating a multisession itself will take time. For simple functions, this time may exceed the time needed for running all the code sequiential. So it is better you only use parallel processing for time-comsuming tasks
Now let us try a more complex scenario with a new function using reformulate()
, which is specifically designed for writing formulas
# Take argument of Y variable and confounders
tidy_lm_yvar <- function(yvar, confounders) {
# The X variables in the linear regression
model_formula <- reformulate(c("body_mass_g", confounders),
# The Y variable in the linear regression
response = yvar
)
results <- lm(model_formula,
data = penguins
)
broom::tidy(results) %>%
mutate(
yvar = yvar,
.before = everything()
)
}
Try run the function with parallel processing, where 3 response variables (bill_length_mm
, bill_depth_mm
, flipper_length_mm
) will be parallelized. Note that what is piped in the future_map()
(separately/in parallel) will be used as the first argument of the function specified in it. If there is a second argument, in our case, confounder, we directly write it in the function.
# Start parallel processing
plan(multisession)
c("bill_length_mm", "bill_depth_mm", "flipper_length_mm") %>%
# This is how it is written for the first argument if there is more than one
# argument
future_map(~ tidy_lm_yvar(.x,
confounders = "sex"
)) %>%
# The "names_to" argument in the list_rbind() function add a new variable
# called "model_ID", which establish the names of items in the original list
# object
list_rbind(names_to = "model_ID") %>%
# Change the value in the variable model_ID
mutate(model_ID = case_when(
model_ID == 1 ~ "bill1",
model_ID == 2 ~ "bill2",
model_ID == 3 ~ "bill3"
))
# A tibble: 9 × 7
model_ID yvar term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 bill1 bill_length_mm (Intercept) 27.9 1.32 21.1 3.83e- 63
2 bill1 bill_length_mm body_mass_g 0.00367 0.000331 11.1 1.51e- 24
3 bill1 bill_length_mm sexmale 1.25 0.532 2.34 1.97e- 2
4 bill2 bill_depth_mm (Intercept) 23.7 0.365 65.0 4.27e-190
5 bill2 bill_depth_mm body_mass_g -0.00188 0.0000912 -20.6 2.57e- 61
6 bill2 bill_depth_mm sexmale 2.75 0.147 18.8 5.96e- 54
7 bill3 flipper_length_mm (Intercept) 135. 1.99 67.6 2.28e-195
8 bill3 flipper_length_mm body_mass_g 0.0162 0.000498 32.6 3.33e-105
9 bill3 flipper_length_mm sexmale -3.96 0.801 -4.94 1.25e- 6
Another way to change the name of "Model_ID"
.
c("bill_length_mm", "bill_depth_mm", "flipper_length_mm") %>%
future_map(~ tidy_lm_yvar(.x,
confounders = "sex"
)) %>%
list_rbind(names_to = "model_ID") %>%
mutate(model_ID = str_replace(model_ID, "^", "bill"))
# A tibble: 9 × 7
model_ID yvar term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 bill1 bill_length_mm (Intercept) 27.9 1.32 21.1 3.83e- 63
2 bill1 bill_length_mm body_mass_g 0.00367 0.000331 11.1 1.51e- 24
3 bill1 bill_length_mm sexmale 1.25 0.532 2.34 1.97e- 2
4 bill2 bill_depth_mm (Intercept) 23.7 0.365 65.0 4.27e-190
5 bill2 bill_depth_mm body_mass_g -0.00188 0.0000912 -20.6 2.57e- 61
6 bill2 bill_depth_mm sexmale 2.75 0.147 18.8 5.96e- 54
7 bill3 flipper_length_mm (Intercept) 135. 1.99 67.6 2.28e-195
8 bill3 flipper_length_mm body_mass_g 0.0162 0.000498 32.6 3.33e-105
9 bill3 flipper_length_mm sexmale -3.96 0.801 -4.94 1.25e- 6
Done!