---
output: html_document
editor_options:
chunk_output_type: console
---
# Parallel Processing {#parallel-processing}
```{r setup_parallel_chapter, include=FALSE}
# Make this Rmd file stand-alone by loading some stuff.
library( tidyverse )
library(future)
library(furrr)
```
Especially if you take our advice of "when in doubt, go more general," and if you are running enough replicates to get nice and small Monte Carlo\index{Monte Carlo simulation} errors, you will quickly come up against the computational limits of your computer.
Simulations can be incredibly computationally intensive, and there are a few means for dealing with that.
The first is to optimize code by removing extraneous calculation (e.g., by writing methods from scratch rather than using the safety-checking and thus sometimes slower methods in R, or by saving calculations that are shared across different estimation approaches) to make it run faster.
This approach is usually quite hard, and the benefits often minimal; see Appendix \@ref(optimize-code) for further discussion and examples.
The second is to use more computing power by making the simulation _parallel_.
This latter approach is the topic of this chapter.
_Parallel computation_ is where you have multiple computers working on a problem at the same time.
Simulations are very easy to do in parallel.
With a multifactor experiment it is very easy to break apart the overall simulation into pieces.
For example, each computer could do one of your scenarios in a multi-factor simulation.
Even without a multifactor experiment, due to the cycle of "generate data, then analyze," it is easy to have a bunch of computers doing the same thing, as long as they are generating different datasets, with a final collection step where all the individual iterations are combined into a single set at the end.
In either case, the _total_ work is the same, but since each computer is doing a piece at the same time, the overall simulation will be completed much sooner.
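As a concrete sketch of the second strategy, each machine can work on its own slice of the scenario grid. Here `params` is a grid of scenarios (one per row), `run_sim` is a hypothetical simulation function, and `machine_id` would be set to a different value on each machine:

```{r, eval=FALSE}
# A sketch: divide the scenario grid across, say, 4 machines.
# `machine_id` would be 1, 2, 3, or 4, set differently per machine.
n_machines <- 4
machine_id <- 1
my_rows <- seq( machine_id, nrow( params ), by = n_machines )
my_chunk <- params[ my_rows, ]
my_chunk$res <- pmap( my_chunk, run_sim )
```

Taking every `n_machines`-th row (rather than a contiguous block) helps balance the load when adjacent scenarios have similar computational cost.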
There are three general ways to do parallel calculation.
The first is to take advantage of the fact that most modern computers come with multiple cores (in effect, several small computers) built in.
For this simplest approach, we simply tell R to use more of this processing power.
If your computer has eight cores, you can easily get a near eight-fold increase in the speed of your simulation.
The second is to use cloud computing, putting your simulation on a virtual computer that might have 50 or more cores.
Other than having to handle moving files back and forth from your machine to the cloud, this is not much different from the first approach: you are still just telling R to use more cores, but now you have more cores to use.
Even if the speed-up itself is not massive, this approach also takes the computing off your computer entirely, making it easier to set up a job to run for days or weeks without disrupting your day-to-day work.
The third way is to use a computing cluster.
A cluster is a network of hundreds or thousands of computers, coupled with commands for breaking apart a simulation into pieces and sending the pieces to that vast army.
Conceptually, this approach is the same as when you do baby parallel on your desktop: more cores equals more simulations per minute and thus faster simulation overall.
The interface to a cluster can be a bit tricky, however, and very cluster-dependent.
Once you get a cluster up and running, though, it can be a very powerful tool.
Not only does it take the work off your personal computer, a cluster can potentially give you hundreds of cores, which means a speed-up of hundreds rather than four or eight.
## Parallel on your computer
Most modern computers have multiple cores, so you can run a parallel simulation right in the privacy of your own home!
To assess how many cores you have on your computer, you can use the `detectCores()` method in the `parallel` package:
```{r}
parallel::detectCores()
```
Normally, unless you tell it to do otherwise, __*R only uses one core*__.
This is obviously a bit lazy on R's part.
But it is easy to take advantage of multiple cores using the `future` and `furrr` packages.\index{package!future}\index{package!furrr}
```{r, eval = TRUE}
library(future)
library(furrr)
```
In particular, the `furrr` package replicates our `map` functions, but in parallel.
We first tell our R session what kind of parallel processing\index{parallel processing} we want using the `future` package.
In general, using `plan(multisession)` is the cleanest: it will start one entire R session per core, and have each session do work for you.
The alternative, `multicore`, does not seem to work well with Windows machines, nor with RStudio in general.
The call is simple:
```{r, eval = FALSE}
plan(multisession, workers = parallel::detectCores() - 1 )
```
The `workers` parameter specifies how many of your cores you want to use.
Using all but one will let your computer still operate mostly normally for checking email and so forth.
You are carving out a bit of space for your own adventures.
Once you set up your plan, you use `future_pmap()`; it works just like `pmap()` but evaluates across all available workers specified in the plan call.
Here we are running a parallel version of the multifactor experiment discussed in Chapter \@ref(exp_design) (see Chapter \@ref(case_Cronback) for the simulation itself).
```{r, include=FALSE}
source( "case_study_code/cronbach_alpha_simulation.R")
```
```{r demo_future_pmap, eval = TRUE, cache = TRUE}
tictoc::tic()
params$res = future_pmap( params,
                          .f = run_alpha_sim,
                          .options = furrr_options(seed = NULL) )
tictoc::toc()
```
Note the `.options = furrr_options(seed = NULL)` part of the argument.
This is to silence some warnings.
Given how tasks are handed out, R will get upset if you do not do some handholding regarding how it should set seeds for pseudorandom number generation.
In particular, if you do not set the seed, the multiple sessions could end up having the same starting seed and thus run the exact same simulations (in principle).
We have already seen how to set a specific seed for each simulation scenario, but `furrr` does not know we have done this.
Hence the extra argument about seeds: it explicitly tells `furrr` that we are handling seed setting on our own.
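If your simulation function did not set its own seeds, an alternative is to let `furrr` manage random number generation itself: passing `seed = TRUE` gives each worker a statistically independent, parallel-safe random number stream.

```{r, eval=FALSE}
# Let furrr generate parallel-safe random number streams for each worker.
params$res = future_pmap( params,
                          .f = run_alpha_sim,
                          .options = furrr_options(seed = TRUE) )
```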
We can compare the running time to running in serial (i.e., using only one worker):
```{r demo_pmap_timed, eval = TRUE, cache = TRUE}
tictoc::tic()
params$res2 = dplyr::select( params, n:seed ) %>%
  pmap( .f = run_alpha_sim )
tictoc::toc()
```
(The `select` command drops the `res` column from the parallel run; it would otherwise be passed as a parameter to `run_alpha_sim`, which would in turn cause an error due to the unrecognized parameter.)
## Parallel on a virtual machine
Your laptop probably has around eight cores, meaning you can get an eight-fold speed-up.
But wouldn't it be nice to have a computer with 50 cores? Or even more?
You can get one!
Cloud services, such as Amazon Web Services (AWS), can give you a _virtual machine_ that can have many cores (where many is usually 50 or so).
Some of these services give you what is effectively an R Studio session, and so you can run your scripts and everything just like we have discussed above, with no change.
The Data Colada blog advocates for an alternative to AWS, [Kamatera](https://www.kamatera.com).
The Data Colada folks say "it is much easier to use, way faster to set up, and flexible (e.g., you can easily update R and R Studio in it)."
See [Data Colada's post](https://datacolada.org/102) for more information.
This kind of cloud computing is relatively straightforward, once you understand parallel on your own computer.
But you can go further, where you dispatch a very large number of computers on your task.
We discuss this last option next.
## Parallel on a cluster
In general, a "cluster" is a system of computers that are connected up to form a large distributed network that many different people can use to do large computational tasks (e.g., simulations!).
These clusters will have some overarching coordinating programs that you, the user, interact with to set up a "job," or set of jobs, which is a set of tasks you want some number of the computers on the cluster to do for you in tandem.
These coordinating programs will differ, depending on what cluster you are using, but have some similarities that bear mention.
For running simulations, you need only a small amount of knowledge about how to engage with these systems, because you do not need the individual computers working on your project to communicate with each other (which is the hard part of distributed computing, in general).
### Command-line interfaces and CMD mode in R
In the good ol' days, when things were simpler, yet more difficult, you would interact with your computer via a "command-line interface."
The easiest way to think about this is as an R console, but in a different language that the entire computer speaks.
A command line interface is designed to do things such as find files with a specific name, or copy entire directories, or, importantly, start different programs.
One place you may have used a command line interface is when working with Git: anything fancy with Git is often done via command-line.
People will talk about a "shell" (a generic term for this computer interface) or "bash" or "csh."
You can get access to a shell from within RStudio by clicking on the "Terminal" tab.
Try it, if you have never done anything like this before, and type (the ">" is not part of what you type):
```
> ls
```
It should list some file names.
Note this command does _not_ have the parentheses after the command, unlike in R and most other programming languages.
The syntax of a shell is usually mystifying and brutal: it is best to just steal scripts from the internet or some generative AI and try not to think about it too much, unless you want to think about it a lot.
Importantly for us, from the command line interface you can start an R program, telling it to start up and run a script for you.
This way of running R is noninteractive: you say "go do this thing," and R starts up, goes and does it, and then quits.
Any output R generates on the way will be saved in a file, and any files you save along the way will also be at your disposal once R has completed.
To see this in action, make the following script in a file called "dumb_job.R":
```
library( tidyverse )
cat( "Making numbers\n" )
Sys.sleep(30)
cat( "Now I'm ready\n" )
dat = tibble( A = rnorm( 1000 ), B = runif( 1000 ) * A )
write_csv( dat, file="sim_results.csv" )
Sys.sleep(30)
cat( "Finished\n" )
```
After you save, open the terminal and type `ls` to list the files, as above.
Do you see your `dumb_job.R` file?
If not, your terminal session is in the wrong directory.
In your computer system, files are stored in a directory structure, and when you open a terminal, you are somewhere in that structure.
To find out where, you can type
```
> pwd
```
for "Print Working Directory".
Save your dumb job file to wherever the above says.
You can also change directories using `cd`, e.g., `cd ~/Desktop/temp` means "change directory to the temp folder inside Desktop inside my home directory" (the `~` is shorthand for home directory).
One more useful command is `cd ..` (go up to the parent directory).
Once you are in the directory with your file, type:
```
> R CMD BATCH --no-save dumb_job.R R_output.txt
```
The above command says "Run R" (the first part) in batch mode (the "CMD BATCH" part), meaning: source the `dumb_job.R` script as soon as R starts, save all console output in the file `R_output.txt` (in the current directory where you run the program), and do not save the workspace when finished (the `--no-save` flag).
This command should take about a minute to complete, because our script sleeps a lot (the sleep represents your script doing a lot of work, like a real simulation would do).
Once the command completes (you will see your ">" prompt come back), verify that you have the `R_output.txt` and the data file `sim_results.csv` by typing `ls`.
If you open up your Finder or Microsoft equivalent, you can actually see the `R_output.txt` file appear half-way through, while your job is running.
If you open it, you will see the usual header of R loading up, the "Making numbers" comment, and so forth.
R is saving all the output as it works through your script.
Running R in this fashion is the key element to a basic way of setting up a massive job on the cluster: you launch a bunch of R programs that are each running a script like this on different computers in the cluster.
They will all save their results to files (they will have files of different names so you do not overwrite your work) and then you will gather these files together to get your final set of results.
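Gathering the pieces back together is then straightforward. As a sketch (assuming each job saved its results as a tibble in an `.rds` file in a `raw_results/` directory), you can read and stack everything in a few lines:

```{r, eval=FALSE}
library( tidyverse )

# Find every result file and stack them into one dataset.
files <- list.files( "raw_results", pattern = "\\.rds$", full.names = TRUE )
results <- map( files, readRDS ) %>%
  bind_rows()
```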
*Small Exercise:* Try putting an error in your `dumb_job.R` script. What happens when you run it in batch mode?
### Task dispatchers and command line scripts
In the above, you can run a command on the command-line, but the command line interface will pause while it runs.
As you saw, when you hit return with the above R command, the program just sat there for a minute before you got your command-line prompt back, due to the sleep.
When you properly run a big job (program) on a cluster, it does not quite work that way.
You will instead set a program to run, but tell the cluster to run it somewhere else (people might say "run it in the background").
This is good because you get your command-line prompt back, and can use it to tell your computer to do other things, all while the program is running.
There are various methods for starting a job, but they usually boil down to a request from you to some sort of managerial process (a dispatcher) that takes requests and assigns some computer, somewhere, to do them.
For a sense of this process, imagine a dispatcher at a taxi company.
You call up, ask for a ride, and the dispatcher sends you a taxi to do it.
The dispatcher is just fielding requests, assigning them to taxis.
One dispatcher found on many clusters is called Slurm (which may or may not be on the cluster you are attempting to use; this is where a lot of this information gets very cluster-specific).
You first set up a script that describes the job to be run.
It is like a work request.
These scripts are plain text files, such as this example (`sbatch_runScript.txt`):
```
#!/bin/bash
# Number of cores requested
#SBATCH -n 32
# Ensure that all cores are on one machine
#SBATCH -N 1
# Upper bound on runtime in minutes
#SBATCH -t 480
# Partition to submit to
#SBATCH -p stats
# Memory per cpu in MB
#SBATCH --mem-per-cpu=1000
# Append to output file, don't truncate
#SBATCH --open-mode=append
# Standard out goes to this file
#SBATCH -o /output/directory/out/%j.out
# Standard err goes to this file
#SBATCH -e /output/directory/out/%j.err
# Email notification options : BEGIN, END, FAIL, ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@gmail.com
# Load modules for the computing environment
source new-modules.sh
module load gcc/7.1.0-fasrc01
module load R
# Set user R library path
export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER
# Run R script; write output to indexed log file
R CMD BATCH --no-save --no-restore \
    estimator_performance_simulation.R \
    logs/R_output_${INDEX_VAR}.txt
```
This file starts by setting a bunch of variables that describe how sbatch should handle the request.
It then has a series of commands that gets the computer environment ready.
Finally, it has the `R CMD BATCH` command that does the work you want.
These scripts can be quite confusing to understand.
There are so many options!
What do these things even do?
The answer, for researchers early in their journey of doing this kind of work, is "Who knows?"
The general rule is to find a working example file for the system you are on, and then modify it for your own purposes.
Once you have such a file, you could run it on the command line, like this:
```
sbatch -o stdout.txt \
--job-name=my_script \
sbatch_runScript.txt
```
When you run this, the command line will _not_ sit there and wait for the job to be done.
The `sbatch` command will instead send the job off to some computer which will do the work in the background.
Interestingly, your R script could, at this point, do the "one computer" parallel type code listed above and use `future_pmap()` or similar.
Note the script above asks for 32 cores; your single job could then have 32 cores all working away on their individual pieces of the simulation, as before.
You would have a 32-fold speedup, in this case.
This is the core element to having your simulation run on a cluster.
The next step is to do this _a lot_, sending off a bunch of these jobs to different computers.
### Moving files to and from a cluster
One of the larger issues with working on a cluster is you will need to get all your code _to_ the cluster and get all of your results _back_.
One approach for getting the code to the cluster is to use GitHub.
Make a project, and then check out the project on the cluster.
This has the advantage of making it easier to develop your code on your local computer, where you can easily run your RStudio or other IDE.
It also has the advantage of allowing you to record which GitHub version of code you ran, so you can know exactly which code produced which results.
For getting your results off the cluster, it is not recommended to use GitHub, as the total of all your results may take many gigabytes of space.
Large results files are also not really in the spirit of a version control system.
Instead, use an scp client such as FileZilla to transfer files back to your personal computer.
Overall, downloading your simulation results outside of Git, and keeping your code in Git, is a good rule of thumb.
### Checking on a job
Once your job is working on the cluster, it will keep at it until it finishes (or crashes, or is terminated for taking up too much memory or time).
As it chugs away, there will be different ways to check on it.
For example, you can, from the console, list the jobs you have running to see what is happening:
```
sacct -u lmiratrix
```
except, of course, "`lmiratrix`" would be changed to whatever your username is.
This will list whether your job is running, pending, timed out, etc. If it is pending, that usually means someone else is hogging up space on the cluster and your job request is waiting in a queue to be assigned.
The `sacct` command is customizable, e.g.,
```
sacct -u lmiratrix --format=JobID,JobName%30,State
```
will not truncate your job names, so you can find them more easily.
You can check on a specific job, if you know the ID:
```
squeue -j JOBID
```
A fun one: you can check who is running jobs on the stats partition by typing:
```
showq-slurm -p stats -o
```
You can also look at the log files
```
tail my_log_file.log
```
to see if it is logging information as it is working.
The email arguments, above, cause the system to email you when the job begins and ends.
The email notifications you can choose are `BEGIN`, `END`, `FAIL`, and `ALL`; `ALL` is generally good. What is a few more emails?
### Running lots of jobs on a cluster
We have seen how to fire off a job (possibly a big job) that can run over days or weeks to give you your results.
There is one more piece that can allow you to use even more computing resources to do things even faster, which is to do a whole bunch of job requests like the above, all at once.
This multiple dispatching of sbatch commands is the final component for large simulations on a cluster: you are setting in motion a bunch of processes, each set to a specific task.
Asking for multiple, smaller, jobs is also nicer for the cluster than having one giant job that goes on for a long time.
By dividing a job into smaller pieces, and asking the scheduler to schedule those pieces, you can let the scheduler share and allocate resources between you and others more fairly.
It can make a list of your jobs, and farm them out as it has space.
This might go faster for you: with a really big job, the scheduler cannot even start it until the needed number of workers is available.
With smaller jobs, you can use lots of little openings in the schedule to get your work done.
Especially since simulation is so independent (just doing the same thing over and over), there is rarely any need for one giant process that has to do everything.
To make multiple, related, requests, we create a for-loop in the Terminal to make a whole series of `sbatch` requests.
Then, each sbatch request will do one part of the overall simulation.
We can write this program in the shell, just like you can write R scripts in R.
A shell script runs a bunch of shell commands for you, and can even have variables and loops and all of that fun stuff.
For example, the following `run_full_simulation.sh` is a script that fires off a bunch of jobs for a simulation.
Note that it makes a variable `INDEX_VAR`, and sets up a loop so it can run 500 tasks indexed 1 through 500.
The first `export` line adds a collection of R libraries to the path stored in `R_LIBS_USER` (a "path" is a list of places where R will look for libraries).
The next line sets up a for loop: it will run the indented code once for each number from 1 to 500.
The script also specifies where to put log files and names each job with the index so you can know who is generating what file.
```
export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER
for INDEX_VAR in $(seq 1 500); do
#print out indexes
echo "${INDEX_VAR}"
#give indexes to R so it can find them.
export INDEX_VAR
#Run R script, and produce output files
sbatch -o logs/sbout_p${INDEX_VAR}.stdout.txt \
--job-name=runScr_p${INDEX_VAR} \
sbatch_runScript.txt
sleep 1 # pause to be kind to the scheduler
done
```
One question remains: how do the different processes know which part of the simulation they should be working on?
E.g., each worker needs to have its own seed so it does not do exactly the same simulation as a different worker!
The workers also need their own filenames so they save things in their own files.
The key is the `export INDEX_VAR` line: this puts a variable in the environment that will be set to a specific number.
Inside your R script, you can get that index like so:
```{r, eval=FALSE}
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
```
You can then use the index to make unique filenames when you save your results, so each process has its own filename:
```{r, eval=FALSE}
filename = paste0( "raw_results/simulation_results_", index, ".rds" )
```
You can also offset your seeds, for example:
```{r, eval=FALSE}
factors = mutate( factors,
                  seed = 1000 * seed + index )
```
Now even if you have a series of seeds within the simulation script (as we have seen before), each script will have unique seeds not shared by any other script (assuming you have fewer than 1000 separate job requests).
This still doesn't exactly answer how to have each worker know what to work on.
Consider the case of our multifactor experiment, where we have a large combination of simulation trials we want to run.
There are two approaches one might use here.
One simple approach is the following: generate all the factors with `expand_grid()` as usual, and then take the row of this grid that corresponds to our index.
```{r, eval=FALSE}
sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
filename = paste0( "raw_results/simulation_results_", index, ".rds" )
stopifnot( index >= 1 && index <= nrow(sim_factors) )
do.call( my_sim_function, sim_factors[ index, ] )
```
The `do.call()` command runs the simulation function, passing all the arguments listed in the targeted row.
You then need to make sure you have your shell call the right number of workers to run your entire simulation.
One problem with this approach is some simulations might be a lot more work than others: consider your simulation with a huge sample size vs. one with a small sample size.
Instead, you can have each worker run a small number of simulations of each scenario, and then stack your results later.
E.g.,
```{r, eval=FALSE}
sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
sim_factors$seed = 1000000 * index + 17 * 1:nrow(sim_factors)
```
and then do your usual `pmap` call with `R = 10` (or some other small number of replicates).
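Putting these pieces together, the body of each worker's script might look like the following sketch (`my_sim_function` is a hypothetical simulation function assumed to take the factor levels plus `seed` and `R` arguments, and to return a tibble of results):

```{r, eval=FALSE}
sim_factors = expand_grid( ... )

# Unique seeds for this worker, then a few replicates of every scenario.
index <- as.numeric( Sys.getenv( "INDEX_VAR" ) )
sim_factors$seed = 1000000 * index + 17 * 1:nrow(sim_factors)
res <- pmap( sim_factors, my_sim_function, R = 10 )

# Save under a filename unique to this worker.
saveRDS( bind_rows( res ),
         file = paste0( "raw_results/simulation_results_", index, ".rds" ) )
```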
For saving files and then loading and combining them for analysis, see Section \@ref(saving-files).
### Resources for Harvard's Odyssey
The above guidance is tailored primarily to Harvard's computing environment.
For that environment in particular, there are many additional resources such as:
* Odyssey Guide: https://rc.fas.harvard.edu/resources/odyssey-quickstart-guide/
* R on Odyssey: https://rc.fas.harvard.edu/resources/documentation/software/r/
For installing R packages so they are seen by the scripts run by sbatch, see https://www.rc.fas.harvard.edu/resources/documentation/software-on-odyssey/r/
Other clusters should have similar documents giving needed guidance for their specific contexts.
### Acknowledgements
Some of the above material is based on tutorials built by Kristen Hunter and Zach Branson, past doctoral students of Harvard's statistics department.