A guide to "embarrassingly parallel" workflows on SeaWulf
This article will discuss how to successfully parallelize "embarrassingly parallel" workflows using multiple CPUs and multiple nodes on SeaWulf.
One of the goals of High Performance Computing is to utilize many CPUs in parallel to decrease the time required to complete an analysis.
Users may often find themselves in a situation where they need to run the same task many times independently, where each individual task does not need to communicate with any other (e.g., running a piece of code on many different inputs). This is commonly referred to as an "Embarrassingly Parallel" situation.
Embarrassingly Parallel workflows on SeaWulf
There are several different ways of efficiently executing an embarrassingly parallel workflow on SeaWulf. As a simple example, let's use the following python script:
#!/usr/bin/env python

import numpy as np
import argparse

parser = argparse.ArgumentParser(description="Take in a number and print out its square")
parser.add_argument('integer', metavar='N', type=int, help='an integer for the calculation')
args = parser.parse_args()

result = int(np.square(args.integer))

print(f'The square of {args.integer} is {result}')
This script takes a single integer argument and prints out the square of that integer. Note that this toy calculation is fast and simple enough even when run in serial that parallelizing it is not especially necessary or beneficial. But many real workflows will benefit from parallelization.
That caveat aside, let's run the script:
$ python number_square.py 8
The square of 8 is 64
Note: This script requires Python 3.6+, which can be loaded with the following module:
module load anaconda/3
If we wanted to run this script on every integer from 1 to 100, we could do this sequentially:
python number_square.py 1
python number_square.py 2
python number_square.py 3
python number_square.py 4
python number_square.py 5
...
python number_square.py 100
However, executing this script separately 100 times would be annoying to set up and would potentially take a prohibitively long time to run for anything but a toy example like this. Running each calculation in a loop and then backgrounding each process would be one alternative, but this may result in more simultaneous processes running than we want.
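For illustration, a minimal sketch of that loop-and-backgrounding approach might look like the following (note that all 100 processes are launched immediately, with nothing to limit how many run at once):

# launch each calculation in the background; nothing limits how many run simultaneously
for i in {1..100}
do
    python number_square.py "$i" &
done

# wait for all backgrounded processes to finish
wait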
Instead, we can take advantage of one of several different tools to parallelize the execution of this script across all 100 inputs.
1. GNU Parallel
GNU parallel is an open source command line tool that is both easy to use and designed specifically for embarrassingly parallel workflows. To access GNU parallel on SeaWulf, do the following:
module load gnu-parallel/6.0
Let's use GNU parallel as part of a Slurm job script:
#!/usr/bin/env bash

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=05:00
#SBATCH --partition=short-40core
#SBATCH --job-name=parallel_job
#SBATCH --output=squared_numbers.txt

# load required modules
module load anaconda/3
module load gnu-parallel/6.0

# parallelize the calculation across each input, running 40 processes at a time
parallel --jobs 40 python number_square.py {} ::: {1..100}
In the above example, we are requesting a single node job, with 1 Slurm task that utilizes all 40 cores of the node.
In the parallel command, the "--jobs 40" flag indicates that we want to run 40 processes at a time (the default is to run as many processes in parallel as there are cores on the node, so setting this flag is not required if we want to utilize all 40 cores at once). The curly brackets "{}" indicate the replacement string, i.e., a placeholder that is replaced by each of our inputs when the command is run. The ":::" separates the command to be run from the inputs. Finally, "{1..100}" is the shell syntax for specifying a range of values (in increments of 1) from 1 to 100.
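Note that the inputs do not have to be listed after ":::"; GNU parallel can also read them from standard input. For example, the same 100 calculations could be launched by piping the output of seq to parallel:

seq 1 100 | parallel --jobs 40 python number_square.py {}

Either way of supplying the inputs produces the same set of results.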
After running this job, we should see the following in the output file:
The square of 1 is 1
The square of 2 is 4
The square of 6 is 36
The square of 10 is 100
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25
The square of 7 is 49
The square of 8 is 64
The square of 9 is 81
The square of 11 is 121
The square of 19 is 361
...
As we can see, GNU parallel does not by default maintain the original input order of the numbers. To force it to do this, we could add the "--keep-order" flag.
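For example, the parallel command in the job script above could be changed to the following so that the output appears in the same order as the inputs:

parallel --keep-order --jobs 40 python number_square.py {} ::: {1..100}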
This simple framework can be adapted to parallelize a variety of workflows, even many that are much more complex than this simple example. GNU parallel has excellent and thorough documentation, which users are strongly encouraged to read.
2. Python's multiprocessing library
Since we're working with a Python example, we could also simply do the parallelization of the calculations within Python itself. One useful library for accomplishing this is the multiprocessing library. Let's modify the original Python script so that it uses multiprocessing for parallelization:
#!/usr/bin/env python

import numpy as np
import multiprocessing as mp

# define a function for the calculation
def square_me(num):
    result = int(np.square(num))
    return result

# spawn a pool of 40 parallel workers
p = mp.Pool(processes=40)

# submit all 100 calculations to the pool with apply_async so the
# workers can run them in parallel, collecting the async result objects
async_results = [p.apply_async(square_me, [num]) for num in range(1, 101)]

# retrieve and print each result in input order
for num, res in zip(range(1, 101), async_results):
    print(f'The square of {num} is {res.get()}')

# close the pool of workers
p.close()
p.join()
As before, let's run this as a Slurm job:
#!/usr/bin/env bash

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=05:00
#SBATCH --partition=short-40core
#SBATCH --job-name=parallel_job
#SBATCH --output=squared_numbers.txt

module load anaconda/3

python number_square_mp.py
And in the output file we see the following:
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25
The square of 6 is 36
The square of 7 is 49
The square of 8 is 64
The square of 9 is 81
The square of 10 is 100
...
As in the GNU parallel example, the calculation is run for all 100 integers, with the pool of workers executing up to 40 calculations at a time.
3. Slurm job array
A Slurm job array is a powerful tool for taking an embarrassingly parallel task and executing it across multiple nodes.
Staying with our toy python calculation, let's use a Slurm job array to run the calculation on 100 numbers:
#!/usr/bin/env bash

#SBATCH --job-name=array_test
#SBATCH --output=array_test.%A_%a.log
#SBATCH --ntasks=1
#SBATCH -N 1
#SBATCH -p short-40core
#SBATCH -t 04:00:00
#SBATCH --array=1-100

module load anaconda/3

echo "Starting task $SLURM_ARRAY_TASK_ID"

INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" 100_numbers.txt)

python number_square.py $INPUT
In this example, we are requesting 100 independent jobs by specifying "#SBATCH --array=1-100". Each job in the array will receive a separate task ID captured by the $SLURM_ARRAY_TASK_ID environment variable. Each job in the array will also produce its own log file that lists both the job ID and task ID, as indicated by "#SBATCH --output=array_test.%A_%a.log".
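For instance, if Slurm assigned the array job an ID of 816678 (the job ID used in the example output further below), the individual log files would be named along these lines:

$ ls array_test.816678_*.log
array_test.816678_1.log
array_test.816678_2.log
array_test.816678_3.log
...
array_test.816678_100.log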
To actually run this job, we start with a text file ("100_numbers.txt") that lists each of the integers (one per line) that we want to run the calculation on.
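One simple way to create this file (listing the integers 1 through 100 in order, one per line) is with the seq command:

seq 1 100 > 100_numbers.txt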
We then use a sed expression to assign each integer to one of the array task IDs:
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" 100_numbers.txt)
And then we execute the python script, giving it the new $INPUT variable as an argument.
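For example, assuming "100_numbers.txt" lists the integers 1 through 100 in order, array task 74 pulls line 74 of the file, so the script receives 74 as its argument:

$ sed -n "74p" 100_numbers.txt
74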
The results of each calculation are printed to each array task log file. E.g.,
[mynetid@login2 ~]$ cat array_test.816678_74.log
Starting task 74
The square of 74 is 5476
One caveat to this approach is that we are submitting a separate array task to a separate node for each entry in the input file ("100_numbers.txt" in this case). Because of this, there is no parallelization taking place within a node. Thus, this approach is most efficient when each task that will be run is already multithreaded and can take advantage of all the cores on the node it is running on.
In our example, this is not the case--the code is not multithreaded. Thus, it might be beneficial to try to parallelize this workflow both within the node and across different nodes. We can do this by combining Slurm arrays with GNU parallel in the following example (modified from this article on Slurm arrays):
#!/usr/bin/env bash

#SBATCH --job-name=array_test
#SBATCH --output=array_test.%A_%a.log
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH -N 1
#SBATCH -p short-40core
#SBATCH -t 04:00:00
#SBATCH --array=1-40

# create a directory for temporary files (-p avoids an error if it already exists)
mkdir -p temp

module load anaconda/3
module load gnu-parallel/6.0

START=$SLURM_ARRAY_TASK_ID
NUMLINES=100
STOP=$((SLURM_ARRAY_TASK_ID*NUMLINES))
START="$(($STOP - $(($NUMLINES - 1))))"

echo "START=$START"
echo "STOP=$STOP"

for (( N = $START; N <= $STOP; N++ ))
do
    LINE=$(sed -n "$N"p 4k_numbers.txt)
    echo $LINE >> temp/tasks_${START}_${STOP}
done

cat temp/tasks_${START}_${STOP} | parallel --jobs 40 --verbose python number_square.py {}

# clean up the temp files
rm temp/tasks_${START}_${STOP}
Here, we are using a slightly more complicated series of shell expressions to start a job with 40 different array tasks, each of which runs 100 calculations in parallel (4000 calculations in total).
As in the previous array example, we start with an input text file that lists each of the integers to be used in the calculation. This time, the file is called "4k_numbers.txt."
For each task, we use the following to specify which range of numbers to process:
START=$SLURM_ARRAY_TASK_ID
NUMLINES=100
STOP=$((SLURM_ARRAY_TASK_ID*NUMLINES))
START="$(($STOP - $(($NUMLINES - 1))))"
Each task therefore processes its own block of 100 lines: for example, array task 3 computes STOP = 3 * 100 = 300 and START = 300 - 99 = 201, so it handles lines 201 through 300 of the input file. Then we loop over this range and use sed to write out a temporary file that includes the specific numbers that will be used for this task:
for (( N = $START; N <= $STOP; N++ ))
do
    LINE=$(sed -n "$N"p 4k_numbers.txt)
    echo $LINE >> temp/tasks_${START}_${STOP}
done
Finally, we use this temp file to run the 100 calculations in parallel with GNU parallel (40 calculations at a time) and then remove the temp file once it is no longer needed:
cat temp/tasks_${START}_${STOP} | parallel --jobs 40 --verbose python number_square.py {}

# clean up the temp files
rm temp/tasks_${START}_${STOP}
Now, when we examine the log file for each Slurm task, we have the results for 100 different calculations:
The square of 10 is 100
The square of 3 is 9
The square of 8 is 64
The square of 27 is 729
The square of 18 is 324
The square of 6 is 36
The square of 12 is 144
The square of 19 is 361
The square of 33 is 1089
The square of 32 is 1024
The square of 15 is 225
The square of 23 is 529
The square of 2 is 4
The square of 35 is 1225
...
In this case, we have managed to parallelize the calculation across different cores of the same node, and simultaneously across multiple nodes.
These are just a few options for parallelizing an embarrassingly parallel workflow across multiple cores and among different nodes. Using one or more of these approaches can allow users to take advantage of SeaWulf's large number of CPUs to increase the speed and efficiency of their jobs.
For a guide on how to parallelize your communication-dependent (i.e., non-embarrassingly parallel) workflows, please see part 2.