Using the Slurm workload manager
Slurm is an open source workload manager and job scheduler that is now used for all SeaWulf queues in place of PBS Torque/Maui. This FAQ will explain how to use Slurm to submit jobs. This FAQ utilizes information from several web resources. Please see here and here for additional documentation.
To use Slurm, first load the Slurm module:
module load slurm
Slurm wrappers for Torque
Users are encouraged to learn to use Slurm commands (see below) to take full advantage of Slurm's flexibility and to facilitate easier troubleshooting. However, to ease the transition between the two workload systems, Slurm comes equipped with several wrapper scripts that will allow users to use many common Torque commands. These include:
qsub, qstat, qdel, qhold, qrls, and pbsnodes
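For example, once the slurm module is loaded, familiar Torque-style commands such as the following should still work through these wrappers (the script name here is just a placeholder):
qsub myjob.sh
qstat -u $USER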
Slurm commands
The following tables provide a list of HPC job-related functions and the equivalent Torque and Slurm commands needed to execute these functions.
User commands | PBS/Torque | SLURM |
---|---|---|
Job submission | qsub [filename] | sbatch [filename] |
Job deletion | qdel [job_id] | scancel [job_id] |
Job status (by job) | qstat [job_id] | squeue --job [job_id] |
Full job status (by job) | qstat -f [job_id] | scontrol show job [job_id] |
Job status (by user) | qstat -u [username] | squeue --user=[username] |
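For instance, a typical submit/monitor/cancel cycle with the native Slurm commands might look like the following (the job ID shown is only an illustrative value printed by sbatch):
sbatch test.slurm          # submit the job; sbatch prints the assigned job ID
squeue --user=$USER        # check the status of all of your queued and running jobs
scontrol show job 123456   # show the full details of one job (123456 is a placeholder ID)
scancel 123456             # cancel that job if it is no longer needed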
Environment variables | PBS/Torque | SLURM |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOBID |
Submit Directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Node List | $PBS_NODEFILE | $SLURM_JOB_NODELIST |
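As a minimal sketch of how these variables might be used inside a job script (the directory names are only assumptions for illustration):
#!/bin/bash
#SBATCH --job-name=env-demo
#SBATCH --output=res.txt
#SBATCH -p short-40core
#SBATCH --time=05:00

# Start from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# Use the job ID to create a uniquely named results directory
mkdir -p results_$SLURM_JOBID

# Record which nodes were allocated to this job
echo "Allocated nodes: $SLURM_JOB_NODELIST" > results_$SLURM_JOBID/nodes.txt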
Job specification | PBS/Torque | SLURM |
---|---|---|
Script directive | #PBS | #SBATCH |
Job Name | -N [name] | --job-name=[name] OR -J [name] |
Node Count | -l nodes=[count] | --nodes=[min[-max]] OR -N [min[-max]] |
CPU Count | -l ppn=[count] | --ntasks-per-node=[count] |
CPUs Per Task | | --cpus-per-task=[count] |
Memory Size | -l mem=[MB] | --mem=[MB] OR --mem-per-cpu=[MB] |
Wall Clock Limit | -l walltime=[hh:mm:ss] | --time=[min] OR --time=[days-hh:mm:ss] |
Node Properties | -l nodes=4:ppn=8:[property] | --constraint=[list] |
Standard Output File | -o [file_name] | --output=[file_name] OR -o [file_name] |
Standard Error File | -e [file_name] | --error=[file_name] OR -e [file_name] |
Combine stdout/stderr | -j oe (both to stdout) | (Default if you don’t specify --error) |
Job Arrays | -t [array_spec] | --array=[array_spec] OR -a [array_spec] |
Delay Job Start | -a [time] | --begin=[time] |
Select Queue | -q [queue_name] | -p [queue_name] |
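Putting several of these directives together, a typical Slurm script header might look like the sketch below. The resource values and the short-40core queue name are just examples and should be adjusted to your job:
#!/bin/bash
#SBATCH --job-name=my_job        # Job Name
#SBATCH --nodes=2                # Node Count
#SBATCH --ntasks-per-node=40     # CPU Count per node
#SBATCH --time=01:00:00          # Wall Clock Limit (hh:mm:ss)
#SBATCH --output=my_job.%j.out   # Standard Output File (%j expands to the job ID)
#SBATCH --error=my_job.%j.err    # Standard Error File
#SBATCH -p short-40core          # Select Queue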
Please note that when you submit a job with Slurm, all of your environment variables are copied into the job's environment by default, including all of the modules you had loaded on the login node when you submitted the job. You can disable this by passing the --export=NONE flag to sbatch. This makes Slurm behave the same way as Torque: environment variables are loaded only from your ~/.bashrc and ~/.bash_profile, not from your current environment.
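For example, to submit the test.slurm script shown later in this FAQ with a clean environment (so that only your ~/.bashrc and ~/.bash_profile settings are loaded, and any modules the job needs must be loaded inside the script itself):
sbatch --export=NONE test.slurm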
Slurm scripts will also run in the directory from which you submit the job. You can adjust this with the -D <directory> or --chdir=<directory> flag with sbatch. For example, you could add
#SBATCH -D $HOME
to your Slurm script to replicate the behavior of a PBS Torque script, which by default runs in your home directory.
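The same thing can be done on the command line instead of in the script, for example:
sbatch -D $HOME test.slurm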
Submitting interactive jobs with Slurm
You may use the following to submit an interactive job:
srun -J [job_name] -N 1 -p [queue_name] --ntasks-per-node=28 --pty bash
This will start an interactive job using a single node and 28 CPUs per node. The key flag to create an interactive job is --pty bash. This will open up a bash session on the compute node, allowing you to issue commands interactively.
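For example, to request an interactive session on one of the 40-core nodes described in the next section, the same pattern can be used with the short-40core queue (the job name is just a placeholder):
srun -J my_interactive_job -N 1 -p short-40core --ntasks-per-node=40 --pty bash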
Example Slurm job script for 40-core queues
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=2
#SBATCH --time=05:00
#SBATCH -p short-40core

# In this example we're using the Intel oneAPI compiler, MPI implementation, and math kernel library
module load intel/oneAPI/2022.2
module load compiler/latest mpi/latest mkl/latest

cd /gpfs/projects/samples/intel_mpi_hello/
mpiicc mpi_hello.c -o intel_mpi_hello
mpirun ./intel_mpi_hello
This job will utilize 2 nodes, with 40 CPUs per node, for 5 minutes in the short-40core queue to compile and run the intel_mpi_hello MPI example.
If we named this script "test.slurm", we could submit the job using the following command:
sbatch test.slurm
Example Slurm job script for GPU queues
#!/bin/bash
#
#SBATCH --job-name=test-gpu
#SBATCH --output=res.txt
#SBATCH --ntasks-per-node=28
#SBATCH --nodes=1
#SBATCH --time=05:00
#SBATCH -p gpu

module load anaconda/3
module load cuda91/toolkit/9.1
module load cudnn/6.0

source activate tensorflow1.4

cd /gpfs/projects/samples/tensorflow
python tensor_hello3.py
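This job requests a single node with 28 CPUs for 5 minutes in the gpu queue, loads the Anaconda, CUDA, and cuDNN modules, activates a TensorFlow conda environment, and runs the tensor_hello3.py example. As before, if we saved this script as "test-gpu.slurm" (an example name), we could submit it with:
sbatch test-gpu.slurm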
The documentation for Slurm can be found here.