Writing Job Scripts

Script Structure

A well-structured SBATCH script follows a consistent layout that makes it readable, maintainable, and less prone to errors. Every SBATCH script should follow this basic structure:

1. Shebang Line: Always start with the interpreter directive

2. SBATCH Directives: Resource requirements and job configuration

3. Environment Setup: Module loading and variable definitions

4. Job Execution: The actual commands to run

Complete Script Template

1. Shebang Line

#!/bin/bash

2. SBATCH Directives (Resource Requirements)

# Job identification
#SBATCH --job-name=my_job
#SBATCH --output=results_%j.out
#SBATCH --error=results_%j.err

# Resource allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --mem=64GB
#SBATCH --time=02:00:00

# Queue selection
#SBATCH -p short-40core

3. Environment Setup

# Load required modules
module purge
module load intel/oneAPI/2022.2
module load compiler/latest
module load mpi/latest

# Set environment variables
export OMP_NUM_THREADS=1
export I_MPI_PIN_DOMAIN=omp

# Display job info
echo "Job ID: $SLURM_JOBID"
echo "Running on nodes: $SLURM_JOB_NODELIST"
echo "Number of tasks: $SLURM_NTASKS"
echo "Start time: $(date)"

4. Job Execution

# Change to working directory
cd $SLURM_SUBMIT_DIR

# Compile application (if needed)
mpiicc -O3 -o my_app my_source.c

# Run the application
echo "Starting computation..."
mpirun ./my_app input.dat > computation.log

# Post-processing (if needed)
echo "Job completed at: $(date)"

SBATCH Directive Best Practices

Group Related Directives

Organize directives logically with comments:

# Job identification
#SBATCH --job-name=protein_folding
#SBATCH --output=folding_%j.out
#SBATCH --error=folding_%j.err

# Resource requirements
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --mem-per-cpu=2GB
#SBATCH --time=12:00:00

# Job placement
#SBATCH -p long-40core

Use Descriptive Names

Make output files easy to identify:

# Good: descriptive filenames
#SBATCH --output=simulation_%x_%j.out
#SBATCH --error=simulation_%x_%j.err

# %x = job name, %j = job ID
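
For example, if the job name were protein_folding (as in the earlier example) and Slurm assigned job ID 123456 (illustrative), the patterns above would expand to:

# simulation_%x_%j.out  ->  simulation_protein_folding_123456.out
# simulation_%x_%j.err  ->  simulation_protein_folding_123456.err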

Include All Required Resources

Be explicit about your needs:

# Specify all resources clearly
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --mem=120GB
#SBATCH --time=06:30:00
# GPU jobs also need a generic resource request
#SBATCH --gres=gpu:2
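
As a quick sanity check on these numbers: --nodes=2 with --ntasks-per-node=28 gives 2 x 28 = 56 tasks in total, and because --mem requests memory per node, --mem=120GB allows the job to use up to 240GB across the two nodes.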

Environment Setup Guidelines

Always Purge Modules First

module purge
module load specific/version/needed

This ensures a clean environment and prevents module conflicts.
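
A failed or misnamed module load does not always stop the script, which can lead to confusing errors much later. One defensive option is to verify that an expected command is on the PATH after the load lines; a minimal sketch, using the MPI compiler wrapper from the template above:

# Confirm the toolchain is actually available before continuing
if ! command -v mpiicc >/dev/null 2>&1; then
    echo "ERROR: mpiicc not found after module load" >&2
    exit 1
fi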

Set Important Variables

# Threading control (default to 1 thread if --cpus-per-task was not set)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# MPI settings
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN_PROCESSOR_LIST=allcores

# Application-specific variables (create the scratch directory before use)
export TMPDIR=/tmp/$USER/$SLURM_JOBID
mkdir -p "$TMPDIR"
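
Node-local scratch space like the TMPDIR above is not always cleaned up automatically when the job ends. A minimal sketch of removing it on exit with a bash trap, assuming TMPDIR was created as shown:

# Delete the scratch directory whether the script finishes or fails
trap 'rm -rf "$TMPDIR"' EXIT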

Add Job Information

# Print useful job information
echo "========================================="
echo "Job ID: $SLURM_JOBID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Nodes: $SLURM_JOB_NODELIST"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"
echo "Working directory: $(pwd)"
echo "Start time: $(date)"
echo "========================================="

Job Execution Best Practices

Change to Submit Directory

cd $SLURM_SUBMIT_DIR

This ensures your job runs from the directory where you submitted it.
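
If that directory is unavailable on the compute node, a plain cd prints an error but the script keeps running from the wrong location. A slightly more defensive sketch:

cd "$SLURM_SUBMIT_DIR" || { echo "ERROR: cannot cd to $SLURM_SUBMIT_DIR" >&2; exit 1; }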

Add Error Checking

# Exit on any error
set -e

# Check if input files exist
if [ ! -f "input.dat" ]; then
    echo "ERROR: input.dat not found"
    exit 1
fi

# Run the application with explicit error checking
# (note: with "set -e" active, an unguarded failing command already
# aborts the script, so test the command directly rather than $?)
if ! mpirun ./my_app input.dat; then
    echo "ERROR: Application failed"
    exit 1
fi

Include Timing and Logging

# Time the main computation
echo "Starting main computation at: $(date)"
start_time=$(date +%s)

mpirun ./my_app input.dat

end_time=$(date +%s)
runtime=$((end_time - start_time))
echo "Computation completed in $runtime seconds"

Common Directive Reference

Directive             Purpose                            Example
-p, --partition       Specify queue/partition            #SBATCH -p short-40core
-N, --nodes           Number of nodes                    #SBATCH -N 2
-n, --ntasks          Total number of tasks              #SBATCH -n 80
--ntasks-per-node     Tasks per node                     #SBATCH --ntasks-per-node=40
-c, --cpus-per-task   CPUs per task (for threading)      #SBATCH -c 40
--mem                 Memory per node                    #SBATCH --mem=64GB
--mem-per-cpu         Memory per CPU                     #SBATCH --mem-per-cpu=2GB
-t, --time            Wall time limit                    #SBATCH -t 02:30:00
-J, --job-name        Job name                           #SBATCH -J my_job
-o, --output          Standard output file               #SBATCH -o job_%j.out
-e, --error           Standard error file                #SBATCH -e job_%j.err
--gres                Generic resources (GPUs)           #SBATCH --gres=gpu:2
--array               Job array indices (example below)  #SBATCH --array=1-100
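
The --array directive above turns one script into many similar tasks, each of which receives its index in $SLURM_ARRAY_TASK_ID. A minimal sketch of an array job that processes one input file per index (file names and resources are illustrative):

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --output=array_%A_%a.out
#SBATCH --array=1-100
#SBATCH -p short-40core
#SBATCH --time=01:00:00

# %A = array job ID, %a = array task index
cd $SLURM_SUBMIT_DIR

# Each task processes its own input file, e.g. input_1.dat ... input_100.dat
./my_app "input_${SLURM_ARRAY_TASK_ID}.dat" > "output_${SLURM_ARRAY_TASK_ID}.log"
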
Pro Tip: Keep a template script for each type of job you commonly run. This saves time and reduces errors when submitting new jobs.
Remember: Always test your scripts with small resource requests first. Use interactive sessions to debug before submitting large batch jobs.