HPC Cheat Sheet

Essential commands and workflows for working with SeaWulf and HPC systems.

Most Common Commands

Command                     Purpose
sbatch job.sh               Submit a job script
squeue -u $USER             Check your jobs
scancel JOBID               Cancel a job
sinfo                       Check node/partition status
module load SOFTWARE        Load software module
scontrol show job JOBID     View job details

Job Submission Methods

sbatch (Batch Jobs)

Submit a script to run in the background. Best for production jobs.
sbatch job_script.sh
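
When a workflow needs the job ID right after submission (for example, to pass it to squeue or scancel), it can be captured from sbatch's output. A minimal sketch, assuming bash; job_id_from_sbatch is a hypothetical helper name, and the sample strings stand in for real sbatch output.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch output.
# Handles the default message ("Submitted batch job 123456") and the
# --parsable form ("123456" or "123456;clustername").
job_id_from_sbatch() {
    local out=$1
    out=${out##* }      # keep the last word of the default message
    out=${out%%;*}      # drop a ";clustername" suffix from --parsable output
    printf '%s\n' "$out"
}

job_id_from_sbatch "Submitted batch job 123456"   # prints 123456
job_id_from_sbatch "123456;clustername"           # prints 123456
```

On the cluster this would typically be used as jobid=$(job_id_from_sbatch "$(sbatch job_script.sh)").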

srun (Direct Execution)

Run a command directly with the specified resources. srun blocks until the resources are granted and the command completes.
srun --partition=short-40core --time=01:00:00 --ntasks=1 my_program

salloc (Interactive Session)

Allocate resources for interactive work. Once the allocation is granted, run commands inside it with srun.
salloc --partition=short-40core --time=02:00:00 --ntasks=4

Job Script Template

Standard template with common directives. %j expands to the job ID.
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=results_%j.txt
#SBATCH --error=errors_%j.txt
#SBATCH --time=02:00:00
#SBATCH --partition=short-40core
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@stonybrook.edu

module load intel-stack
mpirun ./executable
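
To see which files a job will write, the %j substitution in the template above can be mimicked locally. A minimal sketch, assuming bash; expand_job_pattern is a hypothetical helper name, and only the %j placeholder is handled.

```shell
#!/bin/bash
# Predict the file name Slurm will write for a given --output/--error
# pattern. Only the %j (job ID) placeholder is handled here.
expand_job_pattern() {
    local pattern=$1 job_id=$2
    printf '%s\n' "${pattern//%j/$job_id}"
}

expand_job_pattern "results_%j.txt" 123456   # prints results_123456.txt
expand_job_pattern "errors_%j.txt" 123456    # prints errors_123456.txt
```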

Partition Quick Reference

Partition       Use Case                 Time Limit
short-40core    Testing, quick jobs      Short
long-40core     Extended computations    Long
gpu             GPU-accelerated work     Varies
hbm             Memory-intensive jobs    Varies
shared          Partial node use         Varies

Job Management

Command              Function             Example
squeue               View job queue       squeue -u $USER
scancel              Cancel jobs          scancel 123456
scontrol show job    Job details          scontrol show job 123456
sacct                Job history          sacct -j 123456
sstat                Running job stats    sstat -j 123456

Useful Queue Commands

# Custom formatted view of your jobs
squeue -u $USER --format="%.10i %.12j %.8T %.10M %.6D %R"

# Cancel all pending jobs
scancel -u $USER --state=pending

# Job history since a given date
sacct --starttime=2024-01-01 --format=JobID,JobName,State,ExitCode
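
Job history is easier to scan when summarized by state. A minimal sketch, assuming bash and awk; count_job_states is a hypothetical helper name, and the sample input is made up to stand in for real sacct output.

```shell
#!/bin/bash
# Count jobs per final state from sacct output on stdin.
# Expects "JobID|State" lines, as printed by:
#   sacct -X --noheader --parsable2 --format=JobID,State
count_job_states() {
    # Keep only the first word of the state ("CANCELLED by 12345" -> "CANCELLED").
    awk -F'|' 'NF >= 2 { split($2, s, " "); n[s[1]]++ }
               END { for (k in n) print k, n[k] }' | sort
}

# Made-up sample input standing in for real sacct output.
count_job_states <<'EOF'
1001|COMPLETED
1002|FAILED
1003|COMPLETED
1004|CANCELLED by 12345
EOF
```

On the cluster, pipe the sacct command from the comment into count_job_states, adding --starttime as needed.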

System Information

Command                    Information              Example
sinfo                      Partition/node status    sinfo -p short-40core
scontrol show nodes        Node details             scontrol show nodes compute-1-1
scontrol show partition    Partition details        scontrol show partition short-40core
sshare                     Fair share info          sshare -u $USER

Node Status

# List all nodes with details
sinfo -N -l

# Custom format: nodes, cores, memory, features, generic resources (such as GPUs), state
sinfo --format="%.15N %.10c %.10m %.25f %.10G %.6t"
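
A quick count of nodes in a given state helps when deciding where to submit. A minimal sketch, assuming bash and awk; nodes_in_state is a hypothetical helper name, and the sample input is made up to stand in for real sinfo output.

```shell
#!/bin/bash
# Sum the node counts for one state from "STATE COUNT" lines, as printed by:
#   sinfo --noheader -p short-40core --format="%t %D"
# A trailing flag on the state (for example "idle*") is ignored.
nodes_in_state() {
    local want=$1
    awk -v want="$want" '{ st = $1; sub(/[^a-z]+$/, "", st) }
                         st == want { total += $2 }
                         END { print total + 0 }'
}

# Made-up sample input standing in for real sinfo output.
nodes_in_state idle <<'EOF'
alloc 30
idle 4
idle* 1
mix 5
EOF
```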

File Transfer

SCP (Simple Copy)

# Upload a file to your home directory
scp local_file.txt username@milan.seawulf.stonybrook.edu:~/

# Upload a directory
scp -r local_directory/ username@milan.seawulf.stonybrook.edu:~/

# Download a file
scp username@milan.seawulf.stonybrook.edu:~/remote_file.txt ./

Rsync (Recommended for Large Transfers)

# Sync directories with compression
rsync -avz local_directory/ username@milan.seawulf.stonybrook.edu:~/remote_directory/

# Download with progress
rsync -avz --progress username@milan.seawulf.stonybrook.edu:~/data/ ./local_data/

Rsync flags: -a (archive), -v (verbose), -z (compress), --progress (show progress), --dry-run (test first)
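
The --dry-run flag is most useful as a preview step before the real transfer. A minimal sketch, assuming bash; sync_preview and sync_apply are hypothetical helper names that wrap the flags listed above.

```shell
#!/bin/bash
# Show what rsync would transfer without copying anything.
sync_preview() {
    rsync -avz --dry-run "$1" "$2"
}

# Perform the transfer for real, with progress output.
sync_apply() {
    rsync -avz --progress "$1" "$2"
}
```

For example, run sync_preview local_directory/ username@milan.seawulf.stonybrook.edu:~/remote_directory/ first, check the file list, then repeat the call with sync_apply.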

Environment Modules

Command          Purpose                   Example
module avail     List available modules    module avail python
module load      Load module               module load python/3.9.7
module unload    Unload module             module unload python
module list      Show loaded modules       module list
module purge     Unload all modules        module purge
module show      Module information        module show python/3.9.7

Common Workflows

# Load compiler and MPI
module load gcc/9.3.0 openmpi/4.1.0

# Load software
module load python/3.9.7

Storage and Disk Usage

Check Usage

# Usage of the file system that holds your home directory
df -h "$HOME"

# Usage of the file system that holds your scratch space
df -h /gpfs/scratch/$USER

# Your quota information
myquota

File Management

# Delete log files older than 30 days (permanent; files only)
find "$HOME" -type f -name "*.log" -mtime +30 -delete

# Create a compressed archive
tar -czf archive.tar.gz directory/

# Extract an archive
tar -xzf archive.tar.gz
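
Because find -delete is irreversible, it helps to list the matching files before removing them. A minimal sketch, assuming bash and GNU find; list_old_logs and delete_old_logs are hypothetical helper names that use the same match criteria.

```shell
#!/bin/bash
# Print the .log files under DIR last modified more than DAYS days ago.
# Nothing is deleted; review the list first.
list_old_logs() {
    local dir=$1 days=${2:-30}
    find "$dir" -type f -name "*.log" -mtime +"$days" -print
}

# Delete the same set of files once the list looks right.
delete_old_logs() {
    local dir=$1 days=${2:-30}
    find "$dir" -type f -name "*.log" -mtime +"$days" -delete
}
```

For example, run list_old_logs "$HOME" 30, check the output, then run delete_old_logs "$HOME" 30.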

Performance Monitoring

During Execution

# Monitor a running job's resources
# ($SLURM_JOB_ID is set only inside the job; elsewhere, use the numeric job ID)
sstat -j $SLURM_JOB_ID --format=AveCPU,AvePages,AveRSS,MaxRSS

# View your processes
top -u $USER

After Completion

# Detailed resource usage
sacct -j 123456 --format=JobID,MaxRSS,AveRSS,MaxVMSize,AveCPU,Elapsed,State

# Job efficiency report
seff 123456
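
Where seff is not installed, a rough CPU efficiency figure can be worked out from sacct's TotalCPU, Elapsed, and AllocCPUS fields. A minimal sketch, assuming bash and awk; to_seconds and cpu_efficiency are hypothetical helper names, and the formula (CPU time used divided by elapsed time times allocated cores) is a common convention rather than a guaranteed match for seff's output.

```shell
#!/bin/bash
# Convert a Slurm time string ([D-]HH:MM:SS or MM:SS.mmm) to whole seconds.
to_seconds() {
    local t=$1 days=0
    if [[ $t == *-* ]]; then
        days=${t%%-*}    # leading day count
        t=${t#*-}        # remaining HH:MM:SS part
    fi
    awk -F: -v d="$days" '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i
                            printf "%d\n", s + d * 86400 }' <<<"$t"
}

# CPU efficiency in percent: CPU time used / (elapsed time * allocated cores).
# Arguments: TotalCPU Elapsed AllocCPUS
cpu_efficiency() {
    local used elapsed
    used=$(to_seconds "$1")
    elapsed=$(to_seconds "$2")
    awk -v u="$used" -v e="$elapsed" -v n="$3" 'BEGIN {
        if (e * n == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * u / (e * n)
    }'
}

cpu_efficiency 20:00:00 01:00:00 40   # prints 50.0
```

The three arguments can be read from sacct -X -j 123456 --format=TotalCPU,Elapsed,AllocCPUS.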

Common Troubleshooting

Job Pending (PD)?

  • Check resource availability: sinfo -p your_partition
  • Review job requirements: scontrol show job JOBID
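
The reason Slurm reports for a pending job usually points to the next step. A minimal sketch, assuming bash; explain_pending_reason is a hypothetical helper name, and the list covers only a few common reason strings.

```shell
#!/bin/bash
# Translate a few common Slurm pending reasons into a next step.
# The reason string comes from: squeue -j JOBID --noheader --format="%r"
explain_pending_reason() {
    case $1 in
        Resources)          echo "Waiting for nodes to free up; check sinfo for the partition." ;;
        Priority)           echo "Other jobs are ahead in the queue; check sshare for fair share." ;;
        Dependency)         echo "Waiting for a job it depends on to finish." ;;
        PartitionTimeLimit) echo "Requested --time exceeds the partition limit; lower it or change partition." ;;
        JobHeldUser)        echo "Job is held; release it with scontrol release JOBID." ;;
        ReqNodeNotAvail*)   echo "A requested node is down or reserved; check sinfo -N -l." ;;
        *)                  echo "Unrecognized reason '$1'; see scontrol show job JOBID." ;;
    esac
}

explain_pending_reason Resources
```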

Error Diagnostics

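# slurm-JOBID.out is the default output file. If the script sets --output or
# --error (as the job script template does), check those files instead.

# Show the last 20 lines of output
tail -n 20 slurm-123456.out

# Search for errors in the output
grep -i error slurm-123456.out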
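
A single grep for "error" can miss failures that are worded differently. A minimal sketch, assuming bash and grep; scan_job_log is a hypothetical helper name, the keyword list is an assumption to adjust per application, and the sample log lines are made up.

```shell
#!/bin/bash
# Print lines that match common failure keywords, with line numbers.
# Returns non-zero when nothing matches (grep's usual behavior).
scan_job_log() {
    grep -n -i -E 'error|fail|killed|out of memory|time limit' "$1"
}

# Made-up sample log standing in for a real job output file.
log=$(mktemp)
printf '%s\n' "step 1 ok" "ERROR: input file missing" > "$log"
scan_job_log "$log"   # prints 2:ERROR: input file missing
rm -f "$log"
```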

Connection Issues

# Verbose SSH output for debugging
ssh -v username@milan.seawulf.stonybrook.edu