Quick Navigation: Jump to Monitoring Jobs | Modifying Jobs | Canceling Jobs | Checking Efficiency | Real-Time Tools | Optimization
This guide covers everything you need to monitor, manage, and optimize your jobs on SeaWulf. Whether you're checking job status, monitoring resource usage, or canceling stuck jobs, you'll find the commands and tools you need here.
Monitoring Jobs
Start here to check the status of your jobs and see what's running, pending, or completed.
Common Monitoring Commands
| Command | Purpose | Example |
|---|---|---|
| squeue --user=$USER | List your running and pending jobs | squeue --user=sam123 |
| scontrol show job <jobid> | Detailed job information | scontrol show job 123456 |
| sacct -j <jobid> -l | Show detailed stats about a completed or running job | sacct -j 123456 -l |
| seff <jobid> | See efficiency (CPU, memory) of completed jobs | seff 123456 |
Modifying Jobs
Need to adjust a job that's already submitted? You can modify certain parameters before or while jobs are running.
Hold and Release Jobs
Place a job on hold to prevent it from running:
scontrol hold 123456
Release a held job back to the queue:
scontrol release 123456
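Update Job Parameters
Some parameters of a submitted job can also be changed in place with scontrol update. The examples below are a sketch: the job ID and values are illustrative, regular users can typically decrease a time limit but not increase it, and which fields are editable depends on whether the job has already started.
Reduce a job's time limit:
scontrol update JobId=123456 TimeLimit=02:00:00
Rename a pending job:
scontrol update JobId=123456 JobName=retry_run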
Canceling Jobs
If a job can't be modified or needs to be stopped, you can cancel it and resubmit with corrected settings.
Cancel a Specific Job
scancel 123456
Cancels the job with ID 123456.
Cancel All Your Jobs
scancel --user=$USER
Cancels all jobs belonging to your user.
Cancel Jobs by Name
scancel --name=my_job
Cancels all jobs with the specified job name.
Cancel Jobs by State
scancel --user=$USER --state=PENDING
Cancels all your pending (queued) jobs.
Checking Job Efficiency and Resource Usage
Once you understand your job's basic status, dive deeper into how efficiently it's using resources.
Understanding CPU Load
CPU load represents the average number of processes trying to use the CPU over a specified time interval. For example, on a fully utilized 40-core node, you would expect the load to be around 40. A significantly lower load might indicate underutilization of resources, while a much higher load likely points to oversubscription, potentially degrading code performance.
The CPU load statistic is typically reported as three values: the average number of processes either running or waiting in the run queue over the past 1, 5, and 15 minutes, respectively. By monitoring CPU load and adjusting your jobs accordingly, you can maintain good efficiency and avoid performance bottlenecks.
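You can check these values yourself on a node you're logged into: uptime prints the 1-, 5-, and 15-minute load averages as its last three numbers (cat /proc/loadavg reports the same figures). On a fully utilized 40-core node, you would expect all three to hover near 40.
uptime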
Check Job Efficiency
seff 123456
Shows efficiency metrics for completed jobs, including CPU utilization and memory usage. This helps you determine if you're requesting the right amount of resources.
Job Accounting Information
sacct -j 123456 -l
Displays detailed accounting data for completed or running jobs, including CPU time, memory usage, and job states.
Format specific fields:
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime
Customize the output to show only the fields you need.
get_resource_usage.py Script
SeaWulf provides a built-in script to summarize CPU and memory usage per job:
/gpfs/software/hpc_tools/get_resource_usage.py
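The invocation below is a sketch: it assumes the script runs directly with Python and needs no arguments. If python3 is not on your path, load a Python module first, and check the script's help output for any available options.
python3 /gpfs/software/hpc_tools/get_resource_usage.py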
In one example run, the script reported a single Intel Skylake 40-core node allocated, utilizing less than 26.5% of the CPU with a CPU load of 10.58 and memory usage at 14.2%. This suggests inefficient use of the compute node.
Real-Time Resource Monitoring
For active debugging and performance analysis, connect directly to compute nodes to monitor resources in real-time.
Step 1: Find Your Node
squeue --user=$USER
Note the node name from the NODELIST column (e.g., dn045).
Example output:
JOBID   PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
485898  short-40c  bash  ssperrot  R   0:23  1      dn045
Step 2: SSH to the Node
ssh dn045
SSH-ing directly into the nodes where your jobs are running gives you a live view of resource usage, letting you spot potential bottlenecks and address them promptly to improve efficiency.
Step 3: Monitor Resources
glances (Recommended)
glances is, in most cases, the best tool for real-time resource monitoring on SeaWulf. It offers a detailed, comprehensive view of system resources in a single interface.
After SSH-ing into your compute node:
module load glances
glances

Using glances, you can easily spot inefficiencies in resource allocation such as certain processes monopolizing CPU resources or excessive memory usage, indicating potential optimization opportunities. Additionally, glances offers built-in plugins that provide additional functionalities such as network and disk I/O monitoring.
Read the glances documentation for more information.
htop
The htop command has been a reliable tool for monitoring real-time system resource usage on SeaWulf for many years. It offers a comprehensive display of CPU, memory, and process data.
After SSH-ing into your compute node:
module load htop
htop

Identifying Inefficiencies: Using htop, you can readily identify inefficiencies in resource utilization. For example, if you observe a program like Amber's pmemd command occupying only a single core on a multi-core node, this highlights inefficient utilization.
Observing Efficient Usage: Conversely, htop will help you observe programs efficiently employing MPI to distribute tasks across 40 concurrent processes, fully utilizing all available CPU cores on the node.
Read the htop documentation for more information.
top
If you only need basic monitoring, the top command serves as a fundamental tool to monitor real-time system resource usage. It presents a clear overview of CPU, memory, and process data without requiring additional modules.
After SSH-ing into your compute node:
top

Look for your processes and check CPU percentage and resident memory (RES column). Read the top documentation for more information.
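With any of these tools you can limit the display to your own processes, which makes your job easier to pick out on a busy node:
top -u $USER
htop -u $USER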
Optimizing Resource Usage
Use these monitoring tools to make sure your jobs are efficiently utilizing the resources you've requested. If you find discrepancies or inefficiencies, take action to improve your resource usage by:
- Refining your job configurations
- Adjusting resource requests based on observed usage (see the sketch after this list)
- Optimizing your code to better match the allocated resources
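As a sketch of what such an adjustment might look like: suppose seff reports that a job kept only 10 of its 40 requested cores busy and used about 8 GB of memory. You could resubmit with requests closer to the observed usage (the script name and values below are illustrative, not a recommendation for any particular code):
sbatch --ntasks=10 --mem=10G --time=01:00:00 my_job.slurm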
Shared Queues
If you find that your job doesn't need all or most of the resources on a node, we encourage you to utilize the "shared" queues. These queues allow for more efficient resource allocation by enabling multiple jobs to run simultaneously on the same node, maximizing resource utilization.
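Shared partition names vary by node type, so list them before submitting. Assuming the shared queues include "shared" in their names, you can filter the partition list with:
sinfo | grep -i shared
Then request one with the -p (--partition) option in your job script or on the sbatch command line.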
Quick Reference for Optimization
- Monitor CPU load values - Should match core count for full utilization
- Check memory usage - Avoid over-requesting resources
- Consider shared queues - For jobs that don't need full node resources
- Adjust job scripts - Based on actual usage patterns from seff and monitoring tools
Need Help?
If you encounter any challenges or need guidance in enhancing your resource efficiency, our support team is here to assist you every step of the way.
