Job Management on SeaWulf

This guide covers everything you need to monitor, manage, and optimize your jobs on SeaWulf. Whether you're checking job status, monitoring resource usage, or canceling stuck jobs, you'll find the commands and tools you need here.

Monitoring Jobs

Start here to check the status of your jobs and see what's running, pending, or completed.

Common Monitoring Commands

Command                       Purpose                                               Example
squeue --user=$USER           List your running and pending jobs                    squeue --user=sam123
scontrol show job <jobid>     Show detailed information for a job                   scontrol show job 123456
sacct -j <jobid> -l           Show detailed stats for a running or completed job    sacct -j 123456 -l
seff <jobid>                  Report CPU and memory efficiency of completed jobs    seff 123456
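
If you want a continuously refreshing view of the queue, the standard Linux watch utility can re-run squeue at a fixed interval (30 seconds in this example):

watch -n 30 squeue --user=$USER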

Modifying Jobs

Need to adjust a job that's already submitted? You can modify certain parameters before or while jobs are running.
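
For example, Slurm's scontrol update command can change attributes of a job that has not yet started, such as its time limit. The job ID and new limit below are illustrative, and some attributes can only be changed by administrators:

scontrol update JobId=123456 TimeLimit=02:00:00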

Hold and Release Jobs

Place a job on hold to prevent it from running:

scontrol hold 123456

Release a held job back to the queue:

scontrol release 123456
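
You can confirm the hold took effect with squeue; a user-held job remains pending with a reason such as (JobHeldUser) in the NODELIST(REASON) column:

squeue --user=$USER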

Canceling Jobs

If a job can't be modified or needs to be stopped, you can cancel it and resubmit with corrected settings.

Cancel a Specific Job

scancel 123456

Cancels the job with ID 123456.

Cancel All Your Jobs

scancel --user=$USER

Cancels every job submitted under your username.

Cancel Jobs by Name

scancel --name=my_job

Cancels all jobs with the specified job name.

Cancel Jobs by State

scancel --user=$USER --state=PENDING

Cancels all your pending (queued) jobs.
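
These filters can be combined, and scancel's interactive flag prompts for confirmation before each cancellation, which is a useful safeguard:

scancel --interactive --user=$USER --state=PENDING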

Checking Job Efficiency and Resource Usage

Once you understand your job's basic status, dive deeper into how efficiently it's using resources.

Understanding CPU Load

CPU load represents the average number of processes trying to use the CPU over a given time interval. On a fully utilized 40-core node, for example, you would expect a load of around 40. A significantly lower load suggests underutilized resources, while a much higher load points to oversubscription, which can degrade performance.

The CPU load statistic is typically reported as three values: the average number of processes running or waiting in the run queue over the last 1, 5, and 15 minutes, respectively. By monitoring CPU load and adjusting your jobs accordingly, you can maintain efficiency and prevent performance bottlenecks.
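
You can view these three load averages directly on a node with the standard uptime command (the values in the comment are illustrative):

# Prints the 1-, 5-, and 15-minute load averages, e.g.
# "load average: 10.58, 10.41, 10.02" on a lightly used 40-core node
uptime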

Check Job Efficiency

seff 123456

Shows efficiency metrics for completed jobs, including CPU utilization and memory usage. This helps you determine if you're requesting the right amount of resources.
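
seff is most useful after a job finishes. For a job that is still running, Slurm's sstat command can report live usage per job step; the fields shown are standard sstat fields, and for batch jobs you may need to reference the batch step explicitly (e.g., 123456.batch):

sstat -j 123456 --format=JobID,AveCPU,MaxRSS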

Job Accounting Information

sacct -j 123456 -l

Displays detailed accounting data for completed or running jobs, including CPU time, memory usage, and job states.

Format specific fields:

sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime

Customize the output to show only the fields you need.
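
sacct can also summarize all of your recent jobs rather than a single job ID, for example everything you've run since a given date (the date below is illustrative):

sacct --user=$USER --starttime=2024-01-15 --format=JobID,JobName,State,Elapsed,MaxRSS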

get_resource_usage.py Script

SeaWulf provides a built-in script to summarize CPU and memory usage per job:

/gpfs/software/hpc_tools/get_resource_usage.py

For example, the script might report that one Intel Skylake 40-core node is allocated with a CPU load of 10.58, i.e., less than 26.5% CPU utilization (10.58 of 40 cores), and memory usage at 14.2%. This indicates inefficient use of the compute node.

Real-Time Resource Monitoring

For active debugging and performance analysis, connect directly to compute nodes to monitor resources in real-time.

Step 1: Find Your Node

squeue --user=$USER

Note the node name from the NODELIST column (e.g., dn045).

Example output:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
485898 short-40c     bash ssperrot  R       0:23      1 dn045

Step 2: SSH to the Node

ssh dn045

SSH-ing directly into the nodes where your jobs are running gives you a live view of resource allocation, so you can spot potential bottlenecks as they occur and address them promptly to improve efficiency.

Step 3: Monitor Resources

glances (Recommended)

glances is, in most cases, the best tool for real-time resource monitoring on SeaWulf, offering a detailed interface that covers CPU, memory, process, and other system data.

After SSH-ing into your compute node:

module load glances
glances

Using glances, you can easily spot inefficiencies such as individual processes monopolizing the CPU or excessive memory usage, both of which point to optimization opportunities. glances also includes built-in plugins for extras such as network and disk I/O monitoring.
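
If the default refresh rate is too fast or too slow for your debugging session, glances accepts a refresh interval in seconds:

glances -t 5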

Read the glances documentation for more information.

htop

The htop command has been a reliable tool for monitoring real-time system resource usage on SeaWulf for many years. It offers a comprehensive display of CPU, memory, and process data.

After SSH-ing into your compute node:

module load htop
htop

Identifying Inefficiencies: Using htop, you can readily identify inefficiencies in resource utilization. For example, if you observe a program like Amber's pmemd command occupying only a single core on a multi-core node, this highlights inefficient utilization.

Observing Efficient Usage: Conversely, htop will help you observe programs efficiently employing MPI to distribute tasks across 40 concurrent processes, fully utilizing all available CPU cores on the node.
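
To focus on your own processes, htop can filter by user:

htop -u $USER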

Read the htop documentation for more information.

top

If you only need basic monitoring, the top command serves as a fundamental tool to monitor real-time system resource usage. It presents a clear overview of CPU, memory, and process data without requiring additional modules.

After SSH-ing into your compute node:

top

Look for your processes (press u and enter your username to filter) and check the CPU percentage (%CPU) and resident memory (RES) columns. Read the top documentation for more information.

Optimizing Resource Usage

Use these monitoring tools to confirm that your jobs are efficiently using the resources you've requested. If you find discrepancies or inefficiencies, improve your resource usage by:

  • Refining your job configurations
  • Adjusting resource requests based on observed usage
  • Optimizing your code to better match the allocated resources

Shared Queues

If you find that your job doesn't need all or most of the resources on a node, we encourage you to use the "shared" queues. These queues let multiple jobs run simultaneously on the same node, making resource allocation more efficient, as sketched below.
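
As a sketch, a job script header for a shared queue might look like the following; the partition name here is hypothetical, so check sinfo for the actual shared partition names on SeaWulf:

#!/bin/bash
#SBATCH --partition=short-40core-shared   # hypothetical shared-queue name; verify with sinfo
#SBATCH --ntasks=4                        # request only the cores you actually need
#SBATCH --mem=8G                          # request only the memory you actually need
#SBATCH --time=01:00:00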

Quick Reference for Optimization

  • Monitor CPU load values - Should match core count for full utilization
  • Check memory usage - Avoid over-requesting resources
  • Consider shared queues - For jobs that don't need full node resources
  • Adjust job scripts - Based on actual usage patterns from seff and monitoring tools

Need Help?

If you encounter any challenges or need guidance in enhancing your resource efficiency, our support team is here to assist you every step of the way.

Submit a ticket to the HPC support site.