Note: Monitoring helps you track how efficiently your jobs use resources. Use it to optimize performance and avoid wasting compute time.
Quick Reference Summary
For quick checks:
- Use squeue -u <netid> to see your running jobs
- Use seff <jobid> to check efficiency of completed jobs
- Use /gpfs/software/hpc_tools/get_resource_usage.py for script-based summaries
For detailed real-time monitoring:
- Find your node with squeue -u <netid>
- SSH into the node: ssh <nodename>
- Use glances (most comprehensive), htop (detailed), or top (basic)
For optimization:
- Monitor CPU load values (should match core count for full utilization)
- Check memory usage to avoid over-requesting resources
- Consider shared queues for jobs that don't need full node resources
- Adjust job scripts based on actual usage patterns
Introduction
Efficient resource utilization is essential for the effective use of our computing infrastructure on SeaWulf. This guide outlines several tools and scripts available for tracking Slurm job resource consumption. By using these tools, you can ensure your jobs are running efficiently, thereby helping to conserve resources and improve overall system performance.
Understanding CPU Load
An important aspect of optimization is understanding CPU load, which represents the average number of processes trying to use the CPU over a specified time interval. For example, on a fully utilized 40-core node, you would expect the load to be around 40. A significantly lower load might indicate underutilization of resources, while a much higher load likely points to oversubscription, potentially degrading code performance.
The CPU load statistic is typically given as three values: the average number of processes either running or waiting in the run queue, computed over the last 1, 5, and 15 minutes respectively. By monitoring the load and adjusting your jobs accordingly, you can maintain optimal job efficiency and prevent performance bottlenecks, further improving resource utilization.
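If you want to check these load averages yourself, here is a minimal sketch using standard Linux commands, run after SSH-ing into the compute node (see the real-time monitoring section below):

```bash
# Show the 1-, 5-, and 15-minute load averages for this node
uptime

# The same three values (plus run-queue details) are available in /proc/loadavg
cat /proc/loadavg

# Number of cores on the node, for comparison against the load
nproc
```

On a fully utilized 40-core node, the first load value reported by uptime should hover around 40.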
Common Monitoring Commands
Before diving into detailed tools, here are quick commands to check job status and efficiency:
| Command | Purpose | Example |
|---|---|---|
| squeue -u <netid> | List your running and pending jobs | squeue -u sam123 |
| sacct -j <jobid> -l | Show detailed stats about a completed or running job | sacct -j 123456 -l |
| seff <jobid> | See efficiency (CPU, memory) of completed jobs | seff 123456 |
| ssh <node> | Log into allocated node to run real-time tools | ssh dn045 |
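For example, a typical check on a single job might look like the following; the NetID and job ID are the placeholder values from the table above:

```bash
# List your running and pending jobs
squeue -u sam123

# Detailed accounting for one job, while it runs or after it finishes
sacct -j 123456 -l

# CPU and memory efficiency summary once the job has completed
seff 123456
```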
Monitoring Tools
get_resource_usage.py
The get_resource_usage.py script is a tool designed to help users monitor the resource usage of their Slurm jobs on SeaWulf. It provides a concise summary of CPU and memory utilization, making it easier to identify inefficiencies and optimize resource usage.
Script Location:
/gpfs/software/hpc_tools/get_resource_usage.py
Basic Usage:
For most users, running the script without arguments will provide all necessary information:
/gpfs/software/hpc_tools/get_resource_usage.py
Command-Line Options:
The script comes with several options to tailor the output to your needs:
/gpfs/software/hpc_tools/get_resource_usage.py --help
Available options:
- -h, --help: Show help message and exit
- -u USER, --user USER: Only report usage for this user
- -l LOW, --low LOW: Only report nodes with % CPU usage lower than this value
- -e HIGH, --high HIGH: Only report nodes with % CPU usage higher than this value
- -n NODE, --node NODE: Only report usage on this node
- -j JOB, --job JOB: Only report usage for this job ID
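For example, the options above can be used to narrow the report; a few sketches using placeholder values:

```bash
# Report usage only for your own jobs
/gpfs/software/hpc_tools/get_resource_usage.py -u <netid>

# Report usage for a single job ID
/gpfs/software/hpc_tools/get_resource_usage.py -j 123456

# Report only nodes running below 50% CPU usage
/gpfs/software/hpc_tools/get_resource_usage.py -l 50
```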
Example Output:

In a typical example, you might see output showing one Intel Skylake 40-core node allocated, utilizing less than 26.5% of the CPU with a CPU load of 10.58, and memory usage at a meager 14.2%. This output would suggest an inefficient use of the compute node, indicating that resources could be better optimized.
Real-Time Monitoring via SSH
To see live CPU and memory usage, you need to identify your job's node with squeue and SSH into it. These tools provide thorough insights into CPU and memory usage, which is crucial for keeping an eye on performance and making necessary optimizations.
Accessing Your Compute Node:
squeue -u <netid>
This command displays a list of your active jobs (assuming the Slurm module is loaded). The output shows job information including the node name where your job is running:
JOBID   PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
485898  short-40c  bash  ssperrot  R   0:23  1      dn045
Once you identify the node, SSH into it:
ssh dn045
By directly accessing the nodes your jobs are currently running on via SSH, you can get instant updates on resource allocation, spot any potential bottlenecks, and address them promptly to improve efficiency. This lets you monitor usage dynamically, for example, checking RES (resident memory) in top.
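If you have exactly one running job on a single node, the two steps can be combined into one command; this sketch assumes the Slurm module is loaded and that squeue reports a single node name:

```bash
# Look up the node assigned to your running job and SSH straight to it
ssh $(squeue -u <netid> -h -o %N)
```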
glances
glances is, in most cases, the best tool for real-time system resource monitoring on SeaWulf. It offers a detailed interface with comprehensive system resource monitoring capabilities, making it an attractive choice for users who prefer straightforward yet powerful tools. It helps in the identification of resource-intensive processes and system bottlenecks, facilitating effective troubleshooting.
Using glances:
After SSH-ing into your compute node, load the module:
module load glances
Then execute the command:
glances
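A couple of commonly used options, shown as a sketch (available flags can vary between glances versions, so check glances --help on the node):

```bash
# Refresh the display every 5 seconds instead of the default interval
glances -t 5

# Show one line per CPU core rather than a single aggregate CPU summary
glances -1
```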

What to Look For:
Using the glances display, you can easily spot inefficiencies in resource allocation, such as a few processes monopolizing CPU resources or excessive memory usage, both of which point to optimization opportunities. When a job is configured well, you should instead see the load spread evenly across the cores you requested and memory usage sitting comfortably below what you asked for.
Additionally, glances offers built-in plugins that provide additional functionalities such as network and disk I/O monitoring, expanding its utility beyond basic system resource tracking. This versatility enhances its effectiveness in providing a comprehensive overview of CPU, memory, and process data, aiding in efficient system performance analysis.
Read the glances documentation for more information on the available features.
htop
The htop command has been a reliable tool for monitoring real-time system resource usage on SeaWulf for many years. It offers a comprehensive display of CPU, memory, and process data, providing a detailed insight into system performance.
Using htop:
After SSH-ing into your compute node, load the module:
module load htop
Then execute the command:
htop
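To focus on your own processes, htop can filter by user:

```bash
# Show only processes owned by your account
htop -u <netid>
```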

Identifying Inefficiencies:
Using htop, you can readily identify inefficiencies in resource utilization. For example, if you observe a program like Amber's pmemd command occupying only a single core on a multi-core node, this highlights inefficient utilization of the computing node, leading to sub-optimal performance.
Observing Efficient Usage:
Conversely, htop will also help you observe programs efficiently employing MPI to distribute tasks across 40 concurrent processes, fully utilizing all available CPU cores on the node for optimal performance.
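As a quick, non-interactive cross-check, you can also count your processes directly; this sketch uses pmemd as a stand-in for whatever program your job runs:

```bash
# List your processes along with the CPU core (PSR) each one is currently on
ps -u <netid> -o pid,psr,%cpu,comm

# Count how many copies of a given program are running (pmemd is just an example)
ps -u <netid> -o comm | grep -c pmemd
```

If the count matches the number of tasks you requested, the job is at least launching the expected number of processes.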
Read the htop documentation for more information on the available features.
top
If you only need a subset of the features offered by the previous tools, the top command serves as a fundamental tool for monitoring real-time system resource usage on SeaWulf. It presents a clear overview of CPU, memory, and process data, allowing users to detect anomalies or resource-intensive processes without requiring additional modules.
Using top:
After SSH-ing into your compute node, simply run:
top

This command requires no additional module loading and provides basic but essential monitoring capabilities.
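To focus on your own processes, top accepts a user filter, and a few standard interactive keys are worth knowing:

```bash
# Show only processes owned by your account
top -u <netid>
```

Once top is running, pressing 1 toggles a per-core CPU view, P sorts processes by CPU usage, and M sorts by memory usage, which makes it easy to check the RES (resident memory) column mentioned earlier.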
Read the top documentation for more information on the available features.
Optimizing Resource Usage
We hope you'll make use of these tools to ensure your jobs are efficiently utilizing the resources you've requested. If you find any discrepancies or inefficiencies, don't hesitate to take action to improve your resource usage. This could involve:
- Refining your job configurations
- Adjusting resource requests based on observed usage (see the example script below)
- Optimizing your code to better match the allocated resources
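For example, suppose seff and get_resource_usage.py consistently show a job keeping only about 12 cores busy and peaking well under the memory you requested. A revised job script might look like the following sketch; the partition name, core count, memory, and program names are illustrative placeholders, not recommendations for any particular workload:

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis      # illustrative job name
#SBATCH -p short-40core             # placeholder partition; use the queue you normally submit to
#SBATCH --nodes=1
#SBATCH --ntasks=12                 # request roughly the number of cores the job actually keeps busy
#SBATCH --mem=24G                   # request memory a little above the observed peak, not the whole node
#SBATCH --time=04:00:00

module load my_software             # placeholder for your application's module
srun ./my_program                   # placeholder for your actual command
```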
If you encounter any challenges or need guidance in enhancing your resource efficiency, our support team is here to assist you every step of the way.
Click here to submit a ticket to the HPC support site
Shared Queues
Lastly, if you find that your job doesn't need all or most of the resources on a node, we encourage you to use the "shared" queues. These queues allow multiple jobs to run simultaneously on the same node, making resource allocation more efficient and maximizing utilization. For more information on how to use the "shared" queues and optimize your job submissions, please refer to our FAQ article, where you'll find detailed guidance and instructions.
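As a sketch, an existing job script can be pointed at a shared partition at submission time; the partition name below is a placeholder (see the FAQ article for the shared queues actually available), and command-line options like these override the corresponding #SBATCH directives in the script:

```bash
# Submit an existing script to a shared partition, asking for only part of a node
sbatch -p <shared-partition> --ntasks=8 --mem=16G my_job.slurm
```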
