Monitoring Jobs
Monitoring helps you track how efficiently your jobs use resources. Use these tools to optimize performance and avoid wasting compute time.
Basic Job Status
List your jobs:
squeue --user=$USER
Shows all your running and pending jobs with their status, partition, and node allocation.
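If you need different columns or an estimated start time for pending jobs, squeue accepts standard SLURM options. The format codes below are stock SLURM and shown only as an illustration:
squeue --user=$USER --start
squeue --user=$USER --format="%.10i %.9P %.20j %.8T %.10M %R"
The first command shows scheduler-estimated start times for pending jobs; the second prints job ID, partition, name, state, elapsed time, and the node list or pending reason.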
Check specific job:
squeue --job 123456
Shows details for a specific job ID.
Detailed job information:
scontrol show job 123456
Displays comprehensive information including resource allocation, start time, and working directory.
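To pull out just a few fields, you can pipe the output through grep. The field labels below are standard SLURM output labels, not SeaWulf-specific:
scontrol show job 123456 | grep -E "RunTime|TimeLimit|NodeList"
This prints the lines containing the elapsed run time, the requested time limit, and the allocated node list.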
Job Efficiency and Statistics
Check Job Efficiency
seff 123456
Shows efficiency metrics for completed jobs, including CPU utilization and memory usage. This helps you determine if you're requesting the right amount of resources.
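The output looks roughly like the following (values are illustrative, not from a real job):
CPU Utilized: 03:48:00
CPU Efficiency: 47.50% of 08:00:00 core-walltime
Memory Utilized: 3.70 GB
Memory Efficiency: 23.12% of 16.00 GB
Low CPU or memory efficiency usually means the job requested more cores or memory than it needed.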
Job Accounting Information
sacct -j 123456 -l
Displays detailed accounting data for completed or running jobs, including CPU time, memory usage, and job states.
Format specific fields:
sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime
Customize the output to show only the fields you need.
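To list every field name sacct understands, run:
sacct --helpformat
Fields such as ReqMem, MaxRSS, ReqCPUS, and TotalCPU are particularly useful for sizing future requests.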
Real-Time Resource Monitoring
To see live CPU and memory usage, identify your job's node and SSH into it:
Step 1: Find Your Node
squeue --user=$USER
Note the node name from the NODELIST column (e.g., dn045).
Step 2: SSH to the Node
ssh dn045
Step 3: Monitor Resources
top
Or use other monitoring tools:
htop      # Interactive process viewer
glances   # System monitoring tool
Look for your processes and check CPU percentage and resident memory (the RES column in top).
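To narrow the view to your own processes only:
top -u $USER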
SeaWulf Resource Usage Script
SeaWulf provides a built-in script to summarize CPU and memory usage per job:
/gpfs/software/hpc_tools/get_resource_usage.py
Canceling Jobs
Cancel a Specific Job
scancel 123456
Cancels the job with ID 123456.
Cancel All Your Jobs
scancel --user=$USER
Cancels all jobs belonging to your account.
Cancel Jobs by Name
scancel --name=my_job
Cancels all jobs with the specified job name.
Cancel Jobs by State
scancel --user=$USER --state=PENDING
Cancels all your pending (queued) jobs.
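Filters can also be combined. For example, to clear only your pending jobs in a single partition (the partition name here is just an example):
scancel --user=$USER --partition=long-40core --state=PENDING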
Modifying Jobs
Hold and Release Jobs
Place a job on hold to prevent it from running:
scontrol hold 123456
Release a held job back to the queue:
scontrol release 123456
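If you want to hold every pending job at once, one option is to feed squeue output to scontrol. This is a sketch using standard SLURM and xargs options, not a SeaWulf-provided tool:
squeue --user=$USER --states=PENDING --noheader --format=%i | xargs -r -n1 scontrol hold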
Update Job Parameters
Some job parameters can be modified after submission using scontrol update:
Change time limit:
scontrol update JobId=123456 TimeLimit=04:00:00
Change job name:
scontrol update JobId=123456 JobName=new_name
Change partition:
scontrol update JobId=123456 Partition=long-40core
Common Monitoring Commands Summary
Command | Purpose
---|---
squeue --user=$USER | List your running and pending jobs
scontrol show job <jobid> | Detailed job information
sacct -j <jobid> -l | Accounting info for completed/running jobs
seff <jobid> | CPU and memory efficiency of completed jobs
ssh <node> then top | Real-time resource monitoring on compute node
scancel <jobid> | Cancel a specific job
scontrol hold <jobid> | Prevent job from running
scontrol release <jobid> | Allow held job to run
Best Practices
Monitor Regularly
Use squeue and sacct regularly to check job status and historical usage. This helps you understand queue times and resource availability.
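For a self-refreshing view from a login node, you can wrap squeue in the standard watch utility:
watch -n 60 squeue --user=$USER
This reruns the command every 60 seconds; press Ctrl+C to exit.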
Check Efficiency
After jobs complete, use seff to see if they efficiently used CPU and memory. Adjust future job requests based on actual usage patterns.
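One way to compare what a job requested with what it actually used is to pull the request and peak-usage fields from sacct (all standard field names):
sacct -j 123456 --format=JobID,ReqMem,ReqCPUS,MaxRSS,TotalCPU,Elapsed
ReqMem and ReqCPUS show what was requested; MaxRSS and TotalCPU show the peak memory and total CPU time actually consumed.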
Cancel Stuck Jobs
If a job is stuck in the queue with incorrect resources, cancel it and resubmit with adjusted parameters.
Use Real-Time Monitoring Sparingly
SSH monitoring is useful for debugging, but don't leave monitoring sessions open unnecessarily. Exit when you're done checking.
Adjust Resource Requests
Based on monitoring data, refine your resource requests to improve cluster efficiency and reduce queue times.
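For example, if seff shows low memory and time efficiency, you can trim the corresponding #SBATCH directives in your next submission script (the values below are purely illustrative):
#SBATCH --mem=12G
#SBATCH --time=02:00:00
Here --mem is lowered because peak usage (MaxRSS) stayed well under the previous request, and --time is tightened because the job consistently finished early.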
- Use squeue to quickly check if your jobs are running or queued
- Try seff after jobs complete to identify under- or over-utilization
- For live monitoring, SSH to the node and use htop or glances for better visualization
- The get_resource_usage.py script provides a convenient summary without SSH access
- If you frequently need to cancel multiple jobs, use filters with scancel rather than canceling one at a time