Job Management on SeaWulf

Monitoring Jobs

Monitoring helps you track how efficiently your jobs use resources. Use these tools to optimize performance and avoid wasting compute time.

Basic Job Status

List your jobs:

squeue --user=$USER

Shows all your running and pending jobs with their status, partition, and node allocation.
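Typical output looks like the following (the job IDs, usernames, partition, and node names are illustrative):

JOBID   PARTITION     NAME    USER   ST    TIME  NODES  NODELIST(REASON)
123456  short-40core  my_job  netid   R   12:34      2  dn[045-046]
123457  short-40core  my_job  netid  PD    0:00      4  (Resources)

The ST column shows the job state (R = running, PD = pending); pending jobs list the reason they are waiting in place of a node list.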

Check specific job:

squeue --job 123456

Shows details for a specific job ID.

Detailed job information:

scontrol show job 123456

Displays comprehensive information including resource allocation, start time, and working directory.
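The output is verbose. To pull out only the lines you care about, pipe it through grep (JobState, RunTime, and WorkDir are standard field names in scontrol output):

scontrol show job 123456 | grep -E 'JobState|RunTime|WorkDir'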

Job Efficiency and Statistics

Check Job Efficiency

seff 123456

Shows efficiency metrics for completed jobs, including CPU utilization and memory usage. This helps you determine if you're requesting the right amount of resources.
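The report looks roughly like this (all values below are made up for illustration):

Job ID: 123456
Cluster: seawulf
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 1-06:00:00
CPU Efficiency: 75.00% of 1-16:00:00 core-walltime
Job Wall-clock time: 01:00:00
Memory Utilized: 12.50 GB
Memory Efficiency: 7.81% of 160.00 GB

Low CPU or memory efficiency suggests you can request fewer cores or less memory for similar jobs in the future.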

Job Accounting Information

sacct -j 123456 -l

Displays detailed accounting data for completed or running jobs, including CPU time, memory usage, and job states.

Format specific fields:

sacct -j 123456 --format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime

Customize the output to show only the fields you need.
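If you reuse the same format string often, a small shell function saves typing (the name jobstats is arbitrary):

jobstats() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime
}

jobstats 123456    # same output as the sacct command above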

Real-Time Resource Monitoring

To see live CPU and memory usage, identify your job's node and SSH into it:

Step 1: Find Your Node

squeue --user=$USER

Note the node name from the NODELIST column (e.g., dn045).

Step 2: SSH to the Node

ssh dn045

Step 3: Monitor Resources

top

Or use other monitoring tools:

htop       # Interactive process viewer
glances    # System monitoring tool

Look for your processes and check CPU percentage and resident memory (RES column in top).
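To limit the display to your own processes, pass your username to top:

top -u $USER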

SeaWulf Resource Usage Script

SeaWulf provides a built-in script to summarize CPU and memory usage per job:

/gpfs/software/hpc_tools/get_resource_usage.py

Canceling Jobs

Cancel a Specific Job

scancel 123456

Cancels the job with ID 123456.

Cancel All Your Jobs

scancel --user=$USER

Cancels every job submitted under your username.

Cancel Jobs by Name

scancel --name=my_job

Cancels all jobs with the specified job name.

Cancel Jobs by State

scancel --user=$USER --state=PENDING

Cancels all your pending (queued) jobs.
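Filters can be combined. For example, to cancel only your pending jobs in a single partition (the partition name here is just an example):

scancel --user=$USER --state=PENDING --partition=short-40core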

Modifying Jobs

Hold and Release Jobs

Place a job on hold to prevent it from running:

scontrol hold 123456

Release a held job back to the queue:

scontrol release 123456
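To hold every job you currently have pending, one approach is to list your pending job IDs with squeue and pass them to scontrol one at a time (a sketch using standard squeue and xargs options):

squeue --user=$USER --state=PENDING --noheader --format=%i | xargs -n1 scontrol hold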

Update Job Parameters

Some job parameters can be modified after submission using scontrol update:

Change time limit:

scontrol update JobId=123456 TimeLimit=04:00:00

Change job name:

scontrol update JobId=123456 JobName=new_name

Change partition:

scontrol update JobId=123456 Partition=long-40core
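To confirm that a change took effect, inspect the job again and look for the updated field:

scontrol show job 123456 | grep -E 'JobName|TimeLimit|Partition'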

Important: Not all changes are permitted once a job has started running. Most modifications only work on pending jobs. If a modification fails, you may need to cancel and resubmit the job with corrected settings.

Common Monitoring Commands Summary

Command                       Purpose
squeue --user=$USER           List your running and pending jobs
scontrol show job <jobid>     Detailed job information
sacct -j <jobid> -l           Accounting info for completed/running jobs
seff <jobid>                  CPU and memory efficiency of completed jobs
ssh <node> then top           Real-time resource monitoring on a compute node
scancel <jobid>               Cancel a specific job
scontrol hold <jobid>         Prevent a job from running
scontrol release <jobid>      Allow a held job to run

Best Practices

Monitor Regularly

Use squeue and sacct regularly to check job status and historical usage. This helps you understand queue times and resource availability.
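For a continuously refreshing view while you wait, the standard watch utility can rerun squeue at a fixed interval (the 30-second interval here is arbitrary):

watch -n 30 squeue --user=$USER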

Check Efficiency

After jobs complete, use seff to see if they efficiently used CPU and memory. Adjust future job requests based on actual usage patterns.

Cancel Stuck Jobs

If a job is stuck in the queue with incorrect resources, cancel it and resubmit with adjusted parameters.

Use Real-Time Monitoring Sparingly

SSH monitoring is useful for debugging, but don't leave monitoring sessions open unnecessarily. Exit when you're done checking.

Adjust Resource Requests

Based on monitoring data, refine your resource requests to improve cluster efficiency and reduce queue times.

Tips:
  • Use squeue to quickly check if your jobs are running or queued
  • Try seff after jobs complete to identify under- or over-utilization
  • For live monitoring, SSH to the node and use htop or glances for better visualization
  • The get_resource_usage.py script provides a convenient summary without SSH access
  • If you frequently need to cancel multiple jobs, use filters with scancel rather than canceling one at a time