How do I use the GPU nodes on SeaWulf?

GPU Resources Available

SeaWulf has the following GPU resources:

  • 8 nodes with 4 Tesla K80 GPUs each.
  • 1 node with 2 Tesla P100 GPUs.
  • 1 node with 2 Tesla V100 GPUs.
  • 11 nodes with 4 NVIDIA A100 GPUs each, available via the Milan login nodes.

Note: There are no GPUs on the login nodes, as they are not meant for computational workloads. Running nvidia-smi on a login node will therefore produce an error.

 

Accessing GPU Nodes

To access the GPU nodes, submit a job to one of the GPU queues using the SLURM workload manager:

module load slurm/17.11.12
sbatch [...]
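
For example, a minimal sketch of a batch script for the gpu queue is shown below; the job name, output file, time limit, and application command are placeholders for your own:

#!/bin/bash
#
#SBATCH --job-name=gpu_test
#SBATCH --output=res.txt
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --ntasks-per-node=28
#SBATCH --time=02:00:00

module load cuda113/toolkit/11.3   # only needed if your code uses the CUDA toolkit (see below)
./my_gpu_program                   # placeholder for your GPU application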


You can open an interactive shell on a GPU node with the following command:

srun -J [job_name] -N 1 -p gpu --ntasks-per-node=28 --pty bash

 

Note: If you are using the A100 queues, you must request GPU allocations explicitly. For example, to request an interactive session with one GPU:

srun -J [job_name] -N 1 -p a100 --gpus=1 --pty bash

Similarly, in a SLURM job script, you must add the --gpus flag as follows:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH -p a100
#SBATCH --gpus=1
...

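The script is then submitted in the usual way; the file name here is just a placeholder:

sbatch a100_job.slurm

If your application can use more than one device, increase the --gpus value (for example, --gpus=2), up to the four A100s available per node.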
 

Using CUDA for GPU Acceleration

To take advantage of GPU acceleration with CUDA, load the appropriate modules and compile with NVCC:

module load cuda113/toolkit/11.3   # For K80, P100, and V100 nodes
# or
module load cuda120/toolkit/12.0   # For A100 nodes
nvcc INFILE -o OUTFILE

In the above, "INFILE" is meant to be an input file with code that is going to be compiled  and "OUTFILE" is the name of the binary that will be produced. 


For a sample CUDA program, see:

 /gpfs/projects/samples/cuda/test.cu
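
As an illustration only, you could copy this sample into your own directory and build and run it from an interactive session on an A100 node (the job name and output binary name are placeholders):

srun -J cuda_test -N 1 -p a100 --gpus=1 --pty bash   # interactive shell on an A100 node
module load cuda120/toolkit/12.0                     # CUDA toolkit for the A100 nodes
cp /gpfs/projects/samples/cuda/test.cu .             # copy the sample program
nvcc test.cu -o test                                 # compile with NVCC
./test                                               # run on the allocated GPU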

 

Monitoring GPU Usage

Monitoring GPU usage is crucial for optimizing performance and ensuring efficient resource allocation. Two commonly used tools are:

nvidia-smi

NVIDIA System Management Interface (nvidia-smi) is a command-line utility provided by NVIDIA. It provides real-time monitoring and management of NVIDIA GPU devices. With nvidia-smi, users can monitor GPU utilization, memory usage, temperature, power consumption, and other relevant metrics. It also allows users to control various GPU settings and configurations.
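
A few commonly used invocations, run from a shell on a GPU node (not a login node), are:

nvidia-smi        # one-time snapshot of the GPUs on the node
nvidia-smi -l 5   # refresh the snapshot every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw --format=csv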

Read the nvidia-smi documentation for more information.

nvtop

nvtop is a lightweight, interactive command-line utility for monitoring NVIDIA GPU processes. Similar to the popular system monitoring tool htop, nvtop provides a real-time overview of GPU utilization, memory usage, temperature, and GPU processes. It displays GPU utilization in a customizable, easy-to-read interface, allowing users to quickly identify bottlenecks and optimize GPU performance.

nvtop is available as a module for both the older NVIDIA GPUs (such as K80, P100, and V100) and the latest A100 GPUs. Load the module with the following command:

module load nvtop
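
Then simply start it from a shell on one of the GPU nodes; it displays the GPUs on the node you are logged into:

nvtop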

Read the nvtop documentation for more information.

 

GPU Queues

The GPU queues have the following attributes:

Queue        Default run time   Max run time   Max # of nodes
gpu          1 hour             8 hours        2
gpu-long     8 hours            48 hours       1
gpu-large    1 hour             8 hours        4
p100         1 hour             24 hours       1
v100         1 hour             24 hours       1
a100         1 hour             8 hours        2
a100-long    8 hours            48 hours       1
a100-large   1 hour             8 hours        4
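
For example, to run a longer job within the gpu-long queue's 48-hour limit, you can select the partition and wall time when submitting (the script name is a placeholder):

sbatch -p gpu-long -t 24:00:00 my_gpu_job.slurm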

 


Still Need Help? The best way to report your issue or make a request is by submitting a ticket.