GPU Resources Available
SeaWulf has the following GPU resources:
- 8 nodes containing 4 Tesla K80 GPUs each.
- 1 node with 2 Tesla P100 GPUs.
- 1 node with 2 Tesla V100 GPUs.
- 11 nodes each with 4 NVIDIA A100 GPUs, available via the Milan login nodes.
Note: The login nodes have no GPUs, as they are not meant for computational workloads. Running nvidia-smi on a login node will produce an error.
Accessing GPU Nodes
To access the GPU nodes, load the SLURM module and submit a job to one of the GPU queues:
module load slurm/17.11.12
sbatch [...]
You can open an interactive shell onto a GPU node with the following command:
srun -J [job_name] -N 1 -p gpu --ntasks-per-node=28 --pty bash
Note: If you are using the A100 queues, you must request GPU allocations explicitly. For example, to request an interactive session with one GPU:
srun -J [job_name] -N 1 -p a100 --gpus=1 --pty bash
Similarly, in a SLURM job script, you would have to add the '--gpus' flag as follows:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH -p a100
#SBATCH --gpus=1
...
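As a rough end-to-end illustration of the job script above, the sketch below writes a small A100 job script to disk and submits it. The file name, job name, 10-minute time limit, and the choice to run nvidia-smi as the job payload are assumptions made for illustration and are not part of the original example; only the '-p a100' and '--gpus=1' lines come from this article.

```shell
# Write a minimal A100 job script. File name, job name, and time limit
# are illustrative assumptions, not SeaWulf requirements.
cat > a100_test.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=a100_test
#SBATCH --output=a100_test_%j.txt
#SBATCH -p a100
#SBATCH --gpus=1
#SBATCH --time=00:10:00

# Report the GPU(s) that SLURM assigned to this job.
nvidia-smi
EOF

# Submit only if sbatch is on the PATH (i.e. the slurm module is loaded).
if command -v sbatch >/dev/null 2>&1; then
    sbatch a100_test.slurm
else
    echo "sbatch not found; load the slurm module first"
fi
```

The '%j' in the output file name is SLURM's placeholder for the job ID, so repeated submissions should not overwrite each other's output.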
Using CUDA for GPU Acceleration
To take advantage of GPU acceleration with CUDA, load the appropriate modules and compile with NVCC:
module load cuda113/toolkit/11.3 # For K80, P100, and V100 nodes
# or
module load cuda120/toolkit/12.0 # For A100 nodes
nvcc INFILE -o OUTFILE
In the above, "INFILE" is the CUDA source file to compile and "OUTFILE" is the name of the binary that nvcc will produce.
For a sample CUDA program, see:
/gpfs/projects/samples/cuda/test.cu
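The choice of toolkit module described above could be wrapped in a small helper, sketched below. The function name cuda_module_for is hypothetical, and the mapping assumes that the gpu, gpu-long, and gpu-large queues run on the K80 nodes, which this article implies but does not state outright.

```shell
# Hypothetical helper: print the CUDA toolkit module for a given queue name.
# Module names are taken from this article; the queue-to-GPU mapping for
# the gpu* queues (assumed to be K80 nodes) is an assumption.
cuda_module_for() {
    case "$1" in
        a100*)           echo "cuda120/toolkit/12.0" ;;
        gpu*|p100|v100)  echo "cuda113/toolkit/11.3" ;;
        *)               echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Example use on a GPU node (commented out because it needs the cluster
# environment):
#   module load "$(cuda_module_for a100)"
#   cp /gpfs/projects/samples/cuda/test.cu .
#   nvcc test.cu -o test
#   ./test
```

Keeping the mapping in one place means a job script only needs to change the queue name when moving between the older GPU nodes and the A100 nodes.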
Monitoring GPU Usage
Monitoring GPU usage is crucial for optimizing performance and ensuring efficient resource allocation. Two commonly used tools for monitoring GPU usage are:
nvidia-smi
NVIDIA System Management Interface (nvidia-smi) is a command-line utility provided by NVIDIA. It provides real-time monitoring and management of NVIDIA GPU devices. With nvidia-smi, users can monitor GPU utilization, memory usage, temperature, power consumption, and other relevant metrics. It also allows users to control various GPU settings and configurations.
Read the nvidia-smi documentation for more information.
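For periodic logging rather than a one-off snapshot, nvidia-smi can emit selected metrics as CSV at a fixed interval. The sketch below is one possible way to do that; the function name gpu_log, the variable GPU_FIELDS, the chosen fields, and the 5-second default are illustrative assumptions. It must be run on a GPU node, since nvidia-smi fails on the login nodes.

```shell
# Metrics to report, as a comma-separated list of nvidia-smi query fields.
GPU_FIELDS="timestamp,index,name,utilization.gpu,memory.used,memory.total"

# Hypothetical helper: print the fields above as CSV every N seconds
# (default 5) until interrupted with Ctrl-C.
gpu_log() {
    nvidia-smi --query-gpu="$GPU_FIELDS" --format=csv -l "${1:-5}"
}

# Example use on a GPU node (commented out because it needs a GPU):
#   gpu_log 10 > gpu_usage.csv
```

Redirecting the output to a file gives a simple usage trace that can be inspected after the job finishes.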
nvtop
nvtop is a lightweight, interactive command-line utility for monitoring NVIDIA GPU processes. Similar to the popular system monitoring tool htop, nvtop provides a real-time overview of GPU utilization, memory usage, temperature, and GPU processes. It displays GPU utilization in a customizable, easy-to-read interface, allowing users to quickly identify bottlenecks and optimize GPU performance.
nvtop is available as a module for both the older NVIDIA GPUs (K80, P100, and V100) and the newer A100 GPUs. Load the module with the following command:
module load nvtop
Read the nvtop documentation for more information.
GPU Queues
The GPU queues have the following attributes:
Queue | Default run time | Max run time | Max # of nodes |
gpu | 1 hour | 8 hours | 2 |
gpu-long | 8 hours | 48 hours | 1 |
gpu-large | 1 hour | 8 hours | 4 |
p100 | 1 hour | 24 hours | 1 |
v100 | 1 hour | 24 hours | 1 |
a100 | 1 hour | 8 hours | 2 |
a100-long | 8 hours | 48 hours | 1 |
a100-large | 1 hour | 8 hours | 4 |
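The limits in the table can be turned into a simple selection rule. The sketch below does this for the A100 queues only; the function name pick_a100_queue is hypothetical, it expects whole-number hours and node counts, and it prefers the smallest queue that fits, which is an assumed policy rather than a SeaWulf recommendation.

```shell
# Hypothetical helper: print an A100 queue that fits the requested
# run time (whole hours) and node count, using the limits in the table.
pick_a100_queue() {
    hours=$1
    nodes=$2
    if [ "$hours" -le 8 ] && [ "$nodes" -le 2 ]; then
        echo "a100"
    elif [ "$hours" -le 8 ] && [ "$nodes" -le 4 ]; then
        echo "a100-large"
    elif [ "$hours" -le 48 ] && [ "$nodes" -le 1 ]; then
        echo "a100-long"
    else
        echo "no a100 queue allows $hours hours on $nodes nodes" >&2
        return 1
    fi
}

# Example: a 24-hour, single-node job.
pick_a100_queue 24 1
```

Because each queue's default run time is shorter than its maximum, a job that needs more than the default should also request it explicitly, for example with sbatch's '--time' option.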