HPC and SeaWulf Glossary

A reference guide to understanding the language of high-performance computing, job scheduling, and the SeaWulf cluster environment.

Core HPC Concepts

Cluster

A collection of interconnected computers (nodes) that work together as a single system to provide increased computational power and reliability. Clusters enable parallel processing by distributing tasks across multiple machines.

High-Performance Computing (HPC)

The practice of aggregating computing power to solve complex computational problems that require significant processing resources. HPC systems are characterized by their ability to process data and execute calculations at rates far exceeding standard commercial computers.

Node

An individual computer within a cluster. Each node typically contains processors (CPUs), memory (RAM), and storage, and can operate independently while contributing to the overall cluster performance.

Core

An individual processing unit within a CPU. Modern processors contain multiple cores, allowing them to execute multiple tasks simultaneously. SeaWulf nodes range from 40-core to 96-core configurations.

Parallel Processing

A technique in which a large computational task is broken into smaller pieces that execute simultaneously across multiple processors or nodes, dramatically reducing the time needed to complete it.
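The idea can be sketched on any Unix machine with `xargs`; this is not a SeaWulf workflow, and the echo below is just a stand-in for real work:

```shell
# Run 8 independent "tasks" with at most 4 concurrent processes (-P 4).
# Each task is a placeholder echo; on a cluster, each would be a real
# computation distributed across cores or nodes.
seq 1 8 | xargs -P 4 -I{} sh -c 'echo "task {} done"'
```

Because the tasks run concurrently, the completion messages may appear in any order, which is characteristic of parallel execution.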

petaFLOPS

A unit of computational speed equal to one quadrillion (10¹⁵) floating-point operations per second. SeaWulf achieves 1.86 petaFLOPS peak performance, indicating its massive computational capability.
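To make the scale concrete, a back-of-the-envelope calculation: how long would a hypothetical workload of 10^18 floating-point operations take at SeaWulf's 1.86 petaFLOPS peak? (Peak rate is a best case; real applications sustain only a fraction of it.)

```shell
# Estimate runtime for 10^18 floating-point operations at peak rate.
awk 'BEGIN {
  peak = 1.86e15          # 1.86 petaFLOPS expressed in FLOP/s
  ops  = 1.0e18           # hypothetical workload size (an assumption)
  printf "%.0f seconds\n", ops / peak
}'
```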

Hardware Components

CPU (Central Processing Unit)

The main processor that executes instructions and performs calculations. Modern HPC systems often feature thousands of CPU cores working in parallel.

GPU (Graphics Processing Unit)

Originally designed for graphics rendering, GPUs excel at parallel processing and are widely used in HPC for accelerating certain types of computations, particularly in machine learning and scientific simulations.

High-Bandwidth Memory (HBM)

Advanced memory technology that provides significantly faster data transfer between memory and processors compared to traditional RAM. SeaWulf features Intel Xeon CPU Max Series processors with HBM technology.

InfiniBand

A high-performance networking technology used to interconnect nodes in HPC clusters. InfiniBand provides extremely low latency and high bandwidth communication between cluster components.

Interconnect

The network infrastructure that connects all nodes in a cluster, enabling high-speed communication and data transfer. The quality of the interconnect significantly impacts cluster performance.

Storage and File Systems

GPFS (General Parallel File System)

IBM's high-performance shared-disk file system designed for large-scale computing environments. GPFS provides concurrent access to files across all nodes in a cluster and is used for SeaWulf's storage infrastructure.

Parallel File System

A distributed storage system that allows multiple nodes to simultaneously access the same files, providing high throughput and scalability for large datasets.

Scratch Space

High-performance temporary storage used for job input/output operations. Scratch space is typically faster than home directories but may have data retention limits.

Storage Array

A collection of storage devices (hard drives, SSDs) that work together to provide large-capacity, reliable data storage for the cluster.

Job Scheduling and Management

SLURM (Simple Linux Utility for Resource Management)

An open-source, fault-tolerant, and highly scalable cluster management and job scheduling system used on SeaWulf to allocate resources and manage computational jobs.

Job

A computational task or set of tasks submitted to the cluster for execution. Jobs specify resource requirements and are queued until appropriate resources become available.

Queue (Partition)

A logical grouping of nodes with similar characteristics or intended uses. Different queues may have different priorities, time limits, and access policies.

Job Script

A file containing both resource requirements (specified with SLURM directives) and the commands to be executed. Job scripts are submitted to the scheduler using commands like `sbatch`.
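A minimal sketch of such a script, submitted with `sbatch my_job.slurm`. The partition name, core count, and time limit below are illustrative assumptions; check the SeaWulf documentation or `sinfo` for the queues and limits that apply to your account.

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=short-40core    # assumed queue name; verify with sinfo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40        # assumed core count for this node type
#SBATCH --time=01:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=example-%j.out     # %j expands to the job ID

echo "job started"
# ... your application command goes here ...
echo "job finished"
```

The `#SBATCH` lines are comments to the shell but directives to SLURM, which is why the same file serves as both a resource request and an executable script.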

Scheduler

Software that manages job submission, queuing, and execution by allocating available resources to waiting jobs based on priority, resource requirements, and availability.

Resource Allocation

The grant of cluster resources to a user or project, typically specified as compute time (e.g., core-hours), storage quotas, and access to particular queues or node types.

SeaWulf-Specific Terms

Login Nodes

Special nodes that provide user access to the SeaWulf cluster. Users connect to login nodes to submit jobs, transfer files, and perform light computational tasks, but should not run intensive computations directly on these nodes.

Compute Nodes

The worker nodes in SeaWulf where actual computational jobs are executed. These nodes are optimized for high-performance computing and are accessed through the job scheduler.

Queue Types

SeaWulf Queue Categories:
  • Short queues: For quick jobs and testing
  • Long queues: For extended computational runs
  • GPU queues: For GPU-accelerated workloads
  • HBM queues: For memory-intensive applications
  • Shared queues: For efficient partial-node utilization
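Queues are selected at submission time with SLURM's `-p` (partition) flag. The snippet below only prints the command it would run, so it works without SLURM installed; the partition name is an assumption, and `sinfo` on SeaWulf lists the real queues.

```shell
# Sketch: choosing a queue (partition) when submitting a job.
partition="short-40core"   # assumed name; run `sinfo` to see actual queues
printf 'sbatch -p %s my_job.slurm\n' "$partition"
```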

Node Types

SeaWulf features heterogeneous node types optimized for different workloads: 40-core and 96-core CPU nodes, GPU nodes with K80 accelerators, high-memory nodes with up to 1TB RAM, and HBM nodes with high-bandwidth memory.

Performance and Optimization

Throughput

The amount of computational work a system completes per unit time. HPC systems maximize throughput by tightly integrating high-end processors, accelerators, memory, and interconnects.

Scalability

The ability of an HPC system to maintain or improve performance as resources (nodes, processors, memory) are added to handle larger computational problems.

Load Balancing

The distribution of computational work across multiple nodes or processors to optimize resource utilization and minimize job completion time.

Benchmarking

The process of testing and measuring HPC system performance using standardized tests to evaluate computational speed, memory bandwidth, and network performance.

Fault Tolerance

The ability of an HPC system to continue operating and completing jobs even when individual components fail, ensuring reliability for long-running computations.

Important Note: This glossary provides general definitions for HPC and SeaWulf-specific terms. For detailed usage instructions and current system specifications, consult the official SeaWulf documentation or contact research computing support.