Computing Power Glossary
A
A100
[ay-one-hundred]
NVIDIA's flagship data center GPU of the Ampere generation, released in 2020. Available in 40GB (HBM2) and 80GB (HBM2e) memory configurations.
Example:
"We need 8x A100 80GB GPUs to train our 70B parameter model efficiently."
Ampere Architecture
NVIDIA's GPU microarchitecture generation released in 2020, featuring improved tensor cores and support for sparsity acceleration.
B
Batch Size
The number of training samples processed together in one forward/backward pass during neural network training. Larger batch sizes can improve GPU utilization but require more VRAM.
Example:
"Increasing the batch size from 32 to 128 improved our H100 utilization from 60% to 95%."
C
CUDA
[koo-dah]
Compute Unified Device Architecture - NVIDIA's parallel computing platform and API that allows software to use GPUs for general-purpose processing.
Example:
"This model requires CUDA 11.8 or higher for optimal performance."
CUDA Cores
The parallel processing units within NVIDIA GPUs. More CUDA cores generally mean higher parallel throughput, although architecture and clock speed matter as well.
F
FLOPS
[flops]
Floating Point Operations Per Second - A measure of computer performance, especially in scientific computations. Common scales: TFLOPS (trillion), PFLOPS (quadrillion), EFLOPS (quintillion).
Example:
"The H100 delivers up to 989 TFLOPS of FP16 tensor performance."
FP16 (Half Precision)
16-bit floating-point format that uses half the memory of FP32, enabling faster computation and reduced memory usage with minimal accuracy loss for many AI workloads.
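A small illustration of the memory halving, assuming PyTorch:

import torch

x32 = torch.randn(1024, 1024)                # FP32 tensor
x16 = x32.half()                             # same values stored in FP16

# element_size() is bytes per element: 4 for FP32, 2 for FP16.
print(x32.element_size() * x32.nelement())   # 4,194,304 bytes
print(x16.element_size() * x16.nelement())   # 2,097,152 bytes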
G
GPU
[jee-pee-you]
Graphics Processing Unit - A specialized processor originally designed for graphics rendering, now widely used for parallel computing tasks including AI/ML training and inference.
H
H100
[aych-one-hundred]
NVIDIA's latest flagship data center GPU based on the Hopper architecture, released in 2022. Features 80GB of HBM3 memory and delivers up to 3x the performance of the A100.
Example:
"The H100's FP8 support cuts our training time in half compared to the A100."
HBM (High Bandwidth Memory)
A type of memory interface for 3D-stacked DRAM that provides significantly higher bandwidth than traditional GDDR memory. The HBM3 in the H100 provides roughly 3.35 TB/s of bandwidth on the SXM variant.
I
Inference
The process of using a trained neural network model to make predictions on new data. Generally requires less compute power than training.
Example:
"For inference workloads, a single A100 can serve 100+ concurrent users."
L
LLM (Large Language Model)
AI models with billions of parameters trained on vast text datasets; examples include GPT-4, LLaMA, and Claude. Training and serving them requires significant GPU resources.
Example:
"Training a 70B parameter LLM requires at least 16x A100 80GB GPUs."
M
Mixed Precision Training
A technique that uses both FP16 and FP32 computations to accelerate training while maintaining model accuracy. Can provide 2-3x speedup on modern GPUs.
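A sketch of one common recipe, assuming PyTorch's automatic mixed precision (autocast plus a gradient scaler) and a CUDA GPU; the model, optimizer, and data loader are placeholders:

import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss so small FP16 gradients don't underflow

for inputs, targets in loader:         # `loader` as in the Batch Size sketch above
    optimizer.zero_grad()
    # Run the forward pass in FP16 where safe; autocast keeps numerically
    # sensitive ops in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()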
N
NVLink
NVIDIA's high-speed interconnect technology for direct GPU-to-GPU communication. NVLink 4.0 in H100 provides 900 GB/s bidirectional bandwidth.
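A quick peer-to-peer check, assuming PyTorch and a multi-GPU machine; note that P2P access can also run over PCIe, so a True result does not by itself prove NVLink is present:

import torch

# Peer-to-peer access lets one GPU read another's memory directly
# (over NVLink where available) instead of staging through host RAM.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))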
P
Parameters
The learnable weights in a neural network. Model size is often described by parameter count (e.g., 7B, 70B, 175B parameters).
Example:
"Each billion parameters requires roughly 2GB of memory in FP16 format."
Q
Quantization
The process of reducing the precision of model weights and activations (e.g., from FP16 to INT8) to reduce memory usage and increase inference speed.
Example:
"4-bit quantization allows us to run a 70B model on a single A100 40GB."
S
Spot Instance
Unused cloud compute capacity offered at significant discounts (up to 90% off) that the provider can reclaim on short notice.
Example:
"We use spot instances for batch inference jobs, saving 70% on GPU costs."
T
Tensor Cores
Specialized processing units in NVIDIA GPUs designed for matrix multiplication operations, providing significant acceleration for AI workloads.
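A sketch of how tensor cores typically get used, assuming PyTorch and a CUDA GPU: matrix multiplies in FP16/BF16 are dispatched to tensor-core kernels automatically, and the TF32 flag extends this to FP32 matmuls on Ampere and newer GPUs:

import torch

torch.backends.cuda.matmul.allow_tf32 = True   # let FP32 matmuls use TF32 tensor cores

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b   # FP16 matmul, executed on tensor cores on supported GPUs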
TFLOPS
[tee-flops]
Trillion (10^12) Floating Point Operations Per Second. Common metric for measuring GPU compute performance.
TPU
[tee-pee-you]
Tensor Processing Unit - Google's custom-developed ASIC designed specifically for neural network machine learning. The TPU v4 offers performance comparable to the A100.
Example:
"Google Cloud's TPU v4 pods can deliver up to 1.1 exaflops of compute."
V
VRAM
[vee-ram]
Video Random Access Memory - The dedicated high-speed memory on a GPU. The amount of VRAM determines the maximum model size and batch size you can process.
Example:
"A 70B parameter model requires at least 140GB of VRAM in FP16, so you need multiple GPUs."