Table of Contents
- Motivation
- Optimization goal of GPUs
- Key concepts of GPUs - software and hardware
- Deeper dive into a warp
- How are tensor cores different from CUDA cores?
- Mapping software to hardware
- Why do warps exist?
- Why is a block restricted to running on a single SM?
- What is the difference between L1 cache and shared memory?
- What if a single block cannot fit within a single SM?
- Scaling compute across GPUs
- Explanation of notations 16x and 4x
Motivation
Trying to understand how GPUs work, but confused by all the jargon? This article explains:
- What GPUs are optimised for
- Key concepts for understanding how a single GPU works
- How GPUs within a single node connect with each other
- How GPU nodes are connected to each other
Credits to the Stanford CS 336 lectures. I wrote this article with the assistance of Google Gemini, which helped me fill in knowledge gaps whenever I lacked the prerequisite knowledge to fully understand the lecture materials.
Optimization goal of GPUs
GPUs are optimised for parallel processing and high-throughput computing, making them highly efficient at handling a massive amount of data simultaneously. This contrasts with CPUs, which are primarily optimized for sequential processing and handling a wide variety of tasks with lower latency.
In the context of machine learning, we mainly use GPUs for matrix multiplications. The goal is for matrix multiplications to be compute-bound, which is achieved with a high FLOPs-per-byte ratio (see the roofline model below). When a GPU is compute-bound, its CUDA cores are utilized to their maximum, i.e. we are getting the most performance out of the hardware.
The "roofline" represents the maximum achievable performance for a given hardware platform. The slanted "roof" indicates the program is memory-bound, i.e processing is limited by how fast data can be moved from memory to processor. A flat "ceiling" represents the peak theoretical performance of the processor's Arithmetic Logic Units, or CUDA cores in the context of Nvidia GPUs. This indicates the program is limited by the raw computational power of the processor.
Source: Nvidia Nsight
- Operational intensity: defined as the ratio of the total number of floating-point operations (FLOPs) performed to the number of bytes of data moved from memory. A higher operational intensity means the algorithm performs many calculations for each piece of data it fetches, which usually happens with dense matrix multiplications (see the back-of-the-envelope sketch after this list). Conversely, for sparse matrices with many zero elements, computations involving these zeroes are often skipped, resulting in fewer calculations per unit of data read from memory.
- Throughput: measures the performance of the hardware when running a specific program.
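As a rough back-of-the-envelope sketch (my own, not from the lecture notes), the operational intensity of a dense N x N FP32 matrix multiplication can be estimated by counting 2N^3 FLOPs against the bytes needed to read both inputs and write the output once:

// Back-of-the-envelope sketch (my own): operational intensity of a dense
// N x N FP32 matrix multiplication, assuming A and B are each read once
// from memory and C is written once.
#include <cstdio>

int main() {
    double N = 4096.0;
    double flops = 2.0 * N * N * N;       // one multiply and one add per inner-product step
    double bytes = 3.0 * N * N * 4.0;     // read A, read B, write C; 4 bytes per FP32 element
    std::printf("Operational intensity: %.0f FLOPs/byte\n", flops / bytes);  // ~683 for N = 4096
    return 0;
}

The intensity grows roughly as N/6 FLOPs per byte, so large dense matrix multiplications sit well inside the compute-bound region of the roofline, while memory-light element-wise operations do not.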
Key concepts of GPUs - software and hardware
Term | Nature | Controlled By | Description | Key Relationship | Analogy |
---|---|---|---|---|---|
Grid | Software (Logical) | Programmer | A collection of all blocks for a single kernel launch. Represents the entire problem space. | A grid is executed by all available SMs on the GPU. | The entire workforce for a massive project. |
Streaming Multiprocessor (SM) | Hardware (Physical) | Hardware | A physical processing unit on the GPU chip. It contains CUDA cores, shared memory, schedulers, and registers. | An SM executes one or more blocks concurrently. | A factory floor or a workshop. |
Block | Software (Logical) | Programmer | A group of threads (up to 1024) designed to cooperate. | A block is always executed by a single SM. It is never split. | A team of workers assigned to a specific task on one factory floor. |
Warp | Hardware & Software Interaction | Hardware | A group of 32 threads that are scheduled and executed in lockstep (SIMT). The fundamental unit of scheduling. | An SM breaks a block down into warps and schedules them for execution. | A small group of 32 workers given the same instruction by a foreman. |
Thread | Software (Logical) | Programmer | The most basic unit of execution. A single instance of the kernel function operating on its own private data. | A thread belongs to a warp, and its instructions are executed on a CUDA core. | An individual worker. |
CUDA Core | Hardware (Physical) | Hardware | The fundamental arithmetic logic unit (ALU) on the GPU. It performs the actual math (add, multiply, etc.). | A CUDA core executes instructions from one thread at any given clock cycle. It rapidly switches between many threads over time. | A single tool or machine on the factory floor used by a worker. |
- Note: SIMT means single instruction, multiple threads. It is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with zero-overhead multithreading, i.e. multithreading where the hardware is capable of switching between threads on a cycle-by-cycle basis.
Source: Demystifying GPU Compute Architectures
Deeper dive into a warp
Below is a diagram of a warp from Nvidia’s Hopper H100 Architecture White Paper.
Source: Demystifying GPU Compute Architectures
We can see there is one "Warp Scheduler (32 thread/clk)" block visible at the top of the processing units. "Warp Scheduler (32 thread/clk)" means that the Warp Scheduler can dispatch or schedule 32 threads per clock cycle. This indicates the throughput capacity of the scheduler, allowing it to issue instructions for a full warp (32 threads) every clock cycle.
"Clk" is an abbreviation for clock cycle (or clock). In the context of a processor like a GPU, a clock cycle is the basic unit of time that synchronizes the operations of the internal components. The clock speed (measured in MHz or GHz) determines how many clock cycles occur per second.
The INT32, FP32, and FP64 counts denote the physical hardware units available to perform operations on these data types for the threads in the warp. Each of these units is a CUDA core. If there are fewer than 32 units for a specific operation, the operation for the entire warp will be "serialized" over multiple clock cycles on the available units; for example, with only 16 FP64 units in the processing block, an FP64 instruction for a full 32-thread warp is issued over two clock cycles. A single warp cannot "mix and match" FP32, FP64, and INT32 operations simultaneously in a single clock cycle.
How are tensor cores different from CUDA cores?
Tensor Cores and CUDA Cores are both processing units within NVIDIA GPUs, but they are designed for different types of computational tasks.
- CUDA Cores: These are general-purpose parallel processors. They handle a wide range of computational tasks, including graphics rendering, physics simulations, video processing, and general compute operations. They primarily operate on single-precision (FP32) and double-precision (FP64) floating-point numbers, making them versatile for a broad spectrum of computations.
- Tensor Cores: These are specialized hardware units optimized specifically for accelerating tensor operations, particularly matrix multiplications and fused multiply-add (FMA) operations, which are fundamental to deep learning and high-performance computing. They are designed for mixed-precision computing, supporting half-precision (FP16) and integer (INT8) calculations, which allows for faster training and less memory consumption in AI models without significant accuracy loss.
There is no need to explicitly instruct PyTorch to use the tensor cores for matrix multiplications. Instead, PyTorch uses libraries such as cuBLAS and cuDNN, which automatically utilize tensor cores provided the following conditions are met (a CUDA-level sketch of the primitive these libraries build on follows the list):
- GPU supports Tensor Cores.
- Data uses appropriate data types, i.e. Mixed Precision.
- The operations are conducive to Tensor Cores, i.e. matrix multiplications (e.g., torch.matmul, linear layers, convolutions) and fused multiply-add operations. Element-wise operations, activations, etc., are typically handled by CUDA Cores.
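For illustration only (a minimal sketch of my own; nothing like this needs to be written when using PyTorch), this is roughly how tensor cores are exposed directly in CUDA through the WMMA API: a single warp cooperatively performs a 16x16x16 half-precision matrix multiply-accumulate, the kind of warp-level primitive that cuBLAS and cuDNN build their kernels on.

// Requires a tensor-core-capable GPU (compute capability 7.0+) and a launch
// with at least one full warp, e.g. wmma_tile<<<1, 32>>>(A, B, C).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C = A * B for a single 16x16x16 tile. A is row-major,
// B is column-major, C is row-major; each pointer holds exactly one tile.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                 // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);             // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // executed on the tensor cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}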
Mapping software to hardware
The programmer defines the Grid, Block, and Thread structure. The GPU hardware maps this structure onto its physical SMs and CUDA Cores, using the Warp as the fundamental scheduling mechanism to keep the hardware busy and hide memory latency.
Programmer's Logical View:
Grid {
Block_1 { Thread_1, Thread_2, ..., Thread_1024 }
Block_2 { Thread_1, Thread_2, ..., Thread_1024 }
...
}
GPU's Physical Execution View:
GPU {
SM_1 {
- Schedules Warp_A, Warp_B, ... (from Block_1)
- Its [CUDA Cores] execute instructions for threads in those warps
- Has a pool of Shared Memory for Block_1
}
SM_2 {
- Schedules Warp_X, Warp_Y, ... (from Block_2)
- Its [CUDA Cores] execute instructions for threads in those warps
- Has a pool of Shared Memory for Block_2
}
...
}
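To make the mapping concrete, here is a minimal sketch (my own example; the kernel name add_one and the sizes are arbitrary) of how the programmer expresses the logical grid/block/thread structure, which the hardware then schedules onto SMs, warps, and CUDA cores:

#include <cuda_runtime.h>

// Each thread derives its own global index from its block and thread IDs.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Programmer's logical view: 1024 threads per block, enough blocks to cover n.
    dim3 block(1024);
    dim3 grid((n + block.x - 1) / block.x);
    add_one<<<grid, block>>>(d_data, n);   // hardware maps blocks to SMs and splits them into warps

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}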
Why do warps exist?
Because fetching and decoding instructions is complex and costly in silicon, GPUs make the trade-off of simplifying control logic in exchange for more cores: a single set of control logic is shared by a group of 32 CUDA cores. This reduces the complexity and power consumption of the control logic.
As a result, the programmer needs to design code carefully to maximise the use of all 32 cores in a warp. For example, with the following if-else statement, only 16 threads of the warp are active at any point in time, which results in a 50% utilization rate.
// All 32 threads in a warp execute this code at the same time
if (threadIdx.x < 16) {
// Path A: Executed by the first 16 threads
do_something();
} else {
// Path B: Executed by the last 16 threads
do_something_else();
}
Below is a simple diagram representing how using if-else statements results in 50% utilization of a warp.
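One common way to keep utilization high (a sketch of my own, not from the lecture) is to branch at warp granularity, so that all 32 threads of a warp take the same path and no warp has to execute both branches:

__global__ void warp_aligned_branch(float *data) {
    int warp_id = threadIdx.x / 32;   // the same value for every thread in a warp
    if (warp_id % 2 == 0) {
        // Path A: even-numbered warps take this branch as a whole
        data[threadIdx.x] *= 2.0f;
    } else {
        // Path B: odd-numbered warps take this branch as a whole
        data[threadIdx.x] += 1.0f;
    }
}

Because the branch condition is uniform within each warp, there is no divergence, and every active warp keeps all 32 of its cores busy.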
Why is a block restricted to running on a single SM?
- Shared memory access: Shared memory and L1 cache are on the order of 10x faster than L2 cache and global memory. A block can take advantage of these fast memories because both shared memory and L1 cache live inside the SM. This difference in memory latency leads to implementations that exploit shared memory and L1 cache to speed up computation, e.g. tiling in matrix multiplications: load as much useful data as possible into shared memory, do as much computation as possible on it, and only then load the next batch of data. This reduces the number of reads from slow global memory (a tiling sketch is included at the end of this section).
Memory type | Shared memory | L1 cache | L2 cache | Global memory |
---|---|---|---|---|
CPI (cycles) | load: 23 / store: 19 | 33 | 200 | 290 |
Source: Optimizing your GPU infrastructure
- Thread synchronization: The barrier synchronization __syncthreads() is managed at the hardware level by the SM, and having a block extend across multiple SMs would significantly slow down synchronization.
- Efficient scheduling: By confining a block to a single SM, the GPU's main scheduler only needs to find an SM with enough resources to launch the entire block, avoiding the work needed to divide a block's resources and threads across multiple hardware units.
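As an illustration of the tiling idea mentioned above (a minimal sketch of my own, assuming square matrices with N divisible by the tile size), each block stages one tile of A and one tile of B in shared memory, synchronizes with __syncthreads(), and reuses each loaded value many times before going back to global memory:

#define TILE 16

// Launch as: tiled_matmul<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
__global__ void tiled_matmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings one element of each tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait before overwriting the tile
    }
    C[row * N + col] = acc;
}

Each element of A and B loaded into shared memory is reused TILE times, cutting global-memory reads by roughly a factor of TILE compared with the naive kernel.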
What is the difference between L1 cache and shared memory?
On modern Nvidia GPUs, both the L1 cache and shared memory exist on a single, unified block of SRAM in an SM. L1 cache and shared memory are dynamically partitioned on a per-kernel basis. A programmer can set cudaFuncCachePreferShared if the kernel relies more heavily on inter-thread communication, and cudaFuncCachePreferL1 if the kernel has scattered global memory access patterns (a short host-code sketch follows the figure below).
Source: Basic memory hierarchy of an NVIDIA A100 40GB GPU
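For reference, here is a minimal sketch (my own, with a placeholder kernel) of how this preference is expressed from host code via cudaFuncSetCacheConfig; on recent architectures this is only a hint that the driver may ignore:

#include <cuda_runtime.h>

// Placeholder kernel that leans on shared memory for inter-thread communication.
__global__ void my_kernel() {
    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
}

int main() {
    // Prefer a larger shared-memory partition for this kernel ...
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // ... or prefer a larger L1 cache for kernels with scattered global accesses:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    my_kernel<<<1, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}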
What if a single block cannot fit within a single SM?
The kernel launch fails: the CUDA runtime returns an error and the kernel will not run. A block must respect the following contract with the GPU hardware (the sketch after this list shows how to query these limits):
- Maximum threads per block: e.g. a modern Nvidia GPU allows for a maximum of 1024 threads per block.
- Shared memory per block: each SM has a fixed amount of physical shared memory, e.g. 64 KB, 96 KB.
- Registers per block: Each SM has a finite number of physical registers, e.g. 65,536 32-bit registers. Every thread in a block requires a certain number of registers, determined by the compiler. The total number of registers a block needs is (threads per block) * (registers per thread).
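These limits can be queried at runtime (a minimal sketch of my own) with cudaGetDeviceProperties, which is a quick way to check the contract for the specific GPU you are running on:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    std::printf("Max threads per block:      %d\n", prop.maxThreadsPerBlock);
    std::printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    std::printf("32-bit registers per SM:    %d\n", prop.regsPerMultiprocessor);
    return 0;
}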
Scaling compute across GPUs
To support intensive parallel processing, GPU nodes are interconnected in the following setup to achieve high-speed data transfer both within a node (intra-node) and between nodes (inter-node).
Source: CS 336, lecture on parallelism 1, 9 minute 14 seconds
In short, nodes in a high-performance NVIDIA GPU cluster utilize two distinct, parallel networking layers. For the most intensive GPU-to-GPU communication, they are connected over an NVLink fabric, which is managed and scaled by NVSwitch. This high-bandwidth fabric operates both intra-node (connecting all GPUs within a single server) and, in advanced systems, inter-node, allowing NVSwitch to link dozens of nodes to form a single, seamless, data-center-sized GPU compute domain.
Concurrently, these same nodes use Host Channel Adapters (HCAs) to connect to a broader cluster network, such as InfiniBand or Ethernet. This second network layer handles all other inter-node communication, including connecting to storage systems and scaling the cluster to hundreds or thousands of nodes, well beyond what a single NVLink fabric can encompass.
Source: Nvidia Hopper architecture in depth
Component / Interconnect | Full Name | Purpose | Differentiation from Other Components |
---|---|---|---|
CPU (0 & 1) | Central Processing Unit | The "brain" of the server, responsible for general-purpose computation, executing operating system instructions, and managing system resources. In this architecture, they manage the overall workflow and data distribution to the specialized processors (GPUs). | CPUs are designed for a wide range of tasks. They differ from GPUs, which are highly specialized for parallel computations. There are two CPUs in this diagram to provide more processing power and I/O capabilities. |
GPU (0-7) | Graphics Processing Unit | Specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. In this context, they are used for general-purpose computing on graphics processing units (GPGPU) to accelerate highly parallel tasks like deep learning and scientific simulations. | GPUs are massively parallel processors with thousands of cores, making them far more efficient than CPUs for tasks that can be broken down into many simultaneous operations. This diagram features eight GPUs, highlighting its focus on parallel computing. |
PLX | PLX Technology (a brand of PCIe switches) | These are PCI Express (PCIe) switches. Their purpose is to expand the number of available PCIe lanes from the CPU, allowing a single CPU to connect to multiple high-bandwidth devices like GPUs and HCAs simultaneously. | Unlike a simple bus, a switch provides dedicated point-to-point connections, reducing bottlenecks. The PLX switches in the diagram aggregate the connections from the GPUs and HCAs before connecting to the CPUs. |
HCA (0-3) | Host Channel Adapter | An HCA is the network interface card for InfiniBand, a high-performance networking technology. Its purpose is to connect the server node to an external InfiniBand network (fabric), enabling high-speed, low-latency communication with other server nodes for high-speed inter-node parallelism. | While a standard Ethernet NIC connects to a traditional network, an HCA is specifically for the high-throughput and low-latency InfiniBand network, which is common in supercomputing and large-scale AI clusters. |
NVSwitch (0-5) | NVIDIA NVSwitch | A high-speed switch developed by NVIDIA that enables all-to-all, non-blocking communication between multiple GPUs using the NVLink protocol. This creates a unified memory space across the GPUs, allowing them to work together as a single, powerful processor with very high inter-GPU bandwidth. | NVSwitch is distinct from PCIe switches (like PLX) as it uses the much faster and more efficient NVLink protocol specifically for GPU-to-GPU communication. It provides significantly higher bandwidth and lower latency than what is achievable over PCIe. |
HDR InfiniBand | High Data Rate InfiniBand | A high-speed interconnect standard used for communication between server nodes (inter-node parallelism). In the diagram, it facilitates data transfer between the server and other parts of the cluster via the HCAs, with a speed of 50 GT/s per lane. | HDR InfiniBand offers much lower latency and higher bandwidth compared to traditional Ethernet, making it ideal for HPC and AI workloads where fast inter-node communication is critical. |
PCI Express 4.0 | Peripheral Component Interconnect Express 4.0 | A high-speed serial computer expansion bus standard. It is used to connect the CPUs to the PLX switches and subsequently to the GPUs and HCAs. It provides the primary data pathway for components that are not connected via NVLink, operating here at 16 GT/s per lane. | PCIe is a general-purpose interconnect, whereas NVLink is specialized for GPU communication. PCIe 4.0 offers double the bandwidth of its predecessor, PCIe 3.0. |
xGMI-2 | Socket to Socket Global Memory Interconnect 2 | A high-speed interconnect developed by AMD for direct communication between two CPU sockets. It allows for fast and coherent memory access between the two CPUs, which is essential for dual-socket server performance. The diagram shows it running at 16 GT/s per lane. | xGMI is specific to AMD CPUs for inter-socket communication. Intel uses a similar technology called Ultra Path Interconnect (UPI). It provides a dedicated, high-speed link between CPUs. |
NVLink 3.0 | NVIDIA NVLink 3.0 | A proprietary high-speed, point-to-point interconnect developed by NVIDIA for connecting GPUs. In this diagram, it connects the GPUs through the NVSwitches, enabling extremely fast data sharing between them at a rate of 400 GT/s per lane. | NVLink 3.0 provides significantly higher bandwidth than PCIe 4.0, which is crucial for large-scale AI model training and other data-intensive GPU workloads. It is the backbone of intra-node GPU parallelism in this architecture. |
Explanation of notations 16x and 4x
The notations 16x and 4x refer to the number of lanes in the respective interconnects (PCIe, xGMI).
- Lane: A lane is a set of differential wire pairs, with one pair for transmitting data and the other for receiving. This allows for full-duplex communication, meaning data can be sent and received simultaneously.
- Bandwidth Scaling: The total bandwidth of an interconnect is directly proportional to the number of lanes it has. A 16x connection has 16 lanes and therefore provides 16 times the bandwidth of a 1x connection.
Source: Wikipedia
In the context of the diagram:
- 16x: Indicates a connection with 16 lanes. This is a high-bandwidth connection typically used for demanding components like GPUs and the links between CPUs and PCIe switches. For example, a PCIe 4.0 x16 connection offers a theoretical bandwidth of 32 GB/s.
- 4x: Indicates a connection with 4 lanes. While providing less bandwidth than a 16x connection, it is still a high-speed link suitable for devices like the Host Channel Adapters (HCAs) shown in the diagram. A PCIe 4.0 x4 connection provides a theoretical bandwidth of 8 GB/s. The short calculation below shows where these figures come from.
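These figures follow directly from the per-lane rate (a back-of-the-envelope sketch of my own): PCIe 4.0 signals at 16 GT/s per lane and uses 128b/130b encoding, so each lane carries just under 2 GB/s in each direction.

#include <cstdio>

int main() {
    double transfers_per_s = 16e9;                 // PCIe 4.0: 16 GT/s per lane
    double encoding = 128.0 / 130.0;               // 128b/130b line-encoding efficiency
    double bytes_per_lane = transfers_per_s * encoding / 8.0;   // ~1.97 GB/s per direction
    std::printf("x16: %.1f GB/s\n", 16 * bytes_per_lane / 1e9); // ~31.5 GB/s
    std::printf("x4:  %.1f GB/s\n",  4 * bytes_per_lane / 1e9); // ~7.9 GB/s
    return 0;
}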