Occupancy Equation
Occupancy --- the number of concurrent waves a compute unit can sustain --- is governed by a single equation that applies uniformly across all four target GPU architectures.
The Equation
    O = floor(F / (R * W * w))

Where:

| Symbol | Meaning | Unit |
|---|---|---|
| O | Occupancy (concurrent waves per compute unit) | waves |
| F | Register file size per compute unit | bytes |
| R | Registers used per thread | registers |
| W | Wave width (threads per wave) | threads |
| w | Register width | bytes |
The floor function reflects the fact that partial waves cannot be scheduled: you either have enough registers for an entire wave or you do not.
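
As a minimal sketch, the equation maps directly onto integer division. The function name `occupancy` and the sample values below are illustrative assumptions, not part of WAVE.

```python
def occupancy(F: int, R: int, W: int, w: int) -> int:
    """Concurrent waves per compute unit: floor(F / (R * W * w)).

    F: register file size per compute unit, in bytes
    R: registers used per thread
    W: wave width, in threads
    w: register width, in bytes
    """
    if min(F, R, W, w) <= 0:
        raise ValueError("all parameters must be positive")
    bytes_per_wave = R * W * w  # register bytes one complete wave needs
    return F // bytes_per_wave  # integer division implements the floor


# Hypothetical values: 64 KiB register file, 40 registers per thread,
# 32-wide waves, 4-byte registers.
print(occupancy(F=65_536, R=40, W=32, w=4))  # 65,536 / 5,120 = 12.8, floored to 12
```

The 0.8 of a wave left over in this example is register space that cannot be scheduled, which is exactly what the floor expresses.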
Why This Equation Matters
Occupancy directly determines a GPU’s ability to hide memory latency. When one wave stalls on a memory access, the hardware switches to another wave. Higher occupancy means more waves available for latency hiding, which generally means higher throughput for memory-bound kernels.
The equation reveals a fundamental tension: more registers per thread means fewer concurrent waves. A kernel that uses 64 registers achieves about half the occupancy of one that uses 32 registers, all else being equal; the floor can shift the exact ratio slightly. The WAVE compiler’s register allocator must balance register pressure against occupancy.
Vendor Calculations
NVIDIA (Ampere SM)

    F = 256 KB = 262,144 bytes
    R = 255 (maximum registers per thread)
    W = 32 (warp width)
    w = 4 bytes (32-bit registers)

    O = floor(262,144 / (255 * 32 * 4)) = floor(262,144 / 32,640) = floor(8.03) = 8 warps/SM (at maximum register usage)

With typical register usage of 32 registers:

    O = floor(262,144 / (32 * 32 * 4)) = floor(64) = 64 warps/SM

In practice, NVIDIA Ampere SMs support up to 64 warps, so at this register count the register file already matches the hardware limit and register pressure is rarely the only bottleneck. At maximum register usage (255 registers), occupancy drops to 8 warps.
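
The interaction between the register-limited value and the 64-warp ceiling can be sketched as a simple `min`. The function name `achievable_warps` and the parameter `hw_max_warps` are assumptions made for this illustration, and the sketch ignores every other limit (shared memory, block counts, register allocation granularity).

```python
def achievable_warps(F: int, R: int, W: int, w: int, hw_max_warps: int) -> int:
    """Register-limited occupancy, clamped to the hardware's resident-warp limit."""
    register_limited = F // (R * W * w)
    return min(register_limited, hw_max_warps)


# Ampere SM parameters as given in this section.
AMPERE = dict(F=262_144, W=32, w=4, hw_max_warps=64)

for R in (16, 32, 64, 128, 255):
    print(f"R={R:3d} -> {achievable_warps(R=R, **AMPERE)} warps/SM")
```

Under these assumptions, 16 and 32 registers both yield 64 warps (the ceiling), 64 registers yield 32, 128 yield 16, and 255 yield 8.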
AMD (RDNA 3 CU)

    F = 512 KB = 524,288 bytes
    R = 256 (maximum VGPRs per thread)
    W = 64 (wavefront width, Wave64 mode)
    w = 4 bytes

    O = floor(524,288 / (256 * 64 * 4)) = floor(524,288 / 65,536) = floor(8.0) = 8 wavefronts/CU (at maximum register usage)

AMD’s larger register file compensates for the wider wave width. In Wave32 mode (W=32), occupancy doubles.
Intel (Xe-HPG EU)

    F = 128 KB = 131,072 bytes
    R = 128 (maximum GRF registers per thread)
    W = 16 (sub-group width, SIMD16)
    w = 4 bytes

    O = floor(131,072 / (128 * 16 * 4)) = floor(131,072 / 8,192) = floor(16.0) = 16 threads/EU (at maximum register usage)

Intel’s narrower sub-groups and smaller register file per EU yield high occupancy per EU, though each EU is smaller than an NVIDIA SM or AMD CU.
Apple (M1 GPU Core)

    F = 208 KB = 212,992 bytes
    R = 128 (maximum registers per thread)
    W = 32 (SIMD-group width)
    w = 4 bytes

    O = floor(212,992 / (128 * 32 * 4)) = floor(212,992 / 16,384) = floor(13.0) = 13 SIMD-groups (at maximum register usage)

Apple’s GPU achieves relatively high occupancy even at maximum register pressure, reflecting a design that favors thread-level parallelism.
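
The four vendor calculations above can be reproduced in a few lines. This is only a sketch: the `Target` dataclass and its field names are hypothetical, and the numbers are copied from this section rather than queried from hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    F: int      # register file size per compute unit, bytes
    R_max: int  # maximum registers per thread
    W: int      # wave width, threads
    w: int = 4  # register width, bytes


TARGETS = [
    Target("NVIDIA Ampere SM", 256 * 1024, 255, 32),
    Target("AMD RDNA 3 CU (Wave64)", 512 * 1024, 256, 64),
    Target("Intel Xe-HPG EU (SIMD16)", 128 * 1024, 128, 16),
    Target("Apple M1 GPU core", 208 * 1024, 128, 32),
]


def occupancy_at_max_registers(t: Target) -> int:
    """Occupancy when every thread uses the maximum register count."""
    return t.F // (t.R_max * t.W * t.w)


for t in TARGETS:
    print(f"{t.name:26s} {occupancy_at_max_registers(t):3d}")
```

Run as written, the loop reports 8, 8, 16, and 13, matching the worked figures for each vendor.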
Implications for WAVE
Register allocation strategy
The WAVE compiler can target a specific occupancy by capping register usage. If a backend specifies that a kernel should achieve at least O=4 waves, the compiler can compute the maximum register budget:

    R_max = floor(F / (O_target * W * w))

Registers beyond this budget are spilled to local memory.
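
A small sketch of the budget calculation follows, using the Ampere parameters from earlier in this section and a hypothetical target of 12 waves. The function names are illustrative assumptions, not the WAVE compiler's actual interface.

```python
def max_register_budget(F: int, O_target: int, W: int, w: int) -> int:
    """Largest per-thread register count that still allows O_target waves."""
    return F // (O_target * W * w)


def occupancy(F: int, R: int, W: int, w: int) -> int:
    """Register-limited occupancy: floor(F / (R * W * w))."""
    return F // (R * W * w)


F, W, w = 262_144, 32, 4  # Ampere SM values from this section
budget = max_register_budget(F, O_target=12, W=W, w=w)

print(budget)                           # 262,144 / 1,536 = 170.67, floored to 170
print(occupancy(F, budget, W, w))       # 12: the target is met at the budget
print(occupancy(F, budget + 1, W, w))   # 11: one more register drops a wave
```

Because the budget is floored, staying at or below it guarantees at least the target occupancy, while exceeding it by a single register can already lose a wave. A real allocator would presumably also clamp the budget to the architecture's per-thread register maximum.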
Runtime queries
Because W, F, and w vary across (and sometimes within) architectures, WAVE does not hardcode these values. The runtime provides query functions that return the target’s parameters, and the compiler uses them to make occupancy-aware decisions. This is an instance of the Thin Abstraction principle: define what to optimize for, not what the hardware parameters are.
Cross-vendor portability
The same kernel binary, with the same register count, can achieve different occupancy on different GPUs. This is expected and correct: the equation shows that occupancy is a function of hardware parameters that WAVE cannot and should not try to normalize. Instead, WAVE ensures that the register count is recorded in the WBIN metadata (see Binary Encoding), allowing the runtime to compute occupancy and make dispatch decisions accordingly.
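
To illustrate, the sketch below combines a recorded register count with per-target parameters to compute occupancy at dispatch time. Everything named here is hypothetical: `query_target_params` stands in for the runtime's query functions, `register_count` stands in for the WBIN metadata field, and the parameter table reuses the figures from the vendor calculations instead of querying real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetParams:
    F: int  # register file size per compute unit, bytes
    W: int  # wave width, threads
    w: int  # register width, bytes


# Stand-in for values a real runtime would obtain from the device.
_PARAMS = {
    "nvidia-ampere": TargetParams(F=262_144, W=32, w=4),
    "amd-rdna3-wave64": TargetParams(F=524_288, W=64, w=4),
    "amd-rdna3-wave32": TargetParams(F=524_288, W=32, w=4),
    "intel-xe-hpg": TargetParams(F=131_072, W=16, w=4),
    "apple-m1": TargetParams(F=212_992, W=32, w=4),
}


def query_target_params(target: str) -> TargetParams:
    """Hypothetical runtime query; here it is a simple table lookup."""
    return _PARAMS[target]


def occupancy_for(metadata: dict, target: str) -> int:
    """Occupancy of a kernel on a target, from its recorded register count."""
    p = query_target_params(target)
    return p.F // (metadata["register_count"] * p.W * p.w)


kernel_metadata = {"register_count": 64}  # hypothetical metadata field name

for target in _PARAMS:
    print(f"{target:18s} {occupancy_for(kernel_metadata, target):3d}")
```

With 64 registers per thread, this table gives 32 waves on the Ampere, RDNA 3 Wave64, and Xe-HPG entries, 64 on RDNA 3 in Wave32 mode, and 26 on the M1 entry, so the same binary lands at different occupancy depending on the target and wave mode.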