Control Flow
WAVE uses structured control flow with per-wave divergence stacks to manage lane divergence, avoiding the complexity of arbitrary jump targets while remaining expressive enough for real GPU workloads.
The Divergence Problem
GPU threads execute in lockstep groups (waves). When threads within a wave take different branches of an if statement, the hardware must track which lanes are active and which are masked off. This is divergence, and every GPU vendor handles it differently at the hardware level.
The challenge for a portable ISA: define divergence behavior precisely enough that programs are correct on all targets, without prescribing a mechanism that only one vendor implements.
Structured Control Flow
WAVE mandates structured control flow constructs rather than arbitrary branch/jump instructions:
    if p0           ; push active mask, mask lanes where p0 is false
      ; ... then-body (only p0-true lanes execute)
    else            ; flip mask
      ; ... else-body (only p0-false lanes execute)
    endif           ; pop mask, restore all lanes

    loop            ; push loop mask
      ; ... loop body
      break p1      ; lanes where p1 is true exit the loop
    endloop         ; branch back to loop header for remaining lanes

Why structured, not arbitrary jumps?
- All four vendors use structured divergence internally. Even NVIDIA’s per-thread program counter reconverges at structured control flow boundaries. Arbitrary jumps would require a reconvergence analysis that the hardware already performs for structured constructs.
- Correctness is provable. Structured control flow guarantees that every divergent region has a well-defined reconvergence point. With arbitrary jumps, the backend would need to insert reconvergence barriers, a solved but fragile problem.
- Backend mapping is direct. Each structured construct maps to a small, well-understood sequence on every target (see Backend Mapping).
Per-Wave Divergence Stacks
Each wave maintains its own divergence stack. The stack tracks:
- Active mask: which lanes are currently executing.
- Reconvergence point: where masked-off lanes will rejoin.
When an if is encountered, the current active mask is pushed onto the stack, and the new mask reflects the predicate evaluation. At else, the mask is complemented (relative to the pushed mask). At endif, the original mask is popped and restored.
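The push, complement, and pop steps above can be sketched as a small Python model. This is an illustration only, not the WAVE reference behavior: the class name `DivergenceStack`, its method names, and the 8-lane wave width are assumptions made for the example, and the model tracks only the active mask, not the reconvergence point.

```python
class DivergenceStack:
    """Toy model of one wave's divergence stack (hypothetical names).

    Masks are plain ints: bit i set means lane i is active.
    """

    def __init__(self, wave_width=32):
        self.full_mask = (1 << wave_width) - 1
        self.active = self.full_mask   # all lanes start active
        self.stack = []                # one saved mask per open `if`

    def if_(self, pred_mask):
        self.stack.append(self.active)  # push the current active mask
        self.active &= pred_mask        # keep only lanes whose predicate is true

    def else_(self):
        saved = self.stack[-1]              # mask that was active at the `if`
        self.active = saved & ~self.active  # complement relative to the pushed mask

    def endif(self):
        self.active = self.stack.pop()  # restore the mask from before the `if`


wave = DivergenceStack(wave_width=8)   # 8 lanes, chosen only for readability
wave.if_(0b00001111)                   # predicate true in lanes 0..3
then_mask = wave.active
wave.else_()
else_mask = wave.active
wave.endif()
restored = wave.active
```

Complementing relative to the saved mask, rather than against all lanes, is what keeps lanes disabled by an outer construct from being re-enabled in an inner else-body.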
The Shared-Stack Defect (v0.1)
The v0.1 specification used a single divergence stack shared across all waves in a workgroup. This caused a deadlock scenario:
- Wave A pushes a mask and enters a divergent region.
- Wave B, at a different program counter, hits a barrier and waits.
- Wave A also reaches the barrier, but the shared stack now contains stale data from Wave B’s earlier divergence.
- Neither wave can proceed because the stack state is corrupted.
The fix was straightforward: each wave gets its own divergence stack. This matches every vendor’s implementation (see below) and eliminates cross-wave interference entirely. Full details in Spec Defects.
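To see how a shared stack can hand one wave another wave's saved mask, consider the following Python sketch. It does not reproduce the barrier deadlock itself; it only replays an interleaving of `if`/`endif` events from two waves and compares the final active masks under a shared stack and under per-wave stacks. The function name `replay`, the event format, and the 8-lane masks are assumptions made for the example.

```python
def replay(events, shared):
    """Replay (wave, op, pred_mask) events and return each wave's final mask.

    shared=True models the v0.1 behavior (one stack for the workgroup);
    shared=False gives every wave its own stack.
    """
    full = 0xFF                               # 8 lanes per wave (illustrative)
    active = {"A": full, "B": full}
    shared_stack = []
    own_stacks = {"A": [], "B": []}
    for wave, op, pred in events:
        stack = shared_stack if shared else own_stacks[wave]
        if op == "if":
            stack.append(active[wave])        # save the mask before diverging
            active[wave] &= pred
        elif op == "endif":
            active[wave] = stack.pop()        # may pop another wave's entry
    return active


# Wave A nests two ifs; wave B opens one if in between A's pushes and pops.
events = [
    ("A", "if", 0x0F),
    ("A", "if", 0x03),
    ("B", "if", 0xF0),
    ("A", "endif", None),
    ("B", "endif", None),
    ("A", "endif", None),
]

shared_result = replay(events, shared=True)
per_wave_result = replay(events, shared=False)
```

With the shared stack, wave B finishes with only lanes 0 to 3 active, because its `endif` popped the entry wave A pushed for its inner `if`. Lanes that never come back cannot arrive at a later barrier, which is consistent with the hang described above. With per-wave stacks both waves end with all lanes restored.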
Vendor Divergence Mechanisms
| Vendor | Mechanism | WAVE Mapping |
|---|---|---|
| NVIDIA | Per-thread program counter with convergence barriers (bra.uni, bar.sync). Hardware tracks per-thread PCs and reconverges at sync points. | Structured constructs map to predicated branches with implicit reconvergence. |
| AMD | Compiler-managed EXEC mask. The s_cbranch instructions modify EXEC; the compiler inserts s_or_b64 to restore masks. | if/else/endif map directly to EXEC mask manipulation sequences. |
| Intel | Predicated SIMD execution. The EU evaluates all lanes but applies a channel enable mask per instruction. | Structured constructs map to predicate register updates and predicated instruction sequences. |
| Apple | Hardware divergence stack stored in register r0l. The GPU pushes/pops masks automatically for structured constructs. | Near 1:1 mapping: Apple’s hardware natively supports the same structured model. |
Predicated Execution
WAVE provides four predicate registers (p0 through p3), selected by the 2-bit predicate field in the instruction encoding. The pred_neg bit inverts the condition. Any instruction can be predicated:
    cmp.lt p0, r0, r1   ; set p0 where r0 < r1
    if p0
      iadd r2, r2, r3   ; only executes in lanes where r0 < r1
    endif

Predicates interact with the divergence stack: an if statement evaluates its predicate and intersects the result with the current active mask to produce the new mask.
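The intersection of predicate and active mask can be illustrated with a lane-by-lane Python sketch of the listing above. The helper names `cmp_lt` and `predicated_iadd`, the 4-lane wave, and the register contents are all assumptions made for the example; registers are modeled as one value per lane.

```python
def cmp_lt(a, b):
    """Return a predicate mask with bit i set where a[i] < b[i]."""
    mask = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        if x < y:
            mask |= 1 << lane
    return mask


def predicated_iadd(dst, src, exec_mask):
    """Add src into dst only in lanes whose bit is set in exec_mask."""
    return [d + s if (exec_mask >> lane) & 1 else d
            for lane, (d, s) in enumerate(zip(dst, src))]


active = 0b0111              # lane 3 already masked off by an enclosing construct
r0 = [1, 5, 2, 0]
r1 = [4, 3, 9, 7]
r2 = [10, 10, 10, 10]
r3 = [1, 1, 1, 1]

p0 = cmp_lt(r0, r1)          # true in lanes 0, 2 and 3
new_mask = active & p0       # `if p0`: intersect predicate with the active mask
negated_mask = active & ~p0  # what an inverted condition (pred_neg) would select
r2 = predicated_iadd(r2, r3, new_mask)
```

Lane 3 satisfies r0 < r1 but stays untouched because it was not active on entry, which is the reason the predicate is intersected with the current mask rather than used directly.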
Nesting
Structured constructs can nest to arbitrary depth. Each nesting level pushes one entry onto the divergence stack. In practice, kernel divergence depth rarely exceeds 4 to 6 levels, and the stack is bounded by implementation-defined limits exposed through the WAVE runtime query interface.
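A toolchain could check a kernel's nesting depth against the implementation limit before dispatch. The Python sketch below counts the deepest nesting in a sequence of control-flow mnemonics, assuming that each `if` and each `loop` pushes exactly one entry. The function name `max_divergence_depth` and the `STACK_LIMIT` value are hypothetical; a real limit would come from the runtime query interface, whose API is not shown here.

```python
def max_divergence_depth(mnemonics):
    """Return the deepest nesting of if/loop constructs in a mnemonic sequence.

    Only depth is tracked; the check does not verify that an `if` is closed
    by `endif` rather than `endloop`.
    """
    depth = deepest = 0
    for op in mnemonics:
        if op in ("if", "loop"):
            depth += 1                      # one stack entry per nesting level
            deepest = max(deepest, depth)
        elif op in ("endif", "endloop"):
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched {op}")
    if depth != 0:
        raise ValueError("unclosed construct")
    return deepest


STACK_LIMIT = 8  # hypothetical value standing in for the queried limit

kernel = ["if", "loop", "if", "break", "endif", "endloop", "else", "endif"]
depth = max_divergence_depth(kernel)
fits = depth <= STACK_LIMIT
```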