WAVE Specification
The WAVE specification defines the instruction set architecture, binary encoding format, memory model, and execution semantics for the WAVE GPU compute model.
Specification Versions
v0.4 — Current
The v0.4 specification is the current version of the WAVE ISA. It fixes predicate encoding (Defect 4), restructures word0 bits [3:0], and adds neural network training as conformance evidence.
Key changes from v0.3:
- Predicate encoding restored. Bits [3:0] of word0 repurposed for predicate fields: pred_reg (bits [1:0]), pred_neg (bit [2]), with bit [3] reserved. Previously these bits held scope and flags, which silently dropped all predication.
- Scope moved to word1. Memory scope is now encoded in word1 bits [1:0] for extended instructions (DeviceAtomic, fence). This freed word0 bits for predicate encoding.
- New opcode: Misc (0x41). Misc operations (mov, mov_imm, mov_sr) moved from Control (0x3F) to their own opcode, since the flags field they used for dispatch no longer exists.
- SyncOp modifier offset. Sync operations (return, halt, barrier, fence) share Control opcode 0x3F but use modifier values offset by +8 (SYNC_MODIFIER_OFFSET = 8).
- Type conversion naming convention clarified. The mnemonic `cvt_A_B` converts from type B to type A (destination type first). An emulator bug where `cvt_f32_i32` and `cvt_i32_f32` were swapped has been fixed.
- Neural network training verification. A complete two-layer MNIST network (784->128->10) has been trained using 11 WAVE kernels, with gradients verified against PyTorch.
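The v0.4 bit assignments listed above can be sketched as a small set of decoding helpers. This is an illustrative sketch rather than the reference implementation: the function names, the `Predicate` tuple, and the assumption that every Control modifier at or above the offset denotes a sync operation are hypothetical. Only the bit positions, the opcode values, and `SYNC_MODIFIER_OFFSET = 8` come from the change list.

```python
from typing import NamedTuple, Tuple

CONTROL_OPCODE = 0x3F        # shared by control and sync operations (v0.4)
MISC_OPCODE = 0x41           # mov, mov_imm, mov_sr (v0.4)
SYNC_MODIFIER_OFFSET = 8     # sync modifier values start here


class Predicate(NamedTuple):
    pred_reg: int    # word0 bits [1:0]
    pred_neg: bool   # word0 bit [2]


def decode_predicate(word0: int) -> Predicate:
    """Extract the v0.4 predicate fields from word0 bits [3:0]."""
    # Bit [3] is reserved and deliberately ignored in this sketch.
    return Predicate(pred_reg=word0 & 0b11, pred_neg=bool((word0 >> 2) & 1))


def decode_scope(word1: int) -> int:
    """Extract the memory scope from word1 bits [1:0] (extended instructions)."""
    return word1 & 0b11


def is_sync_modifier(modifier: int) -> bool:
    """Assumption: under opcode 0x3F, 4-bit modifiers >= 8 are sync operations."""
    return SYNC_MODIFIER_OFFSET <= modifier <= 0xF


def split_cvt(mnemonic: str) -> Tuple[str, str]:
    """Return (destination_type, source_type) for a cvt_A_B mnemonic."""
    prefix, dst, src = mnemonic.split("_")
    if prefix != "cvt":
        raise ValueError(f"not a conversion mnemonic: {mnemonic}")
    return dst, src
```

For example, `decode_predicate(0b0110)` yields `pred_reg=2` with `pred_neg=True`, and `split_cvt("cvt_f32_i32")` yields `("f32", "i32")`, meaning the instruction converts an `i32` source into an `f32` destination.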
v0.3
The v0.3 specification resolved encoding issues present in earlier versions and established the canonical instruction format used by all WAVE tooling.
Key changes from v0.2:
- Modifier field widened to 4 bits. The 3-bit modifier field from v0.2 was insufficient to encode all instruction variants. v0.3 expands it to 4 bits, supporting up to 16 modifier values per opcode.
- Flags field reduced from 3 bits to 2 bits. Two flags were removed:
  - `WAVE_REDUCE_FLAG`: now handled by dedicated wave operation opcodes.
  - `NON_RETURNING_ATOMIC_FLAG`: now encoded via the modifier field on atomic instructions.
- Canonical register encoding: 5-bit register fields encoding 32 general-purpose registers (`r0`–`r31`).
- 4-bit modifier enables fine-grained instruction variants (e.g., memory ordering, rounding mode, comparison predicate).
Instruction word layout (v0.3, 32-bit base format):
| Bits | Field | Width |
|---|---|---|
| 31–24 | Opcode | 8 |
| 23–20 | Modifier | 4 |
| 19–18 | Flags | 2 |
| 17–13 | Dst (rd) | 5 |
| 12–8 | Src1 (rs1) | 5 |
| 7–3 | Src2 (rs2) | 5 |
| 2–0 | Reserved | 3 |
Extended format instructions append a second 32-bit immediate word.
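The v0.3 base layout in the table can be exercised with a short pack/unpack sketch. The names `V03_FIELDS`, `encode_v03`, and `decode_v03` are hypothetical and not part of any WAVE tooling; the bit positions and widths are taken directly from the table, and the reserved bits [2:0] are left at zero.

```python
from typing import Dict

# Field name -> (lowest bit position, width), copied from the v0.3 layout table.
V03_FIELDS = {
    "opcode":   (24, 8),
    "modifier": (20, 4),
    "flags":    (18, 2),
    "rd":       (13, 5),
    "rs1":      (8, 5),
    "rs2":      (3, 5),
}


def encode_v03(**fields: int) -> int:
    """Pack field values into a 32-bit base word; omitted fields default to 0."""
    unknown = set(fields) - set(V03_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    word = 0
    for name, (shift, width) in V03_FIELDS.items():
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode_v03(word: int) -> Dict[str, int]:
    """Unpack a 32-bit base word into its named fields."""
    return {
        name: (word >> shift) & ((1 << width) - 1)
        for name, (shift, width) in V03_FIELDS.items()
    }
```

A round trip such as `decode_v03(encode_v03(opcode=0x3F, modifier=0xA, flags=1, rd=3, rs1=4, rs2=5))` returns the same six values, and passing `rd=32` raises a `ValueError` because the value no longer fits in 5 bits. The second immediate word of the extended format is not modeled here.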
v0.2
v0.2 fixed the most critical encoding bug from v0.1, the register fields, but still carried an undersized modifier field.
Key changes from v0.1:
- Register fields widened to 8 bits. v0.1 used 5-bit register fields but attempted to address more than 32 registers in some instructions, causing encoding collisions. v0.2 moved to 8-bit register fields, supporting up to 256 registers.
- Per-wave control flow. v0.1 shared a single control flow state across all waves in a workgroup. v0.2 introduced per-wave program counters and divergence masks, enabling proper SIMT execution with independent branching per wave.
- 3-bit modifier (unchanged from v0.1). This was later identified as a limitation and corrected in v0.3.
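To make the register collision concrete, here is a minimal sketch of one way it can arise. It assumes an out-of-range register index is simply truncated to the field width; the specification text does not state the exact mechanism, and `pack_register` is a hypothetical helper.

```python
def pack_register(index: int, width: int) -> int:
    """Truncate a register index to a field of the given bit width."""
    return index & ((1 << width) - 1)


# 5-bit field (v0.1): index 37 loses its top bit and aliases register 5.
collides = pack_register(37, 5) == pack_register(5, 5)

# 8-bit field (v0.2): indices up to 255 stay distinct.
distinct = pack_register(37, 8) != pack_register(5, 8)

# Modifier variants available at each width: 3 bits vs 4 bits.
variants_3bit, variants_4bit = 1 << 3, 1 << 4
```

Under this assumption, `collides` and `distinct` are both true, and the modifier widths give 8 and 16 encodable variants respectively.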
Known issues (fixed in v0.3):
- The 3-bit modifier field could only encode 8 variants, which was insufficient for memory ordering modes, comparison predicates, and rounding modes.
- `WAVE_REDUCE_FLAG` and `NON_RETURNING_ATOMIC_FLAG` consumed flag bits that could be better allocated.
- 8-bit register fields were unnecessarily wide: no real workload needed more than 32 registers per thread.
v0.1 — Initial Draft
The initial specification established the core instruction set and execution model. It was a proof-of-concept encoding that validated the overall architecture but contained several encoding issues.
Characteristics:
- 5-bit register fields encoding 32 general-purpose registers.
- 3-bit modifier field for instruction variants.
- 3-bit flags field including `WAVE_REDUCE_FLAG` and `NON_RETURNING_ATOMIC_FLAG`.
- Shared control flow state. All waves in a workgroup shared a single program counter and divergence mask. This meant branch divergence in one wave could stall the entire workgroup.
Known issues (fixed in v0.2):
- Shared control flow prevented efficient SIMT divergence handling.
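- Some instructions attempted to address more than 32 registers through the 5-bit register fields, causing encoding collisions. The 5-bit width itself was adequate: v0.2 responded by widening the fields to 8 bits, which v0.3 later reverted to 5 bits.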
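The difference between the shared control flow state of v0.1 and the per-wave state introduced in v0.2 can be sketched as two data structures. This is only a conceptual model: the class names are hypothetical, and the 32-lane mask width is an assumption that the text above does not specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlState:
    pc: int = 0
    active_mask: int = 0xFFFFFFFF  # one bit per lane; 32 lanes is an assumption


@dataclass
class WorkgroupV01:
    """v0.1 style: every wave sees the same program counter and mask."""
    num_waves: int
    shared: ControlState = field(default_factory=ControlState)

    def state_for(self, wave: int) -> ControlState:
        return self.shared


@dataclass
class WorkgroupV02:
    """v0.2 style: each wave owns its program counter and divergence mask."""
    num_waves: int
    per_wave: List[ControlState] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.per_wave = [ControlState() for _ in range(self.num_waves)]

    def state_for(self, wave: int) -> ControlState:
        return self.per_wave[wave]
```

Moving wave 0 to a new program counter in `WorkgroupV01` changes what wave 1 observes as well, whereas in `WorkgroupV02` wave 1 keeps its own counter and mask.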
Specification Contents
Each version of the specification covers the following sections:
- Introduction — purpose, scope, design principles, and relationship to other standards.
- Execution Model — thread hierarchy, identifiers, core resources, execution guarantees, and dispatch.
- Register Model — general-purpose registers, sub-register access, register pairs, special registers, and predicate registers.
- Memory Model — memory spaces, local/device memory details, memory ordering, and atomic operations.
- Control Flow — structured control flow, uniform branches, divergence/reconvergence, and per-wave state.
- Instruction Set — integer, bitwise, floating-point, type conversion, comparison, memory, atomic, wave, synchronization, control flow, and MMA instructions.
- Capability System — required constants, optional capabilities, MMA parameters, and query mechanism.
- Binary Encoding — base and extended instruction formats, opcode map.
- Conformance — required behavior, implementation-defined behavior, undefined behavior, and conformance testing.