Skip to content

Binary Encoding

The WAVE binary encoding packs every instruction into a fixed 32-bit word, with an optional second word for extended operations, balancing decode simplicity against expressiveness.

Every WAVE instruction starts with a 32-bit word with the following bit layout:

BitsFieldWidthPurpose
31:24opcode8 bitsOperation class
23:16RD8 bitsDestination register (0—255)
15:8RS18 bitsSource register 1 (0—255)
7:4modifier4 bitsOperation variant within opcode class (0—15)
3reserved1 bitReserved, must be zero
2pred_neg1 bitNegate predicate condition
1:0pred_reg2 bitsPredicate register index (0=p0, 1=p1, 2=p2, 3=p3)
31 24 23 16 15 8 7 4 3 2 1 0
┌────────┬────────┬────────┬──────┬───┬───┬────┐
│ opcode │ RD │ RS1 │ mod │rsv│neg│pred│
│ 8b │ 8b │ 8b │ 4b │1b │1b │ 2b │
└────────┴────────┴────────┴──────┴───┴───┴────┘

When pred_reg=0 and pred_neg=0, the instruction is unconditional. When pred_reg is nonzero, the instruction executes only if the specified predicate register is true (or false, if pred_neg=1). For example, @p1 sets pred_reg=1, pred_neg=0; @!p2 sets pred_reg=2, pred_neg=1.

This encoding was introduced in v0.4. Earlier versions used bits [3:0] for scope and flags, which silently dropped all predication. See Spec Defects for the full history (Defect 4).

Eight bits address 256 registers per thread. In practice, most kernels use far fewer, but the wide field simplifies encoding and avoids the v0.1 mismatch between spec text and register field width. Two 8-bit register fields (rd, rs1) fit in word0; additional source registers (rs2, rs3, rs4) are encoded in word1.

The modifier field disambiguates variants within an opcode class. For example, the FUnaryOp opcode class uses the modifier to select between frsqrt (0), frcp (1), … fsin (8), fcos (9), fexp2 (10), flog2 (11). Four bits encode values 0—15, covering all current variants with room to grow. The earlier 3-bit modifier could only encode 0—7, which caused a concrete encoding failure documented in Modifier Field Evolution.

The Control opcode (0x3F) uses modifier values 0—7 for ControlOp (if, else, endif, loop, break, continue, endloop, call) and values 8—15 for SyncOp (return, halt, barrier, fence variants, wait, nop), offset by SYNC_MODIFIER_OFFSET = 8.

When an instruction needs additional source registers, a memory scope, or an inline immediate, it uses a second 32-bit word. The opcode in word0 determines whether word1 is present.

BitsFieldWidthPurpose
31:24RS28 bitsSource register 2 (0—255)
23:16RS38 bitsSource register 3 (0—255)
15:8RS48 bitsSource register 4 (0—255)
7:2reserved6 bitsReserved, must be zero
1:0scope2 bitsMemory scope (00=wave, 01=workgroup, 10=device, 11=system)
31 24 23 16 15 8 7 2 1 0
┌────────┬────────┬────────┬──────────┬────┐
│ RS2 │ RS3 │ RS4 │ reserved │scop│
│ 8b │ 8b │ 8b │ 6b │ 2b │
└────────┴────────┴────────┴──────────┴────┘

Alternatively, the entire word1 can serve as a 32-bit immediate value (e.g., for mov_imm). The opcode determines interpretation.

The 2-bit scope field in word1 encodes memory ordering visibility for scoped instructions (DeviceAtomic, fence):

ValueScopeMeaning
00WaveVisible within the executing wave
01WorkgroupVisible to all waves in the workgroup
10DeviceVisible to all waves on the device
11SystemVisible to the device and host CPU

In v0.3, scope was encoded in word0. It was moved to word1 in v0.4 to free bits for predicate encoding.

  • Two-source operations (iadd r0, r1, r2): needs RS1 (word0) and RS2 (word1).
  • Fused multiply-add (fma r0, r1, r2, r3): needs RS1, RS2, and RS3.
  • Immediate loads (mov_imm r0, 0x3F800000): the 32-bit immediate occupies the entire word1.
  • Scoped atomics (device_atomic_add r0, r1, r2, workgroup): scope in word1 bits [1:0].
  • Atomic compare-and-swap: needs address, expected, desired, and result registers.

Compiled WAVE programs are distributed as .wbin (WAVE Binary) files. The container is designed for direct memory-mapping without parsing overhead.

OffsetSizeFieldDescription
04magicASCII "WAVE" (0x57415645)
44versionContainer format version
84code_offsetByte offset to code section
124code_sizeSize of code section in bytes
164symbol_offsetByte offset to symbol table
204symbol_sizeSize of symbol table in bytes
244metadata_offsetByte offset to metadata section
284metadata_sizeSize of metadata section in bytes

Each kernel entry in the metadata section describes the resource requirements that a backend needs for dispatch:

FieldPurpose
nameKernel identifier (null-terminated UTF-8)
register_countNumber of registers used per thread
local_mem_sizeBytes of local (workgroup-shared) memory required
workgroup_size_xWorkgroup dimension X
workgroup_size_yWorkgroup dimension Y
workgroup_size_zWorkgroup dimension Z
code_offsetByte offset of this kernel’s code within the code section
code_sizeSize of this kernel’s code in bytes

The WBIN format is intentionally minimal. It carries exactly the information a backend runtime needs to allocate resources and dispatch kernels. There is no debug info, no relocations, and no linking metadata in the base format --- these are relegated to optional extension sections identified by the metadata table. This keeps the critical path (load, allocate, dispatch) as fast as possible.