
WAVE Specification v0.1

Version: 0.1 (Working Draft)
Authors: Ojima Abraham, Onyinye Okoli
Date: March 22, 2026
Status: Working Draft

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

This specification defines WAVE (Wide Architecture Virtual Encoding), a vendor-neutral instruction set architecture for general-purpose GPU computation. It specifies an abstract execution model, a register model, a memory model, structured control flow semantics, an instruction set, and a capability query system.

The specification follows the thin abstraction principle: it defines what a compliant implementation MUST be able to do, not how it must do it. Implementations MAY use any microarchitectural technique to achieve compliance.

This specification covers general-purpose compute workloads. Graphics pipeline operations (rasterization, tessellation, pixel export, ray tracing) are out of scope and MAY be addressed by future extensions.

The specification is built on four design principles:

  1. Thin abstraction. Every requirement traces to a hardware-invariant primitive observed across all four major GPU vendors. No requirement is imposed for software convenience alone.
  2. Queryable parameters. Values that differ across implementations (wave width, register count, scratchpad size) are exposed as queryable constants, not fixed in the specification.
  3. Structured divergence. The specification defines control flow semantics but not divergence mechanisms. Implementations are free to use any technique (execution masks, predication, hardware stacks, per-thread program counters) to achieve the specified behavior.
  4. Mandatory minimums. Every queryable parameter has a minimum value. A compliant implementation MUST meet or exceed all minimums.

This specification is complementary to existing standards. SPIR-V MAY be used as a distribution format for programs targeting this ISA. OpenCL and Vulkan MAY serve as host APIs for dispatching workloads. The distinction is that this specification defines the hardware execution model, while existing standards define host-device interaction.

A compliant processor consists of one or more Cores. Each Core is an independent compute unit capable of executing multiple Workgroups concurrently. Cores are not addressable by software. The hardware assigns Workgroups to Cores, and the programmer MUST NOT assume any particular mapping.

The execution model defines four levels, three mandatory and one optional:

Level 0: Thread. The smallest unit of execution. A Thread has a private register file, a scalar program counter, and a position within the hierarchy identified by hardware-populated identity values. A Thread executes a sequential stream of instructions.

Level 1: Wave. A group of exactly W Threads that execute a single instruction simultaneously, where W is a hardware constant queryable at compile time (see Section 7). All Threads in a Wave share a program counter for the purpose of instruction fetch. When Threads in a Wave disagree on a branch condition, the implementation MUST ensure that both paths execute with inactive Threads producing no architectural side effects. The mechanism by which this is achieved is not specified.

A Wave is the fundamental scheduling unit. The hardware scheduler operates on Waves, not individual Threads.

Level 2: Workgroup. A group of one or more Waves, containing up to MAX_WORKGROUP_SIZE Threads. All Waves in a Workgroup execute on the same Core, share access to Local Memory, and may synchronize via Barriers.

The number of Waves per Workgroup is ceil(workgroup_thread_count / W).
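On the host side this is plain integer ceiling division. A minimal sketch (the function name is illustrative; `W` would be obtained via the WAVE_WIDTH query in Section 7):

```c
#include <stdint.h>

/* Waves needed to cover a Workgroup of `threads` Threads at Wave width `w`:
 * ceil(threads / w) in integer arithmetic. */
static uint32_t waves_per_workgroup(uint32_t threads, uint32_t w) {
    return (threads + w - 1) / w;
}
```

A Workgroup of 100 Threads on a W = 32 implementation thus occupies 4 Waves, with 28 inactive lanes in the last Wave.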

Workgroup dimensions are specified at dispatch time as a 3-dimensional size (x, y, z) where x * y * z <= MAX_WORKGROUP_SIZE.

Level 3: Grid. The complete dispatch of Workgroups. A Grid is specified as a 3-dimensional count of Workgroups. Workgroups within a Grid MAY execute in any order, on any Core, at any time. No synchronization is available between Workgroups within a single Grid dispatch.

Level 2.5 (Optional): Cluster. A group of Workgroups guaranteed to execute concurrently on adjacent Cores with access to each other’s Local Memory. The Cluster size is queryable via CLUSTER_SIZE. If the implementation does not support Clusters, CLUSTER_SIZE is 1 and Cluster-scope operations behave identically to Workgroup-scope operations.

Every Thread has the following hardware-populated, read-only values available as special registers:

| Identifier | Type | Description |
|---|---|---|
| thread_id.{x,y,z} | uint32 | Thread position within Workgroup (3D) |
| wave_id | uint32 | Wave index within Workgroup |
| lane_id | uint32 | Thread position within Wave (0 to W-1) |
| workgroup_id.{x,y,z} | uint32 | Workgroup position within Grid (3D) |
| workgroup_size.{x,y,z} | uint32 | Workgroup dimensions |
| grid_size.{x,y,z} | uint32 | Grid dimensions (in Workgroups) |
| num_waves | uint32 | Number of Waves in this Workgroup |

Each Core provides the following resources:

Register File. A fixed-size on-chip storage of F bytes, partitioned among all simultaneously resident Threads. Each Thread receives R registers (declared at compile time). The maximum number of simultaneously resident Waves is bounded by:

max_resident_waves = floor(F / (R * W * 4))

where 4 is the register width in bytes (32 bits). This is the occupancy equation. Implementations MUST support at least MIN_MAX_REGISTERS registers per Thread.

Local Memory. A fixed-size on-chip scratchpad of S bytes, shared among all Waves in the same Workgroup. Local Memory is explicitly addressed via load and store instructions. There is no automatic caching or data placement. Local Memory contents are undefined at Workgroup start and are not preserved across Workgroup boundaries. Implementations MUST provide at least MIN_LOCAL_MEMORY_SIZE bytes.

Hardware Scheduler. Selects a ready Wave for execution each cycle. When a Wave stalls on a memory access, barrier, or other long-latency operation, the scheduler MUST be able to select another resident Wave without software-visible overhead. The scheduling policy is implementation-defined.

The execution model provides the following guarantees:

  1. All Threads in a Wave execute the same instruction in the same cycle, or appear to from the programmer’s perspective.
  2. Inactive Threads (those on the non-taken path of a divergent branch) produce no architectural side effects (no memory writes, no register updates visible to other Threads).
  3. All Threads in a Workgroup that reach a Barrier will synchronize at that Barrier before any proceeds past it.
  4. Memory operations to Local Memory are visible to all Threads in the Workgroup after a Barrier.
  5. Memory operations to Device Memory are visible according to the memory ordering rules (Section 4).
  6. Wave-level operations (reductions, broadcasts, shuffles) operate only on active Threads.
  7. A compliant implementation MUST eventually make progress on at least one Wave per Core.

A dispatch consists of specifying:

  1. A program (compiled binary containing instructions for this ISA)
  2. Grid dimensions (number of Workgroups in x, y, z)
  3. Workgroup dimensions (number of Threads in x, y, z)
  4. Kernel arguments (Device Memory pointers, constants)

The implementation distributes Workgroups to Cores and creates Waves within each Workgroup. The distribution policy is implementation-defined.

Each Thread has access to R general-purpose registers, where R is declared at compile time; implementations MUST support values of R up to at least MIN_MAX_REGISTERS. Registers are 32 bits wide and are denoted r0, r1, …, r(R-1). Register contents are undefined at Thread start unless explicitly initialized.

A 32-bit register may be accessed as two 16-bit halves:

  • r0.lo — bits [15:0]
  • r0.hi — bits [31:16]

Or as four 8-bit bytes (optional capability):

  • r0.b0 — bits [7:0]
  • r0.b1 — bits [15:8]
  • r0.b2 — bits [23:16]
  • r0.b3 — bits [31:24]

64-bit operations use register pairs. The notation r0:r1 denotes a 64-bit value where r0 holds the low 32 bits and r1 holds the high 32 bits. Register pairs MUST use adjacent even-odd register numbers (e.g., r0:r1, r2:r3, but not r1:r2).

The following read-only registers are populated by hardware:

  • %tid.x, %tid.y, %tid.z — Thread ID within Workgroup
  • %wid — Wave ID within Workgroup
  • %lid — Lane ID within Wave
  • %ctaid.x, %ctaid.y, %ctaid.z — Workgroup ID within Grid
  • %ntid.x, %ntid.y, %ntid.z — Workgroup dimensions
  • %nctaid.x, %nctaid.y, %nctaid.z — Grid dimensions
  • %nwaves — Number of Waves in this Workgroup
  • %clock — Cycle counter (implementation-defined resolution)

Implementations MUST provide at least 8 predicate registers, denoted p0 through p7. Each predicate register holds a boolean value (1 bit per Thread). Predicate registers are used for conditional execution and branch conditions.

Compilers declare the number of registers required by a kernel. The runtime uses this to compute occupancy:

occupancy = min(
    max_waves_per_core,
    floor(register_file_size / (registers_per_thread * wave_width * 4)),
    floor(local_memory_size / local_memory_per_workgroup)
)

Higher occupancy generally improves latency hiding but reduces registers and Local Memory available per Thread.
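As a sketch of the runtime's side of this computation (illustrative names; the parameters map to the queryable constants in Section 7, and a Workgroup that declares no Local Memory simply drops the third term):

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Resident Waves per Core, limited by the hardware cap, by register-file
 * capacity (4 bytes per 32-bit register), and by Local Memory usage. */
static uint32_t occupancy(uint32_t max_waves_per_core,
                          uint32_t register_file_size,      /* F, bytes */
                          uint32_t registers_per_thread,    /* R */
                          uint32_t wave_width,              /* W */
                          uint32_t local_memory_size,       /* S, bytes */
                          uint32_t local_mem_per_workgroup) /* bytes, may be 0 */
{
    uint32_t waves = min_u32(max_waves_per_core,
                             register_file_size / (registers_per_thread * wave_width * 4));
    if (local_mem_per_workgroup > 0)
        waves = min_u32(waves, local_memory_size / local_mem_per_workgroup);
    return waves;
}
```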

The specification defines four memory spaces:

| Space | Scope | Lifetime | Typical Implementation |
|---|---|---|---|
| Private | Single Thread | Thread lifetime | Registers or stack |
| Local | Workgroup | Workgroup lifetime | On-chip SRAM |
| Device | All Threads | Kernel lifetime (at least) | VRAM |
| Constant | All Threads | Kernel lifetime | Cached read-only |

Local Memory MUST support atomic operations. Access latency is implementation-defined but SHOULD be significantly lower than Device Memory. Bank conflicts MAY increase access latency but MUST NOT affect correctness. The implementation MUST provide at least MIN_LOCAL_MEMORY_SIZE bytes per Workgroup.

Device Memory is the largest but slowest memory space. It persists across kernel launches (until explicitly deallocated by the host). Device Memory MUST support atomic operations. Coalescing (combining multiple Thread accesses into fewer memory transactions) is implementation-defined but strongly encouraged.

Within a single Thread, memory operations appear to execute in program order. Across Threads, the ordering is relaxed by default. Explicit ordering is achieved via:

Fences. A fence instruction establishes ordering between memory operations before and after the fence. Fences are parameterized by scope:

  • fence.wave — ordering visible to all Threads in the Wave
  • fence.workgroup — ordering visible to all Threads in the Workgroup
  • fence.device — ordering visible to all Threads on the device

Barriers. A Barrier (barrier) implies a Workgroup-scope fence. All memory operations before the Barrier are visible to all Threads in the Workgroup after the Barrier.
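As an illustration of the Barrier's visibility guarantee, the following sketch (in this specification's assembly; register assignments are arbitrary, and bounds handling and immediate operand forms are assumed) stages a value through Local Memory and reads a neighbor's slot only after the Barrier:

```
; Each Thread publishes r0 at its own Local Memory slot, then reads
; the next Thread's slot. Without the barrier, the load could observe
; an undefined value (Local Memory is undefined at Workgroup start).
shl.b32      r1, %tid.x, 2      ; byte offset = thread_id.x * 4
st.local.b32 [r1], r0           ; publish this Thread's value
barrier                         ; stores now visible Workgroup-wide
add.s32      r2, r1, 4          ; neighbor's offset (wraparound elided)
ld.local.b32 r3, [r2]           ; read neighbor's value
```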

Atomic operations are indivisible read-modify-write operations. Implementations MUST support the following atomic operations on both Local and Device Memory:

  • atom.add, atom.sub — addition/subtraction
  • atom.min, atom.max — minimum/maximum (signed and unsigned)
  • atom.and, atom.or, atom.xor — bitwise operations
  • atom.exch — exchange
  • atom.cas — compare-and-swap

64-bit atomics are an optional capability.
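Operations outside this list can be synthesized from atom.cas. A hedged sketch of a floating-point atomic add (not a required instruction) built from a compare-and-swap retry loop, in this specification's assembly:

```
; r1 = Device Memory address, r2 = f32 addend
loop
    ld.global.b32       r3, [r1]            ; snapshot current value
    add.f32             r4, r3, r2          ; proposed new value
    atom.global.cas.b32 r5, [r1], r3, r4    ; swap only if still r3
    setp.eq.s32         p0, r5, r3          ; old value matched => success
    break p0                                ; otherwise retry
endloop
```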

All control flow MUST be structured. The specification defines the following control flow primitives:

  • if pd, label_else, label_endif — conditional branch based on predicate pd
  • else — marks the start of the else block
  • endif — marks the end of the if/else construct
  • loop label_end — begins a loop, label_end marks the end
  • endloop — marks the end of a loop
  • break pd — exits the innermost loop if predicate is true
  • continue pd — jumps to the next iteration of the innermost loop

Arbitrary goto is not supported. All control flow graphs MUST be reducible.
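For illustration, a per-Thread absolute value written with these primitives (a sketch; label operands follow the if signature above, and the copy on the else path uses an immediate-operand add as a stand-in, since the instruction set defines no mov):

```
setp.lt.s32 p0, r0, 0       ; p0 = (r0 < 0), per Thread
if p0, L_else, L_end
    neg.s32 r1, r0          ; lanes with negative r0
else                        ; L_else
    add.s32 r1, r0, 0       ; copy r0 (stand-in for a mov)
endif                       ; L_end: all lanes reconverge here
```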

A branch is uniform if all active Threads in a Wave take the same direction. Uniform branches have no divergence overhead. Compilers SHOULD annotate branches as uniform when known, allowing optimizations.

The syntax for annotating a branch as uniform is @uniform if pd, .... If a branch annotated as uniform diverges at runtime, behavior is undefined.

When Threads in a Wave disagree on a branch condition, the Wave is divergent. The implementation MUST execute both paths, but the order and mechanism are implementation-defined. Possible mechanisms include:

  • Execution masks (execute both paths with some Threads inactive)
  • Stack-based reconvergence
  • Per-Thread program counters

The specification only requires that:

  1. All Threads reconverge after structured control flow constructs (at endif, endloop).
  2. Inactive Threads produce no side effects.
  3. Wave-level operations operate only on active Threads.

All instructions follow the format:

[predicate] opcode[.modifiers] destination, source1, source2, ...

Where:

  • predicate (optional): @p0 through @p7 for conditional execution
  • opcode: instruction mnemonic
  • modifiers: type suffixes, rounding modes, etc.
  • destination: target register
  • source: source operands (registers, immediates)
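A few concrete instances of the format (illustrative):

```
add.s32     r4, r2, r3        ; unpredicated, .s32 type modifier
@p1 add.s32 r4, r2, r3        ; executes only in lanes where p1 is true
mad.lo.s32  r5, r2, r3, r4    ; multiple sources, single destination
```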
| Instruction | Description | Operation |
|---|---|---|
| add.s32 rd, ra, rb | Signed 32-bit add | rd = ra + rb |
| add.u32 rd, ra, rb | Unsigned 32-bit add | rd = ra + rb |
| sub.s32 rd, ra, rb | Signed 32-bit subtract | rd = ra - rb |
| mul.lo.s32 rd, ra, rb | Signed multiply (low 32 bits) | rd = (ra * rb)[31:0] |
| mul.hi.s32 rd, ra, rb | Signed multiply (high 32 bits) | rd = (ra * rb)[63:32] |
| mul.wide.s32 rd:rd+1, ra, rb | Signed 32x32->64 multiply | rd:rd+1 = ra * rb |
| mad.lo.s32 rd, ra, rb, rc | Multiply-add (low bits) | rd = (ra * rb)[31:0] + rc |
| div.s32 rd, ra, rb | Signed division | rd = ra / rb |
| rem.s32 rd, ra, rb | Signed remainder | rd = ra % rb |
| neg.s32 rd, ra | Negate | rd = -ra |
| abs.s32 rd, ra | Absolute value | rd = \|ra\| |
| min.s32 rd, ra, rb | Minimum | rd = min(ra, rb) |
| max.s32 rd, ra, rb | Maximum | rd = max(ra, rb) |
| Instruction | Description | Operation |
|---|---|---|
| and.b32 rd, ra, rb | Bitwise AND | rd = ra & rb |
| or.b32 rd, ra, rb | Bitwise OR | rd = ra \| rb |
| xor.b32 rd, ra, rb | Bitwise XOR | rd = ra ^ rb |
| not.b32 rd, ra | Bitwise NOT | rd = ~ra |
| shl.b32 rd, ra, rb | Shift left | rd = ra << rb |
| shr.u32 rd, ra, rb | Logical shift right | rd = ra >>> rb |
| shr.s32 rd, ra, rb | Arithmetic shift right | rd = ra >> rb |
| popc.b32 rd, ra | Population count | rd = popcount(ra) |
| clz.b32 rd, ra | Count leading zeros | rd = clz(ra) |
| brev.b32 rd, ra | Bit reverse | rd = reverse_bits(ra) |
| bfe.u32 rd, ra, rb, rc | Bit field extract | rd = (ra >> rb) & mask(rc) |
| bfi.b32 rd, ra, rb, rc, re | Bit field insert | Insert rc bits of ra at position rb into re |

All implementations MUST support IEEE 754-2008 single-precision (binary32) with the following exceptions:

  • Denormal inputs MAY be flushed to zero.
  • Denormal outputs MAY be flushed to zero.
  • NaN payloads are implementation-defined.
| Instruction | Description |
|---|---|
| add.f32 rd, ra, rb | Addition |
| sub.f32 rd, ra, rb | Subtraction |
| mul.f32 rd, ra, rb | Multiplication |
| fma.f32 rd, ra, rb, rc | Fused multiply-add (rd = ra*rb + rc) |
| div.f32 rd, ra, rb | Division |
| rcp.f32 rd, ra | Reciprocal (approximate, 1 ULP) |
| sqrt.f32 rd, ra | Square root |
| rsqrt.f32 rd, ra | Reciprocal square root (approximate) |
| neg.f32 rd, ra | Negate |
| abs.f32 rd, ra | Absolute value |
| min.f32 rd, ra, rb | Minimum (IEEE semantics) |
| max.f32 rd, ra, rb | Maximum (IEEE semantics) |
| sin.f32 rd, ra | Sine (approximate) |
| cos.f32 rd, ra | Cosine (approximate) |
| exp2.f32 rd, ra | 2^x (approximate) |
| log2.f32 rd, ra | log2(x) (approximate) |

Half-precision support is an optional capability. When present:

  • Two f16 values are packed into a single 32-bit register.
  • Operations operate on pairs (vec2) or scalars.
| Instruction | Description |
|---|---|
| add.f16x2 rd, ra, rb | Packed f16 addition |
| mul.f16x2 rd, ra, rb | Packed f16 multiplication |
| fma.f16x2 rd, ra, rb, rc | Packed f16 FMA |

Double-precision support is an optional capability. When present, operations use register pairs.

| Instruction | Description |
|---|---|
| add.f64 rd:rd+1, ra:ra+1, rb:rb+1 | f64 addition |
| mul.f64 rd:rd+1, ra:ra+1, rb:rb+1 | f64 multiplication |
| fma.f64 rd:rd+1, ra:ra+1, rb:rb+1, rc:rc+1 | f64 FMA |
| div.f64 rd:rd+1, ra:ra+1, rb:rb+1 | f64 division |
| Instruction | Description |
|---|---|
| cvt.f32.s32 rd, ra | Signed int to float |
| cvt.f32.u32 rd, ra | Unsigned int to float |
| cvt.s32.f32 rd, ra | Float to signed int (truncate) |
| cvt.rni.s32.f32 rd, ra | Float to signed int (round nearest) |
| cvt.f16.f32 rd, ra | f32 to f16 (packed) |
| cvt.f32.f16 rd, ra | f16 to f32 (unpacked) |
| cvt.f64.f32 rd:rd+1, ra | f32 to f64 |
| cvt.f32.f64 rd, ra:ra+1 | f64 to f32 |
| Instruction | Description |
|---|---|
| setp.eq.s32 pd, ra, rb | Set predicate if equal |
| setp.ne.s32 pd, ra, rb | Set predicate if not equal |
| setp.lt.s32 pd, ra, rb | Set predicate if less than |
| setp.le.s32 pd, ra, rb | Set predicate if less or equal |
| setp.gt.s32 pd, ra, rb | Set predicate if greater than |
| setp.ge.s32 pd, ra, rb | Set predicate if greater or equal |
| selp.s32 rd, ra, rb, pd | Select: rd = pd ? ra : rb |
| slct.s32.f32 rd, ra, rb, rc | Select based on sign: rd = (rc >= 0) ? ra : rb |

Floating-point comparisons include special handling for NaN:

  • setp.eq.f32: false if either operand is NaN
  • setp.neu.f32: true if either operand is NaN (unordered not-equal)
| Instruction | Description |
|---|---|
| ld.local.b32 rd, [ra] | Load 32 bits from Local Memory |
| ld.local.b64 rd:rd+1, [ra] | Load 64 bits from Local Memory |
| ld.global.b32 rd, [ra] | Load 32 bits from Device Memory |
| ld.const.b32 rd, [ra] | Load 32 bits from Constant Memory |
| st.local.b32 [ra], rb | Store 32 bits to Local Memory |
| st.global.b32 [ra], rb | Store 32 bits to Device Memory |

Vector loads/stores are supported for efficiency:

  • ld.global.v2.b32 {rd, rd+1}, [ra] — load 2x32 bits
  • ld.global.v4.b32 {rd, rd+1, rd+2, rd+3}, [ra] — load 4x32 bits
| Instruction | Description |
|---|---|
| atom.local.add.u32 rd, [ra], rb | Atomic add to Local Memory |
| atom.global.add.u32 rd, [ra], rb | Atomic add to Device Memory |
| atom.global.cas.b32 rd, [ra], rb, rc | Compare-and-swap |
| atom.global.exch.b32 rd, [ra], rb | Exchange |
| atom.global.min.s32 rd, [ra], rb | Atomic minimum |
| atom.global.max.s32 rd, [ra], rb | Atomic maximum |

All atomic operations return the value before the operation was applied.

Wave operations perform computation across all active Threads in a Wave.

| Instruction | Description |
|---|---|
| wave.reduce.add.u32 rd, ra | Sum of ra across all active lanes |
| wave.reduce.min.s32 rd, ra | Minimum of ra across all active lanes |
| wave.reduce.max.s32 rd, ra | Maximum of ra across all active lanes |
| wave.reduce.and.b32 rd, ra | Bitwise AND across all active lanes |
| wave.reduce.or.b32 rd, ra | Bitwise OR across all active lanes |
| wave.broadcast.b32 rd, ra, rb | Broadcast ra from lane rb to all lanes |
| wave.shuffle.b32 rd, ra, rb | rd = ra from lane rb |
| wave.shuffle.xor.b32 rd, ra, rb | rd = ra from lane (lane_id ^ rb) |
| wave.shuffle.up.b32 rd, ra, rb | rd = ra from lane (lane_id - rb) |
| wave.shuffle.down.b32 rd, ra, rb | rd = ra from lane (lane_id + rb) |
| wave.prefix.add.u32 rd, ra | Exclusive prefix sum |
| wave.ballot.b32 rd, pd | rd = bitmask of pd across all lanes |
| wave.any pd, ps | pd = true if any active lane has ps=true |
| wave.all pd, ps | pd = true if all active lanes have ps=true |
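These primitives compose into larger reductions. A hedged sketch of the first half of a Workgroup-wide sum: each Wave reduces its lanes, and lane 0 of each Wave publishes the partial result to Local Memory (Wave 0 would then combine the num_waves partials after the Barrier; immediate operand forms are assumed):

```
wave.reduce.add.u32 r1, r0      ; per-Wave sum, replicated in every lane
setp.eq.s32  p0, %lid, 0        ; lane 0 of each Wave...
@p0 shl.b32  r2, %wid, 2        ; ...computes its slot (wave_id * 4 bytes)
@p0 st.local.b32 [r2], r1       ; ...and publishes the partial sum
barrier                         ; all partials visible Workgroup-wide
```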
| Instruction | Description |
|---|---|
| barrier | Workgroup barrier + memory fence |
| fence.wave | Memory fence, Wave scope |
| fence.workgroup | Memory fence, Workgroup scope |
| fence.device | Memory fence, Device scope |
| Instruction | Description |
|---|---|
| if pd | Begin if block |
| else | Begin else block |
| endif | End if/else block |
| loop | Begin loop |
| endloop | End loop |
| break pd | Break from loop if predicate true |
| continue pd | Continue to next iteration if predicate true |
| ret | Return from kernel |

Matrix multiply-accumulate (MMA) operations are an optional capability. When present, implementations support tensor core-style operations:

mma.m16n8k8.f32.f16 {d0,d1,d2,d3}, {a0,a1}, {b0}, {c0,c1,c2,c3}

The exact shapes and data types are queryable via the capability system (see Section 7).

Every implementation MUST provide the following queryable constants:

| Constant | Minimum | Description |
|---|---|---|
| WAVE_WIDTH | 16 | Number of Threads per Wave |
| MAX_WORKGROUP_SIZE | 256 | Maximum Threads per Workgroup |
| MAX_REGISTERS | 64 | Maximum registers per Thread |
| LOCAL_MEMORY_SIZE | 16384 | Bytes of Local Memory per Workgroup |
| MAX_WAVES_PER_CORE | 16 | Maximum resident Waves per Core |
| PREDICATE_REGISTERS | 8 | Number of predicate registers |
| CLUSTER_SIZE | 1 | Workgroups per Cluster (1 = no clusters) |

Implementations MAY support the following optional features:

| Capability | Description |
|---|---|
| CAP_F16 | Half-precision floating-point |
| CAP_F64 | Double-precision floating-point |
| CAP_ATOMIC64 | 64-bit atomic operations |
| CAP_MMA | Matrix multiply-accumulate |
| CAP_DP4A | 4-element dot product (int8) |
| CAP_SUBGROUPS | Subgroup operations (partial wave) |
| CAP_CLUSTER | Cluster-scope operations |

If CAP_MMA is present, the following constants define supported MMA shapes:

| Constant | Description |
|---|---|
| MMA_M | M dimension (rows of output) |
| MMA_N | N dimension (columns of output) |
| MMA_K | K dimension (inner dimension) |
| MMA_INPUT_TYPES | Bitmask of supported input types |
| MMA_OUTPUT_TYPES | Bitmask of supported output types |

The host runtime provides a query function:

wave_result wave_get_capability(wave_device device, wave_capability cap, void* value, size_t size);

Where cap is an enumeration of all queryable constants and capabilities. The function writes the value to the provided buffer and returns a success/error code.

Example usage:

uint32_t wave_width;
wave_get_capability(device, WAVE_CAP_WAVE_WIDTH, &wave_width, sizeof(wave_width));

Instructions are encoded in either a 32-bit base format or a 64-bit extended format. The base format accommodates common operations with register operands. The extended format adds support for immediate values, additional operands, and modifiers.

  31  26 25    21 20    16 15    11 10     6 5      0
+--------+--------+--------+--------+--------+--------+
| opcode |   rd   |   ra   |   rb   |  pred  | flags  |
|(6 bits)|(5 bits)|(5 bits)|(5 bits)|(5 bits)|(6 bits)|
+--------+--------+--------+--------+--------+--------+
  • opcode: Primary operation code (64 base opcodes)
  • rd: Destination register
  • ra, rb: Source registers
  • pred: Predicate register (0 = unconditional, 1-7 = p1-p7, 8-15 = !p0-!p7)
  • flags: Operation-specific flags (saturation, rounding, etc.)
  63  58 57    53 52    48 47    43 42    38 37    32
+--------+--------+--------+--------+--------+--------+
| 1 1 1  | opcode |   rd   |   ra   |   rb   |   rc   |
|(3 bits)|(5 bits)|(5 bits)|(5 bits)|(5 bits)|(5 bits)|
+--------+--------+--------+--------+--------+--------+
  31                                                 0
+-----------------------------------------------------+
|                      immediate                      |
|                      (32 bits)                      |
+-----------------------------------------------------+

The extended format is indicated by opcode bits [31:29] = 111. This allows for:

  • 4-operand instructions (FMA, MMA)
  • 32-bit immediate values
  • Extended opcodes (224 additional operations)
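A hedged sketch of decoding the base format with shifts and masks, following the field layout above (the type and function names are illustrative, not part of the specification):

```c
#include <stdint.h>

/* Base-format field layout:
 * opcode [31:26], rd [25:21], ra [20:16], rb [15:11], pred [10:6], flags [5:0]. */
typedef struct {
    uint32_t opcode, rd, ra, rb, pred, flags;
} wave_insn;

static wave_insn decode_base(uint32_t word) {
    wave_insn i;
    i.opcode = (word >> 26) & 0x3F;
    i.rd     = (word >> 21) & 0x1F;
    i.ra     = (word >> 16) & 0x1F;
    i.rb     = (word >> 11) & 0x1F;
    i.pred   = (word >> 6)  & 0x1F;
    i.flags  =  word        & 0x3F;
    return i;
}
```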

A conformant implementation MUST:

  1. Execute all required instructions with correct semantics.
  2. Meet or exceed all minimum capability values.
  3. Provide correct memory ordering per Section 4.
  4. Correctly handle divergence per Section 5.
  5. Report capabilities accurately via Section 7.

The following are implementation-defined:

  • Scheduling policy (Wave selection, Workgroup assignment)
  • Memory coalescing behavior
  • Bank conflict penalties
  • Divergence mechanism
  • NaN payload values
  • Clock counter resolution

The following result in undefined behavior:

  • Barrier within divergent control flow
  • Uniform branch that diverges at runtime
  • Access to unallocated registers
  • Out-of-bounds memory access
  • Data races (concurrent access without synchronization where at least one is a write)

The reference test suite (forthcoming) will include:

  • Instruction correctness tests for all required operations
  • Memory ordering tests
  • Divergence correctness tests
  • Capability query tests
  • Edge case tests (overflow, NaN handling, etc.)

Deferred to a future version of this specification.

This appendix provides suggested mappings to vendor ISAs for reference implementations:

| WAVE Concept | NVIDIA (PTX) | AMD (RDNA) | Intel (Xe) | Apple (M-series) |
|---|---|---|---|---|
| Wave | Warp (32) | Wave (32/64) | SIMD (8/16) | SIMD-group (32) |
| Workgroup | Thread Block | Work-group | Thread Group | Threadgroup |
| Local Memory | Shared Memory | LDS | SLM | Threadgroup Memory |
| Device Memory | Global Memory | VRAM | Global Memory | Device Memory |
| wave.shuffle | shfl.sync | ds_permute | mov (cross-lane) | quad_shuffle |
| Barrier | bar.sync | s_barrier | barrier | threadgroup_barrier |
| Fence | fence.sc | S_WAITCNT | scoreboard | wait_for_loads |
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-03-22 | Initial draft |