Introduction
WAVE (Wide Architecture Virtual Encoding) is a vendor-neutral GPU instruction set architecture that lets you write GPU kernels once and run them on Apple, NVIDIA, AMD, and Intel hardware without modification.
The Problem: GPU Vendor Lock-In
GPU computing today is fragmented. NVIDIA’s CUDA dominates with its mature ecosystem, but CUDA kernels only run on NVIDIA hardware. If you want to target Apple Silicon, you rewrite in Metal Shading Language. AMD requires ROCm/HIP. Intel requires oneAPI/SYCL. Each vendor defines its own ISA, memory model, synchronization primitives, and toolchain.
This means:
- Duplicated engineering effort - the same algorithm rewritten 2-4 times for different vendors.
- Vendor lock-in - choosing CUDA today locks out every non-NVIDIA GPU tomorrow.
- Fragmented testing - each port requires its own validation and performance tuning.
- Limited portability - researchers and startups cannot afford to maintain multiple GPU backends.
The Insight: 11 Hardware-Invariant Primitives
WAVE was built on a systematic analysis of 5,000+ pages of vendor ISA documentation spanning 16 microarchitectures across all four major GPU vendors. That analysis revealed that despite surface-level differences in encoding, naming, and calling conventions, every GPU architecture provides the same 11 categories of fundamental operations:
| # | Primitive Category | Description | Examples |
|---|---|---|---|
| 1 | Integer Arithmetic | Addition, subtraction, multiplication, division, modulo on integer types | iadd, imul, idiv |
| 2 | Floating-Point Arithmetic | IEEE 754 arithmetic on 16/32/64-bit floats | fadd, fmul, fdiv, fma |
| 3 | Bitwise Operations | AND, OR, XOR, NOT, shifts, population count | and, or, shl, shr |
| 4 | Comparison & Selection | Compare two values, set predicates, conditional select | cmp, sel |
| 5 | Local (Shared) Memory | Read/write to workgroup-visible scratchpad memory | lds_load, lds_store |
| 6 | Device (Global) Memory | Read/write to device-wide memory with configurable scope | load, store |
| 7 | Atomic Operations | Atomic read-modify-write with scoped visibility | atom_add, atom_cas |
| 8 | Wave/Warp Operations | Subgroup-level shuffles, reductions, ballots | wave_shuffle, wave_reduce |
| 9 | Control Flow | Structured branches, loops, and function calls | if/else/endif, loop/break/endloop |
| 10 | Synchronization | Barriers and memory fences at multiple scopes | barrier, fence |
| 11 | Type Conversion | Widening, narrowing, float-int, and format conversions | cvt_f32_i32, cvt_f16_f32 |
A twelfth capability - special register access - provides read-only access to hardware identity registers such as thread ID, workgroup ID, and workgroup dimensions via 16 special registers.
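As a reading aid, the table can be sketched as a small lookup from mnemonic to category. The `Primitive` enum, the `EXAMPLE_MNEMONICS` dictionary, and the `categorize` helper are illustrative names invented for this sketch and are not part of any WAVE SDK; the mnemonics are only the examples listed in the table, not the full opcode set.

```python
from enum import Enum


class Primitive(Enum):
    """The 11 primitive categories, numbered as in the table above."""
    INTEGER_ARITHMETIC = 1
    FLOAT_ARITHMETIC = 2
    BITWISE = 3
    COMPARISON_SELECTION = 4
    LOCAL_MEMORY = 5
    DEVICE_MEMORY = 6
    ATOMIC = 7
    WAVE_OPS = 8
    CONTROL_FLOW = 9
    SYNCHRONIZATION = 10
    TYPE_CONVERSION = 11


# Hypothetical lookup table: only the example mnemonics from the table.
EXAMPLE_MNEMONICS = {
    "iadd": Primitive.INTEGER_ARITHMETIC,
    "imul": Primitive.INTEGER_ARITHMETIC,
    "idiv": Primitive.INTEGER_ARITHMETIC,
    "fadd": Primitive.FLOAT_ARITHMETIC,
    "fmul": Primitive.FLOAT_ARITHMETIC,
    "fdiv": Primitive.FLOAT_ARITHMETIC,
    "fma": Primitive.FLOAT_ARITHMETIC,
    "and": Primitive.BITWISE,
    "or": Primitive.BITWISE,
    "shl": Primitive.BITWISE,
    "shr": Primitive.BITWISE,
    "cmp": Primitive.COMPARISON_SELECTION,
    "sel": Primitive.COMPARISON_SELECTION,
    "lds_load": Primitive.LOCAL_MEMORY,
    "lds_store": Primitive.LOCAL_MEMORY,
    "load": Primitive.DEVICE_MEMORY,
    "store": Primitive.DEVICE_MEMORY,
    "atom_add": Primitive.ATOMIC,
    "atom_cas": Primitive.ATOMIC,
    "wave_shuffle": Primitive.WAVE_OPS,
    "wave_reduce": Primitive.WAVE_OPS,
    "barrier": Primitive.SYNCHRONIZATION,
    "fence": Primitive.SYNCHRONIZATION,
    "cvt_f32_i32": Primitive.TYPE_CONVERSION,
    "cvt_f16_f32": Primitive.TYPE_CONVERSION,
}
# Structured control-flow keywords from row 9 of the table.
for keyword in ("if", "else", "endif", "loop", "break", "endloop"):
    EXAMPLE_MNEMONICS[keyword] = Primitive.CONTROL_FLOW


def categorize(mnemonic: str) -> Primitive:
    """Return the category of a known example mnemonic (KeyError otherwise)."""
    return EXAMPLE_MNEMONICS[mnemonic]
```

Special register access is deliberately left out of the enum, since the text treats it as a twelfth capability rather than one of the 11 categories.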
How the Pipeline Works
WAVE defines a complete compilation pipeline from source to GPU execution. Each stage feeds the next, top to bottom:
| Stage | Description |
|---|---|
| Source kernel | Input to the pipeline |
| wave-compiler | Parses WAVE assembly, validates, encodes |
| .wbin | Portable binary (32-bit base / 64-bit extended instructions) |
| Backend | wave-metal, wave-ptx, wave-hip, or wave-sycl |
| Vendor GPU code | Metal IR, PTX, HIP C++, or SYCL C++ |
| GPU | Dispatched to hardware via vendor runtime |
The Toolchain
The WAVE toolchain consists of four core tools:
- wave-compiler - Compiles WAVE assembly source into the .wbin binary format. Performs validation, register allocation checks, and encoding.
- wave-asm - Standalone assembler that converts WAVE assembly text into .wbin binaries.
- wave-dis - Disassembler that converts .wbin binaries back into human-readable WAVE assembly for inspection and debugging.
- wave-emu - Instruction-level emulator that executes .wbin binaries on the CPU without requiring a GPU. Supports all 11 primitive categories with cycle-accurate memory model semantics.
The Backends
Four backends translate .wbin to vendor-native code:
| Backend | Target | Output |
|---|---|---|
| wave-metal | Apple GPUs (M1-M4, A-series) | Metal Shading Language / Metal IR |
| wave-ptx | NVIDIA GPUs (Turing, Ampere, Hopper) | PTX assembly |
| wave-hip | AMD GPUs (RDNA, CDNA) | HIP C++ |
| wave-sycl | Intel GPUs (Xe, Arc) | SYCL C++ |
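The vendor-to-backend mapping in the table can be sketched as a simple lookup. This is a minimal illustration only: the `BACKENDS` dictionary and the `select_backend` function are hypothetical names, not a WAVE API, and how the real toolchain chooses or invokes a backend is not described here.

```python
# Hypothetical lookup built from the backend table: vendor -> (backend, output).
BACKENDS = {
    "apple": ("wave-metal", "Metal Shading Language / Metal IR"),
    "nvidia": ("wave-ptx", "PTX assembly"),
    "amd": ("wave-hip", "HIP C++"),
    "intel": ("wave-sycl", "SYCL C++"),
}


def select_backend(vendor: str) -> str:
    """Return the backend name for a GPU vendor, case-insensitively."""
    key = vendor.strip().lower()
    if key not in BACKENDS:
        raise ValueError(f"no WAVE backend listed for vendor {vendor!r}")
    return BACKENDS[key][0]


def output_format(vendor: str) -> str:
    """Return the vendor-native code format the selected backend emits."""
    return BACKENDS[vendor.strip().lower()][1]
```

For example, `select_backend("NVIDIA")` yields `wave-ptx`, and an unlisted vendor raises `ValueError` instead of silently falling back to a default.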
The Binary Format
WAVE uses a compact binary encoding:
- Base instructions are 32 bits wide and encode opcode, destination register, and up to two source operands.
- Extended instructions are 64 bits wide and support three or more operands, immediate values, and memory addressing modes.
- The register file provides 32 general-purpose registers (r0-r31), 4 predicate registers (p0-p3) for conditional execution, and 16 special registers for hardware identity and configuration.
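To make the 32-bit base form concrete, here is a minimal sketch of packing an opcode, a destination register, and two source registers into one word. The bit layout is an assumption made for illustration (8-bit opcode in bits 31-24, then three 5-bit register fields, with the low 9 bits left zero); the actual .wbin field positions are defined by the WAVE specification, not by this example. The only facts taken from the text are the 32-bit width and the 32 general-purpose registers, which is why each register index fits in 5 bits.

```python
NUM_GPRS = 32  # r0-r31, as stated in the text

# ASSUMED layout (illustrative only, not the real .wbin encoding):
#   [31:24] opcode   [23:19] dst   [18:14] src0   [13:9] src1   [8:0] zero
OPCODE_SHIFT, DST_SHIFT, SRC0_SHIFT, SRC1_SHIFT = 24, 19, 14, 9


def encode_base(opcode: int, dst: int, src0: int, src1: int = 0) -> int:
    """Pack one base instruction into a 32-bit word under the assumed layout."""
    if not 0 <= opcode < 256:
        raise ValueError("opcode must fit in 8 bits")
    for reg in (dst, src0, src1):
        if not 0 <= reg < NUM_GPRS:
            raise ValueError(f"register index {reg} is outside r0-r31")
    return (
        (opcode << OPCODE_SHIFT)
        | (dst << DST_SHIFT)
        | (src0 << SRC0_SHIFT)
        | (src1 << SRC1_SHIFT)
    )


def decode_base(word: int) -> tuple[int, int, int, int]:
    """Unpack (opcode, dst, src0, src1) from a word built by encode_base."""
    return (
        (word >> OPCODE_SHIFT) & 0xFF,
        (word >> DST_SHIFT) & 0x1F,
        (word >> SRC0_SHIFT) & 0x1F,
        (word >> SRC1_SHIFT) & 0x1F,
    )
```

Under this assumed layout, `encode_base(0x01, 3, 1, 2)` produces `0x01184400`, and `decode_base` recovers `(1, 3, 1, 2)` from it. Instructions needing a third operand, an immediate, or an addressing mode would not fit this form, which is the role the text assigns to the 64-bit extended encoding.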
The Memory Model
WAVE defines a scoped memory model with four visibility levels:
- Wave - visible within a single wave/warp/simdgroup.
- Workgroup - visible within a single workgroup/threadblock/threadgroup.
- Device - visible across all workgroups on the device.
- System - visible across the device and host CPU.
Memory operations and fences specify their scope explicitly, giving the programmer precise control over visibility and ordering.
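One way to reason about the four levels is as an ordered set, from narrowest to widest. The sketch below assumes the scopes are strictly nested (a wider scope includes every narrower one), which follows from the descriptions above but is not stated as a formal rule. The `Scope` enum and both helper functions are hypothetical names for illustration, not WAVE syntax.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Visibility levels, ordered narrowest to widest (assumed nested)."""
    WAVE = 0
    WORKGROUP = 1
    DEVICE = 2
    SYSTEM = 3


def covers(fence_scope: Scope, required_scope: Scope) -> bool:
    """True if a fence at fence_scope is wide enough for required_scope."""
    return fence_scope >= required_scope


def narrowest_scope(same_wave: bool, same_workgroup: bool,
                    host_involved: bool) -> Scope:
    """Pick the narrowest scope that still spans both communicating parties."""
    if host_involved:
        return Scope.SYSTEM
    if same_wave:
        return Scope.WAVE
    if same_workgroup:
        return Scope.WORKGROUP
    return Scope.DEVICE
```

For instance, two threads in different workgroups on the same device need at least `Scope.DEVICE`, so a workgroup-scoped fence would not be sufficient: `covers(Scope.WORKGROUP, Scope.DEVICE)` is `False`.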
Research Foundation
The WAVE ISA is the product of peer-level research. The full specification, primitive derivation methodology, and cross-vendor verification results are published on Zenodo under DOI 10.5281/zenodo.19163452. The paper is in preparation for submission to ASPLOS 2027.
Verification has been performed on three hardware platforms:
- Apple M4 Pro (Metal backend)
- NVIDIA T4 (PTX backend)
- AMD MI300X (HIP backend)
Each primitive was validated against the vendor’s native ISA specification to ensure semantic equivalence.
Next: Installation - install the SDK for your language.