Introduction

WAVE (Wide Architecture Virtual Encoding) is a vendor-neutral GPU instruction set architecture that lets you write GPU kernels once and run them on Apple, NVIDIA, AMD, and Intel hardware without modification.

GPU computing today is fragmented. NVIDIA’s CUDA dominates with its mature ecosystem, but CUDA kernels only run on NVIDIA hardware. If you want to target Apple Silicon, you rewrite in Metal Shading Language. AMD requires ROCm/HIP. Intel requires oneAPI/SYCL. Each vendor defines its own ISA, memory model, synchronization primitives, and toolchain.

This means:

  • Duplicated engineering effort - the same algorithm rewritten 2-4 times for different vendors.
  • Vendor lock-in - choosing CUDA today locks out every non-NVIDIA GPU tomorrow.
  • Fragmented testing - each port requires its own validation and performance tuning.
  • Limited portability - researchers and startups cannot afford to maintain multiple GPU backends.

The Insight: 11 Hardware-Invariant Primitives

WAVE was built on a systematic analysis of 5,000+ pages of vendor ISA documentation spanning 16 microarchitectures across all four major GPU vendors. That analysis revealed that despite surface-level differences in encoding, naming, and calling conventions, every GPU architecture provides the same 11 categories of fundamental operations:

| # | Primitive Category | Description | Examples |
|---|---|---|---|
| 1 | Integer Arithmetic | Addition, subtraction, multiplication, division, modulo on integer types | iadd, imul, idiv |
| 2 | Floating-Point Arithmetic | IEEE 754 arithmetic on 16/32/64-bit floats | fadd, fmul, fdiv, fma |
| 3 | Bitwise Operations | AND, OR, XOR, NOT, shifts, population count | and, or, shl, shr |
| 4 | Comparison & Selection | Compare two values, set predicates, conditional select | cmp, sel |
| 5 | Local (Shared) Memory | Read/write to workgroup-visible scratchpad memory | lds_load, lds_store |
| 6 | Device (Global) Memory | Read/write to device-wide memory with configurable scope | load, store |
| 7 | Atomic Operations | Atomic read-modify-write with scoped visibility | atom_add, atom_cas |
| 8 | Wave/Warp Operations | Subgroup-level shuffles, reductions, ballots | wave_shuffle, wave_reduce |
| 9 | Control Flow | Structured branches, loops, and function calls | if/else/endif, loop/break/endloop |
| 10 | Synchronization | Barriers and memory fences at multiple scopes | barrier, fence |
| 11 | Type Conversion | Widening, narrowing, float-int, and format conversions | cvt_f32_i32, cvt_f16_f32 |

A twelfth capability - special register access - exposes 16 read-only special registers that hold hardware identity values such as thread ID, workgroup ID, and workgroup dimensions.
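As an illustration of how identity registers like these are typically combined, the sketch below computes a flat global thread index for a one-dimensional launch. It is written in Python rather than WAVE assembly, and the function and parameter names are hypothetical; they are not taken from the WAVE specification.

```python
def global_thread_index(workgroup_id: int, workgroup_size: int, thread_id: int) -> int:
    """Flat index of a thread in a 1-D launch (hypothetical helper, not a WAVE API)."""
    if not 0 <= thread_id < workgroup_size:
        raise ValueError("thread_id must lie inside the workgroup")
    # Each workgroup owns a contiguous block of workgroup_size indices.
    return workgroup_id * workgroup_size + thread_id


# Thread 5 of workgroup 3, with 64 threads per workgroup.
print(global_thread_index(3, 64, 5))  # 197
```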

WAVE defines a complete compilation pipeline from source to GPU execution:

  1. Source kernel
  2. wave-compiler - Parses WAVE assembly, validates, encodes
  3. .wbin - Portable binary (32-bit base / 64-bit extended instructions)
  4. Backend - wave-metal | wave-ptx | wave-hip | wave-sycl
  5. Vendor GPU code - Metal IR | PTX | HIP C++ | SYCL C++
  6. GPU - Dispatched to hardware via vendor runtime

The WAVE toolchain consists of four core tools:

  • wave-compiler - Compiles WAVE assembly source into the .wbin binary format. Performs validation, register allocation checks, and encoding.
  • wave-asm - Standalone assembler that converts WAVE assembly text into .wbin binaries.
  • wave-dis - Disassembler that converts .wbin binaries back into human-readable WAVE assembly for inspection and debugging.
  • wave-emu - Instruction-level emulator that executes .wbin binaries on the CPU without requiring a GPU. Supports all 11 primitive categories with cycle-accurate memory model semantics.

Four backends translate .wbin to vendor-native code:

| Backend | Target | Output |
|---|---|---|
| wave-metal | Apple GPUs (M1-M4, A-series) | Metal Shading Language / Metal IR |
| wave-ptx | NVIDIA GPUs (Turing, Ampere, Hopper) | PTX assembly |
| wave-hip | AMD GPUs (RDNA, CDNA) | HIP C++ |
| wave-sycl | Intel GPUs (Xe, Arc) | SYCL C++ |
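The vendor-to-backend mapping can be expressed as a small lookup. The following Python sketch only restates the table; the vendor keys and the `select_backend` helper are hypothetical and do not correspond to any documented WAVE SDK function.

```python
# Backend names and outputs come from the table above;
# the vendor keys and this helper are illustrative assumptions.
BACKENDS = {
    "apple": ("wave-metal", "Metal Shading Language / Metal IR"),
    "nvidia": ("wave-ptx", "PTX assembly"),
    "amd": ("wave-hip", "HIP C++"),
    "intel": ("wave-sycl", "SYCL C++"),
}


def select_backend(vendor: str) -> str:
    """Return the backend tool name for a GPU vendor (case-insensitive)."""
    key = vendor.strip().lower()
    if key not in BACKENDS:
        raise ValueError(f"no WAVE backend listed for vendor {vendor!r}")
    return BACKENDS[key][0]


print(select_backend("NVIDIA"))  # wave-ptx
```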

WAVE uses a compact binary encoding:

  • Base instructions are 32 bits wide and encode opcode, destination register, and up to two source operands.
  • Extended instructions are 64 bits wide and support three or more operands, immediate values, and memory addressing modes.
  • The register file provides 32 general-purpose registers (r0-r31), 4 predicate registers (p0-p3) for conditional execution, and 16 special registers for hardware identity and configuration.
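To make the base format concrete, here is a minimal Python sketch that packs an opcode, one destination register, and two source registers into a 32-bit word. Only the 5-bit register fields follow from the text (32 general-purpose registers); the 8-bit opcode width, the field order, and the unused low bits are assumptions for illustration and are not the actual WAVE bit layout.

```python
# Hypothetical layout: [31:24] opcode | [23:19] dst | [18:14] src0 | [13:9] src1 | [8:0] unused
OPCODE_BITS = 8                      # assumption, not from the specification
REG_BITS = 5                         # 5 bits are enough to index r0-r31
OPCODE_SHIFT = 32 - OPCODE_BITS      # 24
DST_SHIFT = OPCODE_SHIFT - REG_BITS  # 19
SRC0_SHIFT = DST_SHIFT - REG_BITS    # 14
SRC1_SHIFT = SRC0_SHIFT - REG_BITS   # 9
REG_MASK = (1 << REG_BITS) - 1


def encode_base(opcode: int, dst: int, src0: int, src1: int = 0) -> int:
    """Pack the fields into one 32-bit instruction word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    for reg in (dst, src0, src1):
        if not 0 <= reg <= REG_MASK:
            raise ValueError("register index must be in r0-r31")
    return (opcode << OPCODE_SHIFT) | (dst << DST_SHIFT) | (src0 << SRC0_SHIFT) | (src1 << SRC1_SHIFT)


def decode_base(word: int) -> tuple[int, int, int, int]:
    """Unpack (opcode, dst, src0, src1) from a 32-bit instruction word."""
    return (
        (word >> OPCODE_SHIFT) & ((1 << OPCODE_BITS) - 1),
        (word >> DST_SHIFT) & REG_MASK,
        (word >> SRC0_SHIFT) & REG_MASK,
        (word >> SRC1_SHIFT) & REG_MASK,
    )


word = encode_base(0x01, dst=3, src0=1, src1=2)
print(f"{word:#010x}")    # 0x01184400
print(decode_base(word))  # (1, 3, 1, 2)
```

Under this assumed layout, three 5-bit register fields consume 15 bits, which leaves room for the opcode and a few spare bits in a single 32-bit word.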

WAVE defines a scoped memory model with four visibility levels:

  1. Wave - visible within a single wave/warp/simdgroup.
  2. Workgroup - visible within a single workgroup/threadblock/threadgroup.
  3. Device - visible across all workgroups on the device.
  4. System - visible across the device and host CPU.

Memory operations and fences specify their scope explicitly, giving the programmer precise control over visibility and ordering.
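The four scopes can be modelled as an ordered enumeration. The Python sketch below assumes the scopes are strictly nested, so that a fence at a wider scope also covers every narrower one. That nesting, along with the `Scope` and `fence_covers` names, is an assumption made for illustration rather than a statement from the WAVE specification.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Visibility levels, ordered from narrowest to widest."""
    WAVE = 1
    WORKGROUP = 2
    DEVICE = 3
    SYSTEM = 4


def fence_covers(fence_scope: Scope, required: Scope) -> bool:
    """True if a fence at fence_scope is at least as wide as the required scope."""
    return fence_scope >= required


# A device-scope fence is wide enough for workgroup-level sharing...
print(fence_covers(Scope.DEVICE, Scope.WORKGROUP))  # True
# ...but a wave-scope fence is not wide enough for device-level sharing.
print(fence_covers(Scope.WAVE, Scope.DEVICE))       # False
```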

The WAVE ISA is the product of peer-level research. The full specification, primitive derivation methodology, and cross-vendor verification results are published on Zenodo under DOI 10.5281/zenodo.19163452. The paper is in preparation for submission to ASPLOS 2027.

Verification has been performed on three hardware platforms:

  • Apple M4 Pro (Metal backend)
  • NVIDIA T4 (PTX backend)
  • AMD MI300X (HIP backend)

Each primitive was validated against the vendor’s native ISA specification to ensure semantic equivalence.
