Skip to content

Backends

Each WAVE backend reads a .wbin binary and translates it into vendor-native GPU code that can be compiled and dispatched by the vendor’s own toolchain.

wave-metalwave-ptxwave-hipwave-sycl
Output formatMetal Shading Language (MSL)PTX assemblyHIP C++SYCL C++
Target hardwareApple M1-M4, A-seriesNVIDIA Turing+ (SM 75+)AMD RDNA / CDNAIntel Xe / Arc
Vendor toolchainmetallib / Xcodeptxas / CUDA toolkithipcc / ROCmdpcpp / oneAPI
Wave width32 (fixed)32 (warpSize)64 (RDNA: 32/64)varies (subgroup)
F64 supportNoYesYesYes

The Metal backend generates MSL compute kernel functions.

WAVE’s 32 general-purpose registers map to local uint32_t variables:

uint32_t r0 = 0, r1 = 0, r2 = 0, /* ... */ r31 = 0;

Floating-point operations use as_type<float>() for bitwise reinterpretation between uint32_t and float, preserving WAVE’s untyped register model.

  • Local (shared) memory maps to threadgroup uint8_t shared_mem[N], where N is the local memory size declared in the .wbin metadata. Loads and stores use pointer casts at the access site.
  • Device (global) memory maps to device uint8_t* buf [[buffer(0)]] passed as a kernel argument. The backend generates byte-addressed loads and stores with explicit casts to the target type.

WAVE wave-level operations map to Metal’s SIMD-group functions:

WAVE instructionMSL function
wave_shufflesimd_shuffle()
wave_reduce_addsimd_sum()
wave_ballotsimd_ballot()
wave_broadcastsimd_broadcast()

Barriers translate to threadgroup_barrier(mem_flags::mem_threadgroup) for local memory scope and threadgroup_barrier(mem_flags::mem_device) for device memory scope.

  • No F64 support. Apple GPUs do not support double-precision floating-point arithmetic. Kernels using f64 types will fail at the Metal backend with a diagnostic error.
  • Fixed wave width of 32. Metal SIMD-groups are always 32 lanes wide. The backend hardcodes this value rather than querying it at runtime.
  • Buffer binding model. Device memory is passed via Metal buffer bindings ([[buffer(N)]]), which limits the number of distinct device memory arguments.

The PTX backend generates PTX assembly text targeting SM 75 (Turing) or later.

WAVE registers map to PTX virtual registers with type-specific prefixes:

WAVE typePTX registerPTX type
i32 / u32%r0 - %r31.b32
f32%f0 - %f31.f32
f64%rd0 - %rd31.b64

Float bitcasting uses mov.b32 to reinterpret between integer and float register types without conversion:

mov.b32 %f0, %r0; // reinterpret uint32 as float
  • Local (shared) memory maps to .shared .b8 shared_mem[N]; in the PTX .shared state space.
  • Device (global) memory uses 64-bit addressing with ld.global and st.global instructions:
ld.global.b32 %r0, [%rd0]; // load 32 bits from global address in %rd0
st.global.b32 [%rd0], %r1; // store 32 bits to global address

PTX atomics use the atom.global and atom.shared instructions with explicit scope:

atom.global.add.u32 %r0, [%rd0], %r1; // global atomic add
atom.shared.cas.b32 %r0, [%r2], %r3, %r4; // shared CAS

PTX does not provide atom.sub. The backend emits neg followed by atom.add as an equivalent sequence.

  • SM 75+ required. The backend uses features introduced in Turing (e.g., uniform registers, independent thread scheduling). Older GPUs are not supported.
  • No atomic subtract. Implemented as negate-then-add, which is functionally equivalent but uses two instructions.

The HIP backend generates HIP C++ kernel functions compilable with hipcc.

WAVE registers map to local uint32_t variables, identical to the Metal approach:

uint32_t r0 = 0, r1 = 0, /* ... */ r31 = 0;

Float bitcasting uses AMD-specific intrinsics:

float f = __uint_as_float(r0); // uint32 → float
uint32_t u = __float_as_uint(f); // float → uint32
  • Local (shared) memory maps to dynamically-sized shared memory declared with extern __shared__ uint8_t shared_mem[];. The actual size is set at kernel launch.
  • Device (global) memory is passed as a raw pointer parameter (uint8_t* __restrict__ buf).

WAVE wave-level operations map to HIP intrinsics:

WAVE instructionHIP function
wave_shuffle__shfl()
wave_ballot__ballot()
wave_reduce_addManual shuffle-tree reduction
wave_broadcast__shfl(val, src_lane)

The backend uses warpSize rather than hardcoding the wave width, allowing correct execution on both RDNA (wave32/wave64 configurable) and CDNA (wave64) hardware.

Barriers map to __syncthreads() for workgroup scope. The backend does not currently emit wave-level barriers as AMD hardware provides implicit wave-level synchrony.

  • Wave reduce is emulated. HIP does not expose a single-instruction wave reduce. The backend generates a butterfly reduction tree using __shfl_xor().
  • RDNA vs. CDNA wave width. While the backend uses warpSize for correctness, performance tuning may differ between wave32 (RDNA) and wave64 (CDNA) modes.

The SYCL backend generates SYCL C++ kernel functions targeting the oneAPI DPC++ compiler.

WAVE registers map to local uint32_t variables:

uint32_t r0 = 0, r1 = 0, /* ... */ r31 = 0;

Float bitcasting uses the SYCL standard library:

float f = sycl::bit_cast<float>(r0);
uint32_t u = sycl::bit_cast<uint32_t>(f);
  • Local (shared) memory maps to sycl::local_accessor<uint8_t, 1> bound during command group submission. The accessor is passed to the kernel as a parameter.
  • Device (global) memory uses SYCL Unified Shared Memory (USM) pointers (uint8_t*), allocated with sycl::malloc_device() and passed directly to the kernel.

WAVE wave-level operations map to SYCL sub-group functions:

WAVE instructionSYCL function
wave_shufflesycl::select_from_group(sg, val, lane)
wave_reduce_addsycl::reduce_over_group(sg, val, sycl::plus<>())
wave_ballotsycl::group_ballot(sg, pred)
wave_broadcastsycl::group_broadcast(sg, val, lane)

SYCL atomics use sycl::atomic_ref with explicit memory order and scope:

sycl::atomic_ref<uint32_t, sycl::memory_order::relaxed,
sycl::memory_scope::device,
sycl::access::address_space::global_space> ref(buf[offset]);
ref.fetch_add(val);

Memory fences use sycl::atomic_fence() with configurable scope matching WAVE’s four-level scoping model (wave, workgroup, device, system).

  • Sub-group size variability. Intel GPUs support multiple sub-group sizes (8, 16, 32). The backend requests a specific size via [[intel::reqd_sub_group_size(N)]] but the actual size depends on the target hardware.
  • Local memory model. SYCL’s accessor-based local memory model differs from the raw pointer model used by Metal, PTX, and HIP. The backend must manage accessor creation during command group submission, which adds complexity to the host-side dispatch code.

Every backend must reinterpret uint32_t register values as float without numeric conversion. Each uses a different mechanism:

BackendMechanism
wave-metalas_type<float>(r)
wave-ptxmov.b32 %f, %r
wave-hip__uint_as_float(r)
wave-syclsycl::bit_cast<float>(r)
BackendDeclaration
wave-metalthreadgroup uint8_t shared_mem[N]
wave-ptx.shared .b8 shared_mem[N]
wave-hipextern __shared__ uint8_t shared_mem[]
wave-syclsycl::local_accessor<uint8_t, 1>
BackendInstruction
wave-metalthreadgroup_barrier(mem_flags::mem_threadgroup)
wave-ptxbar.sync 0
wave-hip__syncthreads()
wave-syclsycl::group_barrier(wg)

Next: Emulator for how wave-emu executes .wbin binaries without a GPU.