Backend Mapping

Every WAVE primitive has a direct mapping to vendor-native instructions, validating the ISA design principle that portable abstractions should not require emulation on any target.

| WAVE concept | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| Wave | Warp (32 threads) | Wavefront (32 or 64 threads) | Sub-group (8 or 16 threads) | SIMD-group (32 threads) |
| Workgroup | CTA (Cooperative Thread Array) | Workgroup | Workgroup | Threadgroup |
| Thread | Thread / Lane | Work-item / Lane | Channel | Thread / Lane |

The terminology varies, but the hierarchy is identical: threads are grouped into waves, waves into workgroups, and workgroups are dispatched across compute units.
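The shared hierarchy means a flat device-wide thread index can be computed the same way on every target. A minimal sketch, assuming hypothetical parameter names (`waves_per_workgroup`, `lanes_per_wave`) that are not part of the WAVE spec:

```python
# Illustrative only: flatten the three-level hierarchy shared by all four
# vendors into a unique device-wide thread id.

def global_thread_id(workgroup_id: int, wave_id: int, lane_id: int,
                     waves_per_workgroup: int, lanes_per_wave: int) -> int:
    """Flatten (workgroup, wave, lane) into one device-wide index."""
    threads_per_workgroup = waves_per_workgroup * lanes_per_wave
    return (workgroup_id * threads_per_workgroup
            + wave_id * lanes_per_wave
            + lane_id)

# On a 32-lane target (NVIDIA / Apple), workgroup 2, wave 1, lane 5,
# with 4 waves per workgroup:
# global_thread_id(2, 1, 5, 4, 32) → 293  (2*128 + 1*32 + 5)
```

Only `lanes_per_wave` varies across vendors (8/16 on Intel Xe, 32 or 64 on AMD); the formula itself is target-independent.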

| WAVE | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| `if` / `else` / `endif` | `@p bra` with convergence barriers; hardware per-thread PC tracks active lanes | `s_cbranch_execz` / `s_cbranch_execnz`; compiler manages EXEC mask via `s_and_b64`, `s_or_b64` | Predicated SIMD execution; channel enable mask updated per branch | Hardware divergence stack in `r0l`; structured constructs map nearly 1:1 |
| `loop` / `break` / `endloop` | `bra` with loop-back; `bar.sync` for convergence | `s_branch` / `s_cbranch`; EXEC mask tracks active lanes | Predicated loop with channel mask | Hardware loop construct with automatic mask management |
| Predicate registers | Predicate registers (`@p`, `@!p`) | SCC (scalar condition) / VCC (vector condition) | Flag register + channel enable | Condition code + SIMD mask |

Despite radically different mechanisms (hardware PC tracking vs. compiler-managed masks vs. predicated SIMD vs. hardware stack), all four vendors implement the same observable behavior: lanes that fail a condition are masked off, and all lanes reconverge at the end of the structured region. WAVE specifies the behavior; backends choose the mechanism.
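The observable behavior can be modeled independently of the mechanism. A sketch of EXEC-mask-style execution (the helper names are illustrative, not WAVE mnemonics): lanes failing the condition are masked off for the `then` branch, the mask inverts for the `else` branch, and the full mask is restored at the reconvergence point.

```python
# Minimal lane-mask model of structured if/else divergence.

def simd_if_else(values, cond, then_fn, else_fn):
    """Run then_fn on lanes where cond holds and else_fn on the rest;
    all lanes reconverge when the function returns."""
    exec_mask = [cond(v) for v in values]   # per-lane active bit
    out = list(values)
    # 'then' branch: only active lanes update their result
    for i, active in enumerate(exec_mask):
        if active:
            out[i] = then_fn(values[i])
    # 'else' branch: mask inverted, previously-masked lanes run
    for i, active in enumerate(exec_mask):
        if not active:
            out[i] = else_fn(values[i])
    return out  # reconvergence point: full mask restored

# simd_if_else([1, 2, 3, 4], lambda v: v % 2 == 0,
#              lambda v: v * 10, lambda v: -v) → [-1, 20, -3, 40]
```

Whether the hardware serializes the two branches (NVIDIA, Apple) or the compiler flips a mask register (AMD, Intel), the lane-visible result is this.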

| WAVE | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| `load.local` | `ld.shared` | `ds_read_b32` (LDS) | `load.slm` (SLM) | threadgroup load |
| `store.local` | `st.shared` | `ds_write_b32` (LDS) | `store.slm` (SLM) | threadgroup store |
| `load.global` | `ld.global` | `buffer_load_b32` / `flat_load_b32` | `load.ugm` (A64) | device `[[buffer]]` load |
| `store.global` | `st.global` | `buffer_store_b32` / `flat_store_b32` | `store.ugm` (A64) | device `[[buffer]]` store |

All four vendors provide a software-managed scratchpad shared within a workgroup. The names differ (shared / LDS / SLM / threadgroup), but the semantics are identical: low-latency memory visible to all waves in the workgroup, explicitly allocated and addressed.

Global memory access goes through different paths on each vendor (NVIDIA’s load/store units, AMD’s VMEM pipeline, Intel’s LSC, Apple’s memory controller), but the programmer-visible behavior is the same: loads and stores to a flat address space visible to all waves on the device.
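The visibility rules of the two address spaces can be summarized in a few lines. A hypothetical model (the class and method names are illustrative, not WAVE API): local memory is one scratchpad per workgroup, global memory is one flat array for the whole device.

```python
# Illustrative model of local vs. global address-space visibility.

class Device:
    def __init__(self, num_workgroups: int, local_words: int, global_words: int):
        # One scratchpad per workgroup (shared / LDS / SLM / threadgroup)
        self.local_mem = [[0] * local_words for _ in range(num_workgroups)]
        # One flat address space visible to every workgroup
        self.global_mem = [0] * global_words

    def store_local(self, wg: int, addr: int, value: int) -> None:
        self.local_mem[wg][addr] = value      # visible only within workgroup wg

    def load_local(self, wg: int, addr: int) -> int:
        return self.local_mem[wg][addr]

    def store_global(self, addr: int, value: int) -> None:
        self.global_mem[addr] = value         # visible to all workgroups

    def load_global(self, addr: int) -> int:
        return self.global_mem[addr]
```

A store via `store_local` in workgroup 0 is never observable from workgroup 1; a `store_global` is observable everywhere (subject to the fence rules below).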

| WAVE | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| `barrier` | `bar.sync` | `s_barrier` | Gateway barrier | `threadgroup_barrier()` |
| `fence.workgroup` | `membar.cta` | `s_waitcnt lgkmcnt(0)` | `fence.slm` | `threadgroup_barrier(mem_flags::mem_threadgroup)` |
| `fence.device` | `membar.gpu` | `s_waitcnt vmcnt(0)` | `fence.ugm` | `threadgroup_barrier(mem_flags::mem_device)` |
| `fence.system` | `membar.sys` | `s_waitcnt` + cache invalidation | `fence.ugm.sys` | Not directly exposed (system scope is limited) |

Workgroup barriers are the most straightforward mapping: every vendor provides a single instruction that synchronizes all waves in the workgroup. WAVE’s barrier instruction maps 1:1 in every case.

Memory fences show more variation in mechanism (NVIDIA uses explicit membar instructions, AMD counts outstanding operations with s_waitcnt, Intel uses typed fences, Apple piggybacks on barrier flags) but the semantics align: ensure that memory operations before the fence are visible at the specified scope.
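The barrier contract, mechanism aside, is simply: no wave proceeds past the barrier until every wave in the workgroup has arrived, and writes made before the barrier are visible after it. A sketch using Python threads as stand-in waves (the function names are illustrative):

```python
# Model of workgroup barrier semantics: two phases separated by a barrier.
import threading

def run_workgroup(num_waves: int):
    barrier = threading.Barrier(num_waves)
    arrivals, observed = [], []
    lock = threading.Lock()

    def wave(wid: int) -> None:
        with lock:
            arrivals.append(wid)           # phase 1: every wave writes
        barrier.wait()                     # maps to bar.sync / s_barrier / ...
        with lock:
            observed.append(len(arrivals)) # phase 2: all phase-1 writes visible

    threads = [threading.Thread(target=wave, args=(w,)) for w in range(num_waves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed  # every entry == num_waves: no wave left early

# run_workgroup(4) → [4, 4, 4, 4]
```

On GPUs the Metal variant makes the coupling explicit: the fence is a flag on the barrier call, so ordering and arrival synchronization come from the same instruction.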

| WAVE | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| `atom.add` | `atom.global.add` | `flat_atomic_add` / `global_atomic_add` | `lsc_atomic_iadd` | `atomic_fetch_add_explicit` |
| `atom.cas` | `atom.global.cas` | `flat_atomic_cmpswap` / `global_atomic_cmpswap` | `lsc_atomic_cmpxchg` | `atomic_compare_exchange_weak_explicit` |
| `atom.min` | `atom.global.min` | `flat_atomic_smin` / `global_atomic_smin` | `lsc_atomic_smin` | `atomic_fetch_min_explicit` |
| `atom.max` | `atom.global.max` | `flat_atomic_smax` / `global_atomic_smax` | `lsc_atomic_smax` | `atomic_fetch_max_explicit` |
| `atom.and` | `atom.global.and` | `flat_atomic_and` / `global_atomic_and` | `lsc_atomic_and` | `atomic_fetch_and_explicit` |
| `atom.or` | `atom.global.or` | `flat_atomic_or` / `global_atomic_or` | `lsc_atomic_or` | `atomic_fetch_or_explicit` |
| `atom.xor` | `atom.global.xor` | `flat_atomic_xor` / `global_atomic_xor` | `lsc_atomic_xor` | `atomic_fetch_xor_explicit` |

Every atomic operation in WAVE has a direct 1:1 mapping on all four targets. No emulation is needed.
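The shared contract is the read-modify-write shape: a fetch-style atomic returns the value held *before* the update, and `atom.cas` returns the value it observed whether or not the swap succeeded. A sketch using a lock as a stand-in for the hardware atomic unit (class and helper names are illustrative):

```python
# Model of fetch-style atomic and compare-and-swap semantics.
import threading

class AtomicCell:
    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()  # stands in for the hardware atomic unit

    def fetch_add(self, n: int) -> int:
        with self._lock:
            old = self._value
            self._value = old + n
            return old                 # value *before* the add, as atom.add returns

    def compare_and_swap(self, expected: int, desired: int) -> int:
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old                 # observed value, as atom.cas returns

def parallel_count(num_threads: int, increments: int) -> int:
    cell = AtomicCell()
    def worker():
        for _ in range(increments):
            cell.fetch_add(1)
    ts = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return cell._value  # deterministic despite concurrent updates

# parallel_count(8, 1000) → 8000
```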

| WAVE | NVIDIA PTX | AMD RDNA/CDNA | Intel Xe | Apple Metal |
|---|---|---|---|---|
| `shuffle` | `shfl.sync.idx` | `ds_permute_b32` | `sub_group_shuffle` | `simd_shuffle` |
| `shuffle_down` | `shfl.sync.down` | DPP row shift right | `sub_group_shuffle_down` | `simd_shuffle_down` |
| `shuffle_up` | `shfl.sync.up` | DPP row shift left | `sub_group_shuffle_up` | `simd_shuffle_up` |
| `shuffle_xor` | `shfl.sync.bfly` | DPP butterfly | `sub_group_shuffle_xor` | `simd_shuffle_xor` |

See Shuffle Primitive for the performance analysis that motivated making shuffle mandatory.
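A pure-Python model of the xor variant shows why it is the workhorse: a butterfly reduction sums an entire wave in log2(width) shuffle steps, with no local memory or barrier. The function names below are illustrative, not WAVE mnemonics.

```python
# Model of shuffle_xor and the butterfly wave reduction built on it.

def shuffle_xor(lane_values, mask):
    """Each lane reads the value held by lane (lane_id XOR mask)."""
    return [lane_values[i ^ mask] for i in range(len(lane_values))]

def wave_sum(lane_values):
    """Butterfly reduction: after log2(width) xor-shuffles, every lane
    holds the sum of the whole wave (width must be a power of two)."""
    width = len(lane_values)
    vals = list(lane_values)
    mask = width // 2
    while mask:
        partner = shuffle_xor(vals, mask)      # exchange with xor partner
        vals = [a + b for a, b in zip(vals, partner)]
        mask //= 2
    return vals

# wave_sum([1, 2, 3, 4]) → [10, 10, 10, 10]
```

The same loop compiles to `shfl.sync.bfly`, DPP butterfly, `sub_group_shuffle_xor`, or `simd_shuffle_xor` with only the lane width changing per target.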

The table below summarizes the mapping quality for each primitive category:

| Category | Mapping type | Backend complexity |
|---|---|---|
| Arithmetic (`iadd`, `fadd`, `fmul`, …) | 1:1 instruction | Trivial (opcode substitution) |
| Memory (`load`, `store`) | 1:1 instruction | Low (address mode translation) |
| Atomics | 1:1 instruction | Low (scope annotation) |
| Barriers | 1:1 instruction | Trivial |
| Fences | 1:1 or short sequence | Low (scope-to-mechanism mapping) |
| Control flow | Short instruction sequence | Medium (mask management varies) |
| Shuffle | 1:1 instruction | Low (variant selection) |

No WAVE primitive requires more than a short instruction sequence on any target. This is by design: every primitive was selected because it exists in hardware on all four vendors (see Cross-Vendor Analysis). The backend is a thin translation layer, not a compiler.