
Memory Scoping

WAVE adopts a scoped acquire/release memory model that threads the needle between TSO (too restrictive for GPU hardware) and fully relaxed semantics (too weak for programmers to reason about correctly).

GPU memory models sit on a spectrum:

  • TSO (Total Store Order): Used by x86 CPUs. All stores become visible to other threads in a single global order that matches program order; the only relaxation is that a load may complete before an earlier store to a different address has left the store buffer. Simple to reason about, but GPUs cannot implement it efficiently: their deep memory hierarchies, per-CU caches, and asynchronous memory controllers would require constant flushing.

  • Fully relaxed: Loads and stores can be reordered arbitrarily. Maximizes hardware freedom but pushes all correctness burden onto programmers, who must insert fences everywhere.

  • Scoped acquire/release: The middle ground. Ordering is enforced only when explicitly requested, and only at the scope where it matters.

WAVE defines four memory scopes, matching the GPU execution hierarchy:

| Scope | Visibility | Use case |
| --- | --- | --- |
| Wave | Lanes within the executing wave | Shuffle results, wave-level reductions |
| Workgroup | All waves in the workgroup | Shared local memory synchronization |
| Device | All waves on the GPU | Global memory coordination between workgroups |
| System | GPU and host CPU | Host-device communication, mapped buffers |

Each scope corresponds to a distinct hardware boundary with different latency and coherence characteristics:

  • Wave scope is free --- lanes within a wave share a register file or L0 cache. No fence is needed.
  • Workgroup scope requires flushing to the local data store (LDS/SLM/threadgroup memory), which is fast (tens of cycles).
  • Device scope requires flushing to L2 or device memory, which is slower (hundreds of cycles).
  • System scope requires cache invalidation visible to the CPU’s coherence domain, which is the most expensive (potentially thousands of cycles).

Collapsing these into fewer levels would force over-synchronization. For example, if workgroup and device scope were merged, every local memory fence would pay the cost of a global fence.
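
To make the "only at the scope where it matters" rule concrete, here is a minimal C++ sketch that picks the narrowest of the four scopes covering two communicating agents. The `Agent` struct, its fields, and `narrowest_scope` are hypothetical names introduced for illustration; they are not part of WAVE.

```cpp
#include <cstdint>

// The four WAVE scopes, ordered from narrowest to widest.
enum class Scope : std::uint8_t { Wave, Workgroup, Device, System };

// Hypothetical description of an agent that touches shared memory.
struct Agent {
    bool on_host;            // true for a host CPU thread
    std::uint32_t workgroup; // workgroup index (ignored for host agents)
    std::uint32_t wave;      // wave index within the workgroup (ignored for host agents)
};

// Return the narrowest scope whose visibility covers both agents.
inline Scope narrowest_scope(const Agent& a, const Agent& b) {
    if (a.on_host || b.on_host) return Scope::System;      // GPU <-> CPU
    if (a.workgroup != b.workgroup) return Scope::Device;   // different workgroups
    if (a.wave != b.wave) return Scope::Workgroup;          // same workgroup, different waves
    return Scope::Wave;                                     // lanes of the same wave
}
```

Two waves in the same workgroup resolve to `Scope::Workgroup`, so they pay only the local fence cost rather than the device-wide one.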

WAVE uses acquire/release ordering for atomic operations and fences:

  • Acquire: No subsequent memory operation in the executing thread can be reordered before this operation. Used when reading shared state (“I need to see everything that happened before the release that made this value visible”).

  • Release: No preceding memory operation in the executing thread can be reordered after this operation. Used when publishing shared state (“everything I wrote before this point must be visible to anyone who acquires”).

; Producer (wave 0)
store.global [addr], r0 ; write data
fence.release.workgroup ; make the data store visible at workgroup scope before the flag
store.global.release [flag], r1 ; signal completion

; Consumer (wave 1, same workgroup)
load.global.acquire r2, [flag] ; acquire: once r2 holds r1's value, all stores before the release are visible
load.global r3, [addr] ; sees r0's value, provided the acquire observed the flag
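
The producer/consumer pattern above has a close analogue in ISO C++ atomics. The sketch below is only an analogy: standard C++ has no scope parameter, so it shows the ordering half of the model and not the scoping half. `Mailbox`, `publish`, and `try_consume` are hypothetical names.

```cpp
#include <atomic>

// Hypothetical mailbox: one payload slot plus a ready flag.
struct Mailbox {
    int data = 0;             // plain (non-atomic) payload
    std::atomic<int> flag{0}; // 0 = empty, 1 = published
};

// Producer: write the payload, then publish it with a release store.
inline void publish(Mailbox& m, int value) {
    m.data = value;                              // ordinary store
    m.flag.store(1, std::memory_order_release);  // cannot be reordered before the data store
}

// Consumer: returns true and fills `out` only if the flag was observed set.
inline bool try_consume(Mailbox& m, int& out) {
    if (m.flag.load(std::memory_order_acquire) != 1) {
        return false; // nothing published yet, so m.data must not be read
    }
    out = m.data;     // ordered after the acquire, so it sees the producer's write
    return true;
}
```

The consumer reads the payload only after its acquire load has observed the flag; reading `data` without that check would be a data race.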

The scope field in the instruction encoding (bits [6:5]) specifies the fence scope. A fence.release.workgroup ensures visibility only within the workgroup --- it does not flush to device-wide caches. A fence.release.device flushes further. The backend maps each scope to the appropriate vendor-specific mechanism.
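
As an illustration of a two-bit field at bits [6:5], the following C++ sketch packs and unpacks a scope value. The 32-bit instruction word and the numeric values assigned to each scope are assumptions made for this example; the text above specifies only the bit position.

```cpp
#include <cstdint>

// Numeric values are assumed for illustration; only the bit position [6:5] comes from the text.
enum class Scope : std::uint32_t { Wave = 0, Workgroup = 1, Device = 2, System = 3 };

constexpr unsigned kScopeShift = 5;                        // lowest bit of the field
constexpr std::uint32_t kScopeMask = 0x3u << kScopeShift;  // bits [6:5], i.e. 0x60

// Write the scope into bits [6:5], leaving all other bits untouched.
inline std::uint32_t set_scope(std::uint32_t word, Scope s) {
    return (word & ~kScopeMask) | (static_cast<std::uint32_t>(s) << kScopeShift);
}

// Read the scope back out of bits [6:5].
inline Scope get_scope(std::uint32_t word) {
    return static_cast<Scope>((word & kScopeMask) >> kScopeShift);
}
```

Two bits are exactly enough to distinguish the four scopes, which is consistent with the field width given above.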

| Concept | NVIDIA | AMD | Intel | Apple |
| --- | --- | --- | --- | --- |
| Formal model | Axiomatic (published with PTX) | Operational (S_WAITCNT counters) | Scoreboard-based | Implicit (async loads) |
| Workgroup fence | membar.cta | s_waitcnt lgkmcnt(0) | fence.slm | threadgroup_barrier(mem_flags::mem_threadgroup) |
| Device fence | membar.gl | s_waitcnt vmcnt(0) | fence.ugm | threadgroup_barrier(mem_flags::mem_device) |
| System fence | membar.sys | s_waitcnt vmcnt(0) + cache flush | fence.ugm.sys | Not directly exposed |
| Acquire | ld.acquire | s_waitcnt + buffer_gl0_inv | load.ugm.ca | Implicit via load ordering |
| Release | st.release | s_waitcnt before store | store.ugm.uc | Implicit via store ordering |
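
A backend could hold the fence rows of this table as a simple lookup. The C++ sketch below is one hypothetical way to do that; `Vendor`, `FenceScope`, and `fence_for` are names invented for the example, and the strings are the table entries rather than a complete lowering. Wave scope is omitted because it needs no fence.

```cpp
#include <string>

enum class Vendor { Nvidia, Amd, Intel, Apple };
enum class FenceScope { Workgroup, Device, System };

// Returns the vendor mechanism listed in the table, or an empty string
// where the table has no direct equivalent (Apple, system scope).
inline std::string fence_for(Vendor v, FenceScope s) {
    static const char* const table[4][3] = {
        // Workgroup, Device, System
        {"membar.cta", "membar.gl", "membar.sys"},  // PTX names the device-level membar .gl
        {"s_waitcnt lgkmcnt(0)", "s_waitcnt vmcnt(0)", "s_waitcnt vmcnt(0) + cache flush"},
        {"fence.slm", "fence.ugm", "fence.ugm.sys"},
        {"threadgroup_barrier(mem_flags::mem_threadgroup)",
         "threadgroup_barrier(mem_flags::mem_device)", ""},
    };
    return table[static_cast<int>(v)][static_cast<int>(s)];
}
```

For example, `fence_for(Vendor::Amd, FenceScope::Workgroup)` yields `s_waitcnt lgkmcnt(0)`, while the Apple system-scope entry comes back empty and would need a separate strategy.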

Every vendor provides mechanisms that map naturally to scoped acquire/release. No vendor requires TSO. No vendor leaves ordering fully relaxed at the ISA level --- they all provide some form of scoped fencing. WAVE’s model captures the common denominator.

A fully relaxed model with explicit fences would technically work, but it shifts the correctness burden entirely to the programmer (or compiler). Experience from C11/C++11 memory models shows that relaxed atomics are a persistent source of bugs. By making acquire/release the default for atomic operations and providing scoped fences, WAVE ensures that the most common synchronization patterns are correct by construction.

TSO would require every store to become visible to all threads in a single global order, with a load allowed to run ahead only of the issuing thread's own buffered stores. On a GPU with 64+ active waves per compute unit, each with 32 to 64 lanes (at least 64 × 32 = 2,048 lanes per compute unit), this would require serializing thousands of concurrent stores through a single coherence point. The performance cost would eliminate the GPU’s primary advantage: massive parallelism. Every vendor’s hardware is designed around relaxed ordering with explicit synchronization points, and WAVE’s model respects this reality.