Memory Model
WAVE exposes three distinct memory levels - registers, local memory, and device memory - each with different scope, latency, and capacity trade-offs.
Memory Hierarchy at a Glance

    ┌─────────────────────────────────────────────┐
    │                Device Memory                │
    │       (global, GB scale, ~300 cycles)       │
    ├─────────────────────────────────────────────┤
    │        Local Memory (per-workgroup)         │
    │       (shared, 48–96 KB, ~20 cycles)        │
    ├──────────┬──────────┬──────────┬────────────┤
    │  Thread  │  Thread  │  Thread  │    ...     │
    │  r0–r31  │  r0–r31  │  r0–r31  │            │
    │ (1 cyc)  │ (1 cyc)  │ (1 cyc)  │            │
    └──────────┴──────────┴──────────┴────────────┘

| Level | Scope | Addressable | Typical Size | Latency |
|---|---|---|---|---|
| Registers (r0–r31) | Per-thread | By name | 32 x 32-bit per thread | ~1 cycle |
| Local memory | Per-workgroup | Byte-addressable | 48–96 KB | ~20 cycles |
| Device memory | Global | 64-bit byte-addressable | GBs | ~300 cycles |
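
To see why the hierarchy matters, the latency figures in the table can be plugged into a rough cost estimate. The sketch below is a back-of-the-envelope model in Python, not a WAVE tool: the function names are hypothetical, the cycle counts are the approximate values from the table, and the model ignores overlap between accesses, caching, and barrier cost.

```python
# Approximate latencies from the table above (cycles per access).
DEVICE_LATENCY = 300
LOCAL_LATENCY = 20


def direct_cost(reads: int) -> int:
    """Cycles spent if every read goes to device memory."""
    return reads * DEVICE_LATENCY


def staged_cost(reads: int) -> int:
    """Cycles spent if the value is first staged into local memory.

    One device load plus one local store, then every read hits local memory.
    """
    return DEVICE_LATENCY + LOCAL_LATENCY + reads * LOCAL_LATENCY


for reads in (1, 2, 8):
    print(reads, direct_cost(reads), staged_cost(reads))
```

Under these assumptions a single read is cheaper straight from device memory (300 vs. 340 cycles), while staging already wins when an element is read twice (600 vs. 360 cycles).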
Registers
Each thread has its own set of 32 general-purpose registers (r0–r31). Registers are the fastest storage but the scarcest resource. Other threads cannot see your registers - they are completely private.

    mov_imm r0, 100     ; immediate to register
    iadd r2, r0, r1     ; register-to-register arithmetic

Load and Store Operations
Widths
Both local_load/local_store and device_load/device_store support five widths:
| Suffix | Width | Use case |
|---|---|---|
| u8 | 8-bit | Byte-level data, characters |
| u16 | 16-bit | Half-precision floats, short integers |
| u32 | 32-bit | Standard floats and integers |
| u64 | 64-bit | Doubles, 64-bit addresses, pointers |
| u128 | 128-bit | Wide loads for throughput (4 packed floats) |
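
Because each register holds 32 bits, a value wider than 32 bits spans several consecutive registers, as the load and store examples in this section indicate. The arithmetic can be sketched in Python; the helper names are hypothetical, and the sketch assumes that sub-32-bit widths (u8, u16) still occupy one whole register, which the text does not state.

```python
REGISTER_BITS = 32  # each general-purpose register is 32 bits wide


def registers_for(width_bits: int) -> int:
    """Number of consecutive 32-bit registers a value of this width occupies."""
    # Ceiling division; assumes narrow values still take a whole register.
    return -(-width_bits // REGISTER_BITS)


def register_span(first: int, width_bits: int) -> str:
    """Name the register range used, starting at register number `first`."""
    count = registers_for(width_bits)
    last = first + count - 1
    return f"r{first}" if count == 1 else f"r{first}-r{last}"


for width in (8, 16, 32, 64, 128):
    print(f"u{width}", register_span(0, width))
```

For a 128-bit load into r0 this yields the range r0 through r3, and for a 64-bit value the pair r0 and r1.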

    device_load.u32 r0, r1     ; load 32 bits from device address in r1
    device_load.u128 r0, r1    ; load 128 bits (fills r0–r3)
    local_load.u32 r0, r1      ; load 32 bits from local address in r1
    device_store.u64 r0, r1    ; store 64 bits (r0–r1) to device address

Cache Hints
Every device memory operation accepts an optional cache hint that controls caching behavior:
| Hint | Value | Meaning |
|---|---|---|
| cached | 0 | Default. Use the cache hierarchy normally. |
| uncached | 1 | Bypass caches. Use for data that will not be reused. |
| streaming | 2 | Cache with low priority. Use for data accessed once in a streaming pattern. |

    device_load.u32.cached r0, r1       ; normal caching (default)
    device_load.u32.uncached r0, r1     ; bypass cache
    device_load.u32.streaming r0, r1    ; streaming / low-priority cache

When to use each hint:
- Use cached (or omit the hint) for data that may be read multiple times.
- Use streaming for large sequential scans where each element is touched once.
- Use uncached for data written by other workgroups that you need to read fresh.
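
The three guidelines above can be read as a small decision procedure. The Python sketch below is only an illustration: the CacheHint enum mirrors the values in the hint table, while choose_hint and its parameters are hypothetical, and the priority order (freshness first, then reuse) is an assumption rather than something the text specifies.

```python
from enum import IntEnum


class CacheHint(IntEnum):
    CACHED = 0     # default: use the cache hierarchy normally
    UNCACHED = 1   # bypass caches
    STREAMING = 2  # cache with low priority


def choose_hint(written_by_other_workgroups: bool, reused: bool) -> CacheHint:
    """Pick a hint by applying the guidelines in an assumed priority order."""
    if written_by_other_workgroups:
        return CacheHint.UNCACHED   # need to read fresh data
    if not reused:
        return CacheHint.STREAMING  # touched once, e.g. a sequential scan
    return CacheHint.CACHED         # may be read multiple times


print(choose_hint(written_by_other_workgroups=False, reused=True).name)
```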
Local Memory (Shared Memory)
Local memory is shared among all threads in a workgroup. It is small (48–96 KB), fast (~20 cycle latency), and byte-addressable. Common uses include:
- Staging data from device memory for repeated access
- Communication between threads in the same workgroup
- Building partial results (e.g., reduction scratch space)

    ; Thread 0 writes a value; all threads in the workgroup can read it.
    mov_imm r0, 42
    local_store.u32 r0, r1    ; store r0 at local address r1

    barrier                   ; ensure all threads see the write

    local_load.u32 r2, r1     ; any thread can now load the value

Important: You must insert a barrier between a local store and a local load from a different thread. Without the barrier, a thread may read stale or uninitialized data.
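
The store, barrier, load sequence can be imitated on a CPU to make the ordering concrete. The following Python sketch is an analogy, not WAVE code: Python threads stand in for the threads of one workgroup, threading.Barrier plays the role of barrier, a one-element list plays the role of local memory, and WORKGROUP_SIZE and run_workgroup are hypothetical names.

```python
import threading

WORKGROUP_SIZE = 8  # hypothetical workgroup size for this sketch


def run_workgroup(size: int = WORKGROUP_SIZE) -> list:
    local_memory = [None]              # one slot of "local memory"
    barrier = threading.Barrier(size)  # stands in for WAVE's barrier
    seen = [None] * size               # what each thread loaded

    def thread_body(tid: int) -> None:
        if tid == 0:
            local_memory[0] = 42       # the local store, done by thread 0
        barrier.wait()                 # nobody proceeds until all have arrived
        seen[tid] = local_memory[0]    # the local load, done by every thread

    threads = [threading.Thread(target=thread_body, args=(t,)) for t in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_workgroup())
```

Every entry of the result is 42, because thread 0 stores before it reaches the barrier and no thread loads until all have passed it. If the barrier.wait() call were removed, a thread scheduled ahead of thread 0 could load None, the analogue of reading uninitialized local memory.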
Example: Cooperative Device-to-Local Tiling
A common pattern is for each thread to load one element from device memory into local memory, synchronize, then have every thread read from the faster local copy:

    ; Each thread loads one element from device memory into local memory
    mov_sr r0, sr_thread_id_x
    shl r1, r0, 2             ; local byte offset = tid * 4
    iadd r2, r10, r1          ; device address = base + offset
    device_load.u32 r3, r2    ; load from device memory
    local_store.u32 r3, r1    ; store into local memory

    barrier                   ; wait for all threads to finish writing

    ; Now every thread can read any element from local memory
    ; e.g., read the neighbor's value
    ; (the last thread's neighbor slot was never written, so real code
    ; needs a bounds check or wrap-around here)
    iadd r4, r0, 1            ; neighbor index = tid + 1
    shl r5, r4, 2             ; neighbor byte offset
    local_load.u32 r6, r5     ; load neighbor's value from local memory

Atomic Operations
Atomics perform read-modify-write on a single memory location indivisibly. They work on both local and device memory.
Available Atomic Operations
| Operation | Syntax | Description |
|---|---|---|
| Add | atomic_add | *addr += val |
| Sub | atomic_sub | *addr -= val |
| Min | atomic_min | *addr = min(*addr, val) |
| Max | atomic_max | *addr = max(*addr, val) |
| And | atomic_and | *addr &= val |
| Or | atomic_or | *addr \|= val |
| Xor | atomic_xor | *addr ^= val |
| Exchange | atomic_exchange | old = *addr; *addr = val; return old |
| CAS | atomic_cas | Compare-and-swap |
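
The read-modify-write semantics in the table can be modeled on a CPU for reference. The Python sketch below covers a subset of the operations; AtomicCell and its method names are hypothetical, a lock stands in for hardware indivisibility, and returning the old value from min is an assumption, since the text only states that behavior for add, exchange, and CAS.

```python
import threading


class AtomicCell:
    """One memory location; a lock stands in for hardware indivisibility."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def _update(self, fn) -> int:
        # Read-modify-write as one indivisible step; returns the old value.
        with self._lock:
            old = self._value
            self._value = fn(old)
            return old

    def load(self) -> int:
        with self._lock:
            return self._value

    def atomic_add(self, val: int) -> int:
        return self._update(lambda old: old + val)

    def atomic_min(self, val: int) -> int:
        return self._update(lambda old: min(old, val))

    def atomic_exchange(self, val: int) -> int:
        return self._update(lambda old: val)

    def atomic_cas(self, expected: int, new: int) -> int:
        # Store `new` only if the current value equals `expected`.
        return self._update(lambda old: new if old == expected else old)


# Sixteen threads each add 1; no increment is lost.
counter = AtomicCell(0)
workers = [threading.Thread(target=counter.atomic_add, args=(1,)) for _ in range(16)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter.load())
```

A caller can tell whether a compare-and-swap succeeded by checking that the returned old value equals the expected value.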

    ; Atomically add r0 to the value at device address r1
    ; Result (old value) returned in r2
    atomic_add r2, r1, r0

    ; Compare-and-swap: if *r1 == r0, set *r1 = r3; old value in r2
    atomic_cas r2, r1, r0, r3

Memory Scopes
Every atomic and fence operation has a scope that determines which threads are guaranteed to observe the effect:
| Scope | Value | Visibility |
|---|---|---|
| wave | 0 | Threads within the same wave |
| workgroup | 1 | Threads within the same workgroup |
| device | 2 | All threads on the GPU |
| system | 3 | GPU and CPU (host) |
Choose the narrowest scope that satisfies your correctness requirements. Wider scopes are more expensive because the hardware must flush or invalidate more caches.
Fences
Fences enforce ordering of memory operations without operating on a specific address. WAVE provides three fence types:
| Fence | Guarantees |
|---|---|
| fence_acquire | All loads after this fence see writes that happened before a matching release. |
| fence_release | All writes before this fence are visible to threads that perform a matching acquire. |
| fence_acq_rel | Both acquire and release semantics combined. |
Each fence takes a scope:

    fence_release.workgroup    ; make all prior writes visible within the workgroup
    barrier                    ; synchronize threads
    fence_acquire.workgroup    ; see all writes from before the barrier

When to Use Each Scope
- wave: Communication between lanes in the same wave (usually handled by wave ops instead).
- workgroup: The most common scope. Use with barrier for local memory synchronization.
- device: Cross-workgroup communication through device memory (e.g., global counters, producer-consumer between workgroups).
- system: When the CPU needs to observe GPU writes, or vice versa (e.g., signaling completion to the host).
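
The guidance above amounts to picking the smallest scope that covers both communicating parties. A minimal Python sketch of that decision follows; the Scope values mirror the scope table, while the Agent record and the narrowest_scope helper are hypothetical, and the model assumes a single GPU with wave IDs numbered within each workgroup.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    WAVE = 0
    WORKGROUP = 1
    DEVICE = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Agent:
    """Hypothetical description of one party in the communication."""
    on_host: bool = False  # True for the CPU side
    workgroup: int = 0     # workgroup ID on the GPU
    wave: int = 0          # wave ID within the workgroup


def narrowest_scope(a: Agent, b: Agent) -> Scope:
    """Smallest scope whose visibility guarantee covers both agents."""
    if a.on_host or b.on_host:
        return Scope.SYSTEM     # the host must observe the effect
    if a.workgroup != b.workgroup:
        return Scope.DEVICE     # cross-workgroup communication
    if a.wave != b.wave:
        return Scope.WORKGROUP  # same workgroup, different waves
    return Scope.WAVE           # lanes of the same wave


producer = Agent(workgroup=0, wave=0)
consumer = Agent(workgroup=3, wave=1)
print(narrowest_scope(producer, consumer).name)
```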
Example: Device-Scope Counter

    ; Atomically increment a global counter visible to all workgroups
    mov_imm r0, 1
    atomic_add r1, r10, r0    ; r10 = address of global counter

    ; Ensure subsequent reads see the updated counter across the device
    fence_acq_rel.device

Barrier
barrier is a workgroup-level synchronization point. When a thread reaches a barrier, it waits until every thread in the workgroup has also reached it. This is the standard way to synchronize local memory access.
Rule of thumb: if one thread writes to local memory and another thread reads that address, there must be a barrier between the write and the read.

    local_store.u32 r0, r1    ; write
    barrier                   ; all threads sync here
    local_load.u32 r2, r3     ; safe to read any thread's write

Summary
| Operation | Local Memory | Device Memory |
|---|---|---|
| Load | local_load.{u8–u128} | device_load.{u8–u128}[.hint] |
| Store | local_store.{u8–u128} | device_store.{u8–u128}[.hint] |
| Atomics | atomic_* on local addr | atomic_* on device addr |
| Synchronization | barrier + fences | Fences with device/system scope |
Next: Control Flow - learn how branching, loops, and divergence work in WAVE assembly.