CPU programming has addresses. A pointer is a number that locates a value in a flat memory space. The operating system has virtual memory, caches complicate the picture, and NUMA architectures have non uniform access costs, but the mental model most programmers carry is simpler: memory is a long array, a pointer is an index into that array, and an access is an access.
GPU memory is organized differently. The GPU has distinct physical memory regions with different bandwidth, different latency, different caching behavior, and different rules about which code can read or write them. MSL makes these regions explicit through address space qualifiers. Every pointer, every buffer argument, every shared array carries an address space. The compiler enforces the rules and requires them on every declaration that touches memory.
Understanding address spaces means understanding where data lives and how fast you can get to it.
The four address spaces
device: GPU global memory
device is the GPU's main memory: the large pool that corresponds to what you think of as VRAM. Buffers you allocate on the Swift side with device.makeBuffer() live here. Textures live here. The data persists across frames and across multiple shaders in the same frame. Any shader on the GPU can access it.
kernel void process(
device float *input [[buffer(0)]],
device float *output [[buffer(1)]],
uint tid [[thread_position_in_grid]]
) {
output[tid] = input[tid] * 2.0;
}
device memory is large and slow relative to the other address spaces. On Apple silicon, device memory is unified with CPU memory (the same physical DRAM), which means you can write data on the CPU and read it on the GPU without explicit copy operations. The bandwidth is high but the latency is measured in hundreds of cycles. The GPU hides this latency by executing other threads while one set waits for memory, which is why keeping the GPU occupied with enough parallel work matters.
Access to device memory is cacheable. The GPU has L1 and L2 caches, and coalesced access patterns let adjacent threads' reads merge into single cache lines. Scattered random access defeats caching and collapses throughput. This is the access pattern concern from Part 1 in concrete terms.
By default, device pointers are read write. You declare read only device memory with const device:
kernel void sum(
const device float *data [[buffer(0)]],
device float *result [[buffer(1)]],
// ...
) { /* ... */ }
The const qualifier is not just documentation. On some hardware it enables additional caching modes optimized for read only data.
constant: small, fast, read only uniform data
constant address space is the GPU's fast path for data that every thread reads identically. Configuration values, transformation matrices, material parameters, time, screen resolution. Data that does not vary per thread and does not change during the shader's execution.
struct Uniforms {
float4x4 modelViewProjection;
float time;
float2 resolution;
float4 color;
};
vertex VertexOut my_vertex(
uint vid [[vertex_id]],
constant Uniforms &uniforms [[buffer(0)]],
device const Vertex *vertices [[buffer(1)]]
) {
VertexOut out;
out.position = uniforms.modelViewProjection * float4(vertices[vid].position, 1.0);
return out;
}
constant memory is read only. The compiler will reject any attempt to write to it. The hardware caches constant reads aggressively because all threads in a SIMD group reading the same address is a pattern the cache hardware understands and optimizes for.
The size of constant address space is limited. It is backed by special purpose cache that is much smaller than device memory. Uploading a megabyte of data as constant is not how it is meant to be used. Transformation matrices, per frame parameters, per material settings: small structs passed frequently to many threads. That is the pattern.
When passing small structs from Swift, you can embed them directly in the shader function argument table without a separate buffer allocation:
var uniforms = Uniforms(/* ... */)
encoder.setVertexBytes(&uniforms, length: MemoryLayout<Uniforms>.size, index: 0)
This is more efficient than allocating a buffer for data that fits in a few hundred bytes.
threadgroup: shared memory within one threadgroup
threadgroup memory is shared between all threads in the same threadgroup. It is physically located on chip, close to the execution units. Access is much faster than device memory, comparable to L1 cache or better. But its size is small (typically 16-32 KB) and it exists only for the duration of a threadgroup's execution.
kernel void reduce_sum(
device const float *input [[buffer(0)]],
device float *output [[buffer(1)]],
threadgroup float *localData [[threadgroup(0)]],
uint lid [[thread_position_in_threadgroup]],
uint tid [[thread_position_in_grid]],
uint tgSize [[threads_per_threadgroup]]
) {
// Each thread loads one element into fast shared memory
localData[lid] = input[tid];
// Synchronize: all threads must reach this barrier before any proceed
threadgroup_barrier(mem_flags::mem_threadgroup);
// Reduce within the threadgroup
for (uint stride = tgSize / 2; stride > 0; stride >>= 1) {
if (lid < stride) {
localData[lid] += localData[lid + stride];
}
threadgroup_barrier(mem_flags::mem_threadgroup);
}
// Thread 0 writes the threadgroup's sum to output
if (lid == 0) {
output[tid / tgSize] = localData[0];
}
}
The threadgroup_barrier() call is load bearing in this pattern. Without it, threads might read from localData before other threads have written their contributions. The barrier guarantees that all threads in the threadgroup reach it before any of them proceed past it. This is the primary synchronization mechanism within a threadgroup.
When you declare a threadgroup array in a function, you can either specify its size statically or receive it as a parameter from Swift. The parameter route lets you size the allocation based on runtime conditions:
encoder.setThreadgroupMemoryLength(
MemoryLayout<Float>.size * threadsPerGroup,
index: 0
)
threadgroup memory is unavailable in vertex and fragment shaders. It exists only in compute kernels and a few advanced shader types. This is one of the concrete differences in capability between the three function types.
thread: private register memory
thread address space is local to one thread. It corresponds to the CPU's stack: registers and local variables that exist for the duration of the function and are invisible to all other threads.
You never explicitly write the thread qualifier in most cases. Local variables are implicitly thread addressed:
kernel void example(/* ... */) {
float localValue = 3.14; // implicitly thread address space
float2 localVector = float2(1.0, 0.0); // also thread
// These are invisible to other threads
}
thread variables have zero access cost in the performance sense: they live in registers. Accessing a local variable is not a memory operation at all. This is the fastest storage you have. The limitation is size. GPU register files are finite, and declaring many local variables means fewer simultaneous threads the hardware can schedule, reducing occupancy and latency hiding. Keeping the local variable footprint small is one of the levers for improving GPU performance.
Why every pointer carries an address space
In C++, a pointer is a pointer. The machine decides where the memory came from; the type says nothing about it. In MSL, address spaces are part of the pointer type:
// These are four different types:
device float *p1; // pointer to float in device memory
constant float *p2; // pointer to float in constant memory
threadgroup float *p3; // pointer to float in threadgroup memory
thread float *p4; // pointer to float in thread-private memory
You cannot assign a device float * to a constant float *. The types are incompatible. The compiler knows, at every use site, which memory region a pointer points to, and can generate the correct machine instruction for that region. Each address space may require different instructions to access it on the hardware.
This prevents a class of bugs that plagues GPU programming in lower level APIs: passing the wrong kind of pointer and getting garbage data or undefined behavior. MSL's type system encodes the information the hardware cares about.
Memory coherency
The GPU's memory model is not as strong as the CPU's. Threads in different threadgroups do not have a guaranteed ordering of their writes to device memory relative to each other during the same dispatch. A write by thread A in threadgroup 1 may not be visible to thread B in threadgroup 2 during the same kernel execution.
This is by design. Guaranteeing global ordering across thousands of simultaneous threads would require expensive synchronization hardware that would destroy the throughput advantage of running them in parallel.
The practical implications:
Within a threadgroup, threadgroup_barrier(mem_flags::mem_threadgroup) guarantees that all threadgroup writes before the barrier are visible to all threads in the group after it. This is enough for the reduction pattern shown above.
// mem_flags options:
threadgroup_barrier(mem_flags::mem_threadgroup); // threadgroup memory fence
threadgroup_barrier(mem_flags::mem_device); // device memory fence
threadgroup_barrier(mem_flags::mem_none); // execution barrier only, no memory fence
For communication between threadgroups in the same dispatch, you typically need to split the work into multiple dispatches, with the CPU recording them separately. The Metal API's semaphore mechanisms and signaling primitives handle this at the command buffer level.
A concrete memory layout example
Consider a particle system update kernel. Each particle has a position, velocity, and lifetime:
struct Particle {
float3 position;
float3 velocity;
float lifetime;
float padding; // alignment
};
struct SimParams {
float dt;
float gravity;
uint particleCount;
};
kernel void update_particles(
device Particle *particles [[buffer(0)]], // read write, device
const device Particle *oldParticles [[buffer(1)]], // read only source
constant SimParams ¶ms [[buffer(2)]], // read only uniforms
uint tid [[thread_position_in_grid]]
) {
if (tid >= params.particleCount) return;
Particle p = oldParticles[tid]; // read from device memory
// Local arithmetic in register space (thread address space)
float3 newVelocity = p.velocity + float3(0, -params.gravity, 0) * params.dt;
float3 newPosition = p.position + newVelocity * params.dt;
float newLifetime = p.lifetime - params.dt;
// Write result to device memory
particles[tid].position = newPosition;
particles[tid].velocity = newVelocity;
particles[tid].lifetime = newLifetime;
}
The address spaces do real work here:
particlesisdevicebecause it needs to persist across frames and be writtenoldParticlesisconst devicebecause we only read it; theconstsignals intent and may improve cachingparamsisconstantbecause all threads read the same simulation parameters; binding it asconstantis faster thandevicefor broadcast readsp,newVelocity,newPosition,newLifetimeare local variables living in registers, costing nothing to access
The kernel reads from old state and writes to new state, keeping the buffers separate to avoid one thread reading data that another thread has already updated. This is the double buffering pattern that the PixelWave simulation used for its height field textures.
Ray data and object data
Two additional address spaces exist for specialized uses: ray_data for intersection functions in ray tracing pipelines, and object_data for mesh shaders in object mesh pipeline stages. These are advanced features that require specific hardware support and are worth their own treatment. For the purposes of learning MSL fundamentals, device, constant, threadgroup, and thread cover almost everything you will write.
Part 5
Address spaces describe where data lives. The next part is about how the hardware assigns the work: thread IDs, threadgroup sizes, and the dispatch model that connects your Swift code to the kernel executing on thousands of threads simultaneously.
Read the rest of the series
- Part 1: The Machine That Thinks in Parallel
- Part 2: The Pipeline and the Three Functions
- Part 3: Vectors, Matrices, and the Art of Swizzling
- Part 4: Address Spaces and Where Data Lives
- Part 5: Threads, Threadgroups, and the Dispatch Model
- Part 6: Textures, Samplers, and Reading Image Data
- Part 7: The Standard Library and Writing Real Shaders