Metal Shading Language, Part 5: Threads, Threadgroups, and the Dispatch Model

Parts 1 through 4 covered the mental model, the pipeline structure, the type system, and the memory layout. All of it builds toward a practical question: when you write a compute kernel, how does the GPU know which thread is doing which piece of work, and how does the thread know the same?

The answer is the thread ID system. It is the single most important mechanism for writing correct compute code, and its design forces you to think about the work your kernel does in spatial terms, as a grid of independent computations rather than a sequential loop.

What a dispatch is

On the Swift side, launching a compute kernel looks like this:

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipelineState)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)

let gridSize = MTLSize(width: 1024, height: 1, depth: 1)
let threadgroupSize = MTLSize(width: 64, height: 1, depth: 1)
encoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)

encoder.endEncoding()
commandBuffer.commit()

This tells Metal: create a grid of 1024 x 1 x 1 threadgroups, each containing 64 x 1 x 1 threads. The total thread count is 1024 * 64 = 65,536 threads. Each one executes the kernel function once, simultaneously.

The GPU does not literally run all 65,536 threads in the same clock cycle. It runs them in SIMD groups of 32 or 64 (hardware dependent), scheduling those groups across its available execution units. From the shader's perspective, however, the model is exactly as described: each thread is an independent instance with its own position in the grid.

The three level thread hierarchy

The grid is three dimensional. Each dimension can be 1, allowing 1D or 2D grids when the problem is inherently 1D or 2D, or 3D for volumetric work:

Grid (total extent of the dispatch)
  └── Threadgroups (subdivisions of the grid)
        └── Threads (individual kernel invocations)

Each thread knows its position at all three levels through built in attributes:

kernel void my_kernel(
    // Position within the full grid (absolute)
    uint3 tid  [[thread_position_in_grid]],

    // Which threadgroup this thread belongs to
    uint3 tgid [[threadgroup_position_in_grid]],

    // Position within the threadgroup (local)
    uint3 lid  [[thread_position_in_threadgroup]],

    // Size of the threadgroup
    uint3 tgs  [[threads_per_threadgroup]],

    // Total size of the grid in threadgroups
    uint3 gs   [[threadgroups_per_grid]]
) { }

For a 1D problem, you declare uint instead of uint3 and the compiler selects the x component:

kernel void process_array(
    device float *data [[buffer(0)]],
    uint tid [[thread_position_in_grid]]
) {
    data[tid] = data[tid] * 2.0;
}

For a 2D problem (image processing):

kernel void process_image(
    texture2d<float, access::read> input [[texture(0)]],
    texture2d<float, access::write> output [[texture(1)]],
    uint2 tid [[thread_position_in_grid]]
) {
    float4 pixel = input.read(tid);
    output.write(float4(1.0 - pixel.rgb, pixel.a), tid);  // invert RGB
}

Choosing threadgroup size

The threadgroup size is the most consequential tuning decision in compute programming. It determines occupancy, SIMD group alignment, shared memory usage, and register pressure simultaneously. Getting it wrong does not produce incorrect results. It produces correct results at a fraction of possible throughput.

The SIMD group constraint. Apple GPU hardware executes threads in SIMD groups. The size of a SIMD group is hardware dependent; Apple GPUs use 32 threads per SIMD group. A threadgroup size that is not a multiple of the SIMD group size wastes hardware capacity: if your threadgroup has 48 threads, the hardware runs two SIMD groups (32 and 16 threads), but the second group's unused 16 threads still consume hardware resources. Prefer threadgroup sizes that are multiples of 32.

Common choices. For 1D kernels: 64, 128, or 256 threads per threadgroup. For 2D kernels: 16x16 (256 threads) or 8x8 (64 threads) are common. For 3D volumetric work: 4x4x4 (64 threads) or 8x4x2 (64 threads).

Querying the hardware. The pipeline state object knows the maximum threadgroup size for its kernel:

let maxTotalThreads = pipelineState.maxTotalThreadsPerThreadgroup
let threadExecutionWidth = pipelineState.threadExecutionWidth  // SIMD group size

The threadExecutionWidth is the SIMD group size for this specific pipeline state. Use it when you need the exact value rather than assuming 32.

Occupancy and register pressure. Larger threadgroups increase the number of in flight threads per execution unit, improving the GPU's ability to hide memory latency by switching to ready threads. But more threads per threadgroup means each thread has fewer registers before the compiler must spill to local memory, which is slower. And more threadgroup memory usage limits how many threadgroups the hardware can have in flight. These constraints push against each other. There is no universal answer, only empirical measurement.

Bounds checking

A practical problem: you often dispatch more threads than you have data elements to process. Threadgroup sizes are multiples of 32. Data sizes are arbitrary. A buffer of 1000 floats processed with 64 thread threadgroups requires 16 threadgroups (1024 threads) to cover it, but threads 1000 through 1023 have no corresponding data.

Always check bounds:

kernel void process(
    device float *data [[buffer(0)]],
    constant uint &count [[buffer(1)]],
    uint tid [[thread_position_in_grid]]
) {
    if (tid >= count) return;  // out-of-bounds threads do nothing
    data[tid] = sqrt(data[tid]);
}

The alternative, dispatchThreads(_:threadsPerThreadgroup:), handles this automatically by dispatching exactly as many threads as you specify, handling the remainder threadgroup internally:

// This dispatches exactly count threads, no bounds check needed in shader
encoder.dispatchThreads(
    MTLSize(width: count, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1)
)

The simulator and some older hardware do not support non uniform threadgroup sizes. dispatchThreadgroups(_:threadsPerThreadgroup:) with manual bounds checking is the safe fallback for maximum compatibility. This is the exact issue the PixelWave simulation encountered on the iOS Simulator, as described in the building PixelWave post.

The threadgroup as a cooperation unit

Threads within a threadgroup can cooperate through shared memory and synchronize with barriers. Threads in different threadgroups cannot communicate during a single dispatch. This boundary structures the design of parallel algorithms.

Consider a reduction: computing the sum of one million floats. A naive approach assigns one thread to each element and tries to sum them all. But how does thread 0 sum the results of threads 1 through 999,999? It cannot wait for them within the same dispatch. The typical pattern is hierarchical: each threadgroup computes a partial sum, writes it to device memory, and a subsequent dispatch reduces those partial sums to a final result.

kernel void partial_sum(
    device const float *input [[buffer(0)]],
    device float *partials [[buffer(1)]],
    threadgroup float *local [[threadgroup(0)]],
    uint tid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint tgid [[threadgroup_position_in_grid]],
    uint tgSize [[threads_per_threadgroup]]
) {
    // Bounds-safe load: zero-pad out-of-range threads
    uint count = /* ... passed as constant */;
    local[lid] = (tid < count) ? input[tid] : 0.0;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction within the threadgroup
    for (uint stride = tgSize >> 1; stride > 0; stride >>= 1) {
        if (lid < stride) {
            local[lid] += local[lid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Threadgroup leader writes the partial sum
    if (lid == 0) {
        partials[tgid.x] = local[0];
    }
}

After this kernel runs, partials has one value per threadgroup. A second dispatch reduces partials to a single final sum.

The two pass structure is the canonical answer to the "how do threadgroups communicate" question. They do not, within a single pass. Across passes, they communicate through device memory, with the CPU encoding the sequence of dispatches.

SIMD group functions

Within a SIMD group (typically 32 threads), Metal provides a set of intrinsic functions that operate across threads without explicit barriers. These are faster than threadgroup barrier based approaches because the hardware executes SIMD groups synchronously by definition.

kernel void simd_reduce(
    device const float *input [[buffer(0)]],
    device float *output [[buffer(1)]],
    uint tid [[thread_position_in_grid]],
    uint simdLane [[thread_index_in_simdgroup]],
    uint simdGroupID [[simdgroup_index_in_threadgroup]]
) {
    float value = input[tid];

    // Sum all values within the SIMD group, no barrier needed
    float simdSum = simd_sum(value);

    // Only lane 0 writes the result for this SIMD group
    if (simdLane == 0) {
        output[simdGroupID] = simdSum;
    }
}

simd_sum() is a built in that sums a value across all active lanes in the SIMD group. Similar functions exist for other reductions:

float result = simd_sum(value);      // sum
float result = simd_product(value);  // product
float result = simd_min(value);      // minimum
float result = simd_max(value);      // maximum
bool result  = simd_any(condition);  // logical or
bool result  = simd_all(condition);  // logical and

Permute functions let threads read each other's values by lane index:

float neighbor = simd_shuffle(value, (simdLane + 1) % 32);  // read from next lane
float broadcast = simd_broadcast(value, 0);                  // lane 0's value to all

SIMD group functions are the fastest form of inter thread communication available. When your algorithm can be expressed within a single SIMD group, prefer them over threadgroup shared memory and barriers.

Building intuition: the dispatch as a mapping

The most useful way to think about a compute dispatch is as a map from a grid of thread IDs to a grid of outputs. Your kernel is the function in that mapping. The thread ID is the input. The writes your kernel performs are the output. The GPU evaluates the mapping for every point in the grid simultaneously.

Most kernels follow the same shape: compute the index of the data this thread is responsible for (from the thread ID and the data layout), check bounds if necessary, read inputs from device or threadgroup memory, perform the computation in local register space, then write output to device memory. Image filters, physics simulations, neural network layers, sorting algorithms all reduce to this once you strip away their specifics.

A complete working example

A 2D image inversion kernel, fully self contained:

#include <metal_stdlib>
using namespace metal;

kernel void invert_image(
    texture2d<float, access::read>  source [[texture(0)]],
    texture2d<float, access::write> dest   [[texture(1)]],
    uint2 tid [[thread_position_in_grid]]
) {
    // Bounds check: do not process pixels outside the texture
    if (tid.x >= source.get_width() || tid.y >= source.get_height()) {
        return;
    }

    float4 pixel = source.read(tid);
    float4 inverted = float4(1.0 - pixel.rgb, pixel.a);  // preserve alpha
    dest.write(inverted, tid);
}

From Swift:

let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(invertPipeline)
encoder.setTexture(sourceTexture, index: 0)
encoder.setTexture(destTexture, index: 1)

let w = invertPipeline.threadExecutionWidth
let h = invertPipeline.maxTotalThreadsPerThreadgroup / w
let threadgroupSize = MTLSize(width: w, height: h, depth: 1)
let gridSize = MTLSize(
    width: (sourceTexture.width  + w - 1) / w,
    height: (sourceTexture.height + h - 1) / h,
    depth: 1
)
encoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

The threadgroup size comes from the pipeline state's own knowledge of what works best for this kernel. The grid size is computed to cover the full texture with an integer number of threadgroups, with bounds checking in the shader to handle the edges.

Part 6

The dispatch model explains how threads are organized. The next part covers the resource side: textures and samplers, the types that connect your shaders to image data, and the sample based texture access model that underlies nearly all rendering.