Metal Shading Language, Part 1: The Machine That Thinks in Parallel

There is a moment, early in every programmer's encounter with a GPU, where confidence cracks. You know loops. You know functions. You know that code runs top to bottom, one instruction after the next, and that variables live somewhere you can point to. Then you look at a shader and nothing quite lines up, and the discomfort you feel is not ignorance. It is the correct sensation of a mental model running out of room.

This series is about Metal Shading Language. Not the API surface, not the Swift wrappers, not the pipeline descriptors and command buffers. The language itself. The thing that runs on the chip. To read it without confusion, you first have to understand the machine it runs on, which is a different kind of machine than the one you have been programming.

The CPU as you know it

A CPU is a serial executor wearing a disguise of speed. Modern CPUs have anywhere from four to thirty-two cores, and each core runs one or two hardware threads at a time. Those cores do work in parallel, but there are only a handful of them, and much of the impression of doing everything at once comes from context switching so fast you cannot perceive the seams. The hardware is optimized for low latency on complex, branchy, unpredictable work. A database query. A JSON parse. A user event handler. Tasks that depend on previous results, that branch a hundred times per function, that touch memory in patterns no predictor could anticipate.

The CPU is a virtuoso playing a solo. It handles one thing, then the next, and it handles each one with maximum skill and minimum delay.

The GPU as it actually is

A GPU is a different machine, built for a different problem.

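The GPU in an Apple A18 Pro has six cores, and each of those cores contains a wide array of simple arithmetic lanes, hundreds across the chip. Each lane is slower and simpler than a CPU core. It cannot do much on its own. What the GPU can do is run the same program on a massive number of data elements simultaneously. Not one after another. Not by time-slicing a single executor. Hundreds of lanes advance in the same clock cycle, with thousands more threads resident and ready to take over the moment one group stalls.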

This is the central fact. A fragment shader that computes the color of a pixel does not run once per frame. It runs once per covered pixel per frame, with as many of those invocations in flight at once as the hardware can hold. A 2,556 x 1,179 display has roughly three million pixels. Fill that screen and the GPU evaluates three million instances of your shader every frame, in batches far wider than any CPU could manage.

The hardware is optimized for throughput on regular, predictable work. Identical operations across independent data. Graphics pipelines. Physics simulations. Neural network training and inference. Any problem where the same calculation needs to happen to millions of things that do not need to wait for each other.

The GPU is ten thousand soldiers marching in lockstep. Each one is less capable than the CPU's virtuoso. Together, they move things no virtuoso could.

Why this matters for writing shaders

Every strange decision in MSL traces back to this constraint. When you write a function for a CPU, you are writing for one execution. When you write a shader, you are writing a template for thousands of simultaneous executions, each operating on different data, all running the same code.

This has consequences.

No implicitly shared mutable state between threads. Each shader invocation has its own registers and its own local variables. One pixel shader cannot read another pixel shader's intermediate values mid-execution. There is no synchronized shared heap the way you might reach for in Swift. Sharing data between threads requires explicit address space annotations and explicit synchronization primitives, and even then barriers only coordinate threads within a single threadgroup.

Branching is expensive in a specific way. GPU cores execute in SIMD groups, clusters of typically 32 threads that share the same instruction pointer. When threads within a group diverge at an if statement, the hardware executes both branches and masks the irrelevant results. Every divergent branch makes the group do more work. The cost shows up as wasted execution on paths the thread does not take, which is different from the branch prediction miss you know from CPU programming.
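
To make the masking concrete, here is a small CPU-side sketch in plain C++ (not MSL, so it can run anywhere). It models one 32-wide SIMD group evaluating a two-way branch and counts how many branch bodies the group has to step through. The names run_group, GroupResult, and kSimdWidth are hypothetical, and real hardware handles divergence with far more machinery than this.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model for illustration only. Not a Metal or hardware API.
constexpr std::size_t kSimdWidth = 32;  // group width quoted in the text

struct GroupResult {
    std::array<float, kSimdWidth> out;  // one result per lane
    int passes;                         // branch bodies the whole group stepped through
};

// Emulates `out = (x < threshold) ? x * 2 : x + 1` for one SIMD group.
GroupResult run_group(const std::array<float, kSimdWidth>& x, float threshold) {
    // Every lane evaluates the condition; the results form the execution mask.
    std::array<bool, kSimdWidth> mask{};
    std::size_t taken = 0;
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        mask[lane] = x[lane] < threshold;
        if (mask[lane]) ++taken;
    }

    GroupResult r{};
    // "Then" side: issued if at least one lane wants it. Masked-off lanes idle.
    if (taken > 0) {
        ++r.passes;
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
            if (mask[lane]) r.out[lane] = x[lane] * 2.0f;
        }
    }
    // "Else" side: issued if at least one lane wants it.
    if (taken < kSimdWidth) {
        ++r.passes;
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
            if (!mask[lane]) r.out[lane] = x[lane] + 1.0f;
        }
    }
    return r;
}
```

A group whose lanes all agree reports one pass. A group whose lanes split reports two, even though each individual lane only needed one of the results.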

Memory access patterns matter more. A CPU has a sophisticated cache hierarchy and prefetcher tuned to detect and accommodate irregular access patterns. A GPU prefers coalesced memory access: adjacent threads reading adjacent memory locations, so the hardware can satisfy many reads with a single wide fetch. Scatter your reads unpredictably and throughput collapses.
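
A rough way to see the difference, sketched in plain C++ on the CPU: count how many wide fetches a group of 32 threads needs when each thread reads one 4-byte float at a given stride. The 128-byte fetch width is an assumption for illustration, not a documented figure for any particular GPU, and wide_fetches is a hypothetical helper.

```cpp
#include <cstddef>
#include <set>

constexpr std::size_t kThreads = 32;      // threads reading together
constexpr std::size_t kElemBytes = 4;     // one 32-bit float per thread
constexpr std::size_t kFetchBytes = 128;  // ASSUMED width of a single wide fetch

// Returns how many distinct fetch-sized segments the group touches when
// thread t reads element t * stride_elems of a float array.
std::size_t wide_fetches(std::size_t stride_elems) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kThreads; ++t) {
        std::size_t byte_addr = t * stride_elems * kElemBytes;
        segments.insert(byte_addr / kFetchBytes);
    }
    return segments.size();
}
```

With a stride of 1, all 32 reads land in a single 128-byte segment. With a stride of 32, every thread lands in its own segment, so the same amount of useful data costs 32 fetches in this model.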

Latency hiding replaces latency minimization. A CPU core fights latency with large caches and out-of-order execution, and stalls when those are not enough. A GPU core, when waiting on memory, switches to a different set of threads resident on the same core and runs those until the data comes back. Hiding latency requires having enough in-flight work. This is why GPU programmers think about occupancy, the fraction of the hardware's thread slots that are filled, as a primary lever.
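
Occupancy is just a ratio, and a small C++ sketch shows how a shader's own resource use can pull it down. The model below assumes registers are the only limiting resource, which is a simplification, and any numbers you feed it are made up for illustration rather than taken from Apple hardware. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical single-resource model: real GPUs have other limits as well
// (threadgroup memory, for example), which this sketch ignores.
double occupancy(std::uint32_t thread_slots,
                 std::uint32_t register_file_size,
                 std::uint32_t registers_per_thread) {
    if (thread_slots == 0 || registers_per_thread == 0) return 0.0;
    // How many threads fit if registers are the only constraint.
    std::uint32_t fit_by_registers = register_file_size / registers_per_thread;
    // The core can never hold more threads than it has slots for.
    std::uint32_t resident = std::min(thread_slots, fit_by_registers);
    return static_cast<double>(resident) / static_cast<double>(thread_slots);
}
```

In this model a shader that needs four times as many registers per thread can leave half the slots empty, which means fewer threads to switch to while memory is outstanding.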

All of it directly shapes the language you write.

The execution model in MSL

Metal Shading Language names the units of execution explicitly, and you will see these names on every function signature you write.

A thread is one instance of your shader. One pixel. One vertex. One element of your compute grid. It has its own local variables and registers.

A threadgroup (called a workgroup in other APIs) is a collection of threads that can share memory via the threadgroup address space and synchronize with each other using threadgroup_barrier. On Apple GPUs, the hardware executes a threadgroup as one or more SIMD groups. For compute kernels in particular, this is where cooperation between nearby invocations is possible.

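A grid is the full set of threads launched by a single dispatch call. For a compute kernel, you specify the grid size when you encode the dispatch command from Swift. Render shaders have no grid that you size directly: a draw call's vertex count determines how many vertex shader threads run, and the rasterizer's coverage determines how many fragment shader threads run.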

The thread knows where it is within this hierarchy through built-in input attributes. A compute kernel might receive:

kernel void myKernel(
    uint2 tid [[thread_position_in_grid]],
    uint2 tgid [[threadgroup_position_in_grid]],
    uint2 lid [[thread_position_in_threadgroup]]
) {
    // tid: absolute position in the full grid
    // tgid: which threadgroup am I in?
    // lid: my position within my threadgroup
}

These are not function arguments you pass from Swift. They are injected by the hardware. Your shader declares which built-in values it needs, and the GPU fills them in for each invocation.
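
The three values are related by simple arithmetic, at least when every threadgroup in the dispatch has the same size. The C++ sketch below reproduces that relationship on the CPU, along with the rounding-up division you would use to decide how many threadgroups cover a grid. UInt2, position_in_grid, and threadgroups_to_cover are hypothetical stand-ins, not Metal types or functions.

```cpp
// Hypothetical stand-in for MSL's uint2.
struct UInt2 {
    unsigned x;
    unsigned y;
};

// Rebuilds thread_position_in_grid from the other two built-ins,
// assuming all threadgroups share the same dimensions.
UInt2 position_in_grid(UInt2 tgid, UInt2 threads_per_threadgroup, UInt2 lid) {
    return UInt2{tgid.x * threads_per_threadgroup.x + lid.x,
                 tgid.y * threads_per_threadgroup.y + lid.y};
}

// Number of threadgroups needed to cover a grid, rounding up on each axis.
// Precondition: both threadgroup dimensions are nonzero.
UInt2 threadgroups_to_cover(UInt2 grid, UInt2 threads_per_threadgroup) {
    return UInt2{(grid.x + threads_per_threadgroup.x - 1) / threads_per_threadgroup.x,
                 (grid.y + threads_per_threadgroup.y - 1) / threads_per_threadgroup.y};
}
```

When the grid is not a multiple of the threadgroup size, rounding up launches a few threads past the edge, so a kernel dispatched this way usually begins by checking its position against the real bounds.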

What MSL actually is

Metal Shading Language is based on C++14, with extensions for GPU programming and a handful of features removed. Not Swift. Not Objective-C. A C++ dialect that compiles to AIR (Apple Intermediate Representation) and then to GPU machine code, either ahead of time or at runtime through the Metal compiler toolchain.

The familiarity is intentional and useful. If you can read C++, you can read MSL. The differences are:

No recursion (GPUs do not have general purpose call stacks the way CPUs do)
No dynamic memory allocation (no new, no malloc, no heap as you know it)
No virtual functions, no exceptions, no RTTI
No double type (too wide for most GPU hardware)
Address spaces are explicit in type declarations
A rich set of vector and matrix types built into the language

What remains is a clean, fast language for data-parallel computation with first-class support for the math that graphics and compute workloads require.
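
As a sketch of what living inside those restrictions looks like, here is a tree sum written in plain C++ that avoids everything on the list: no recursion, no heap, no double. The recursion is replaced by a small fixed-size stack, a common pattern in GPU code. The function name, the implicit tree layout, and the stack capacity are all assumptions chosen for the example.

```cpp
// ASSUMED capacity, chosen for the example. A depth-first walk of a balanced
// tree needs roughly one slot per level, so 32 is generous.
constexpr int kMaxStack = 32;

// Sums an implicit binary tree stored in a flat array, where the children
// of node i live at 2*i + 1 and 2*i + 2. No recursion, no allocation.
float sum_tree(const float* nodes, int count) {
    int stack[kMaxStack];  // fixed-size storage instead of a call stack
    int top = 0;
    float total = 0.0f;

    if (count > 0) stack[top++] = 0;  // start at the root
    while (top > 0) {
        int i = stack[--top];
        total += nodes[i];

        int left = 2 * i + 1;
        int right = 2 * i + 2;
        // The capacity checks keep the sketch from writing past the stack.
        // A node dropped here would simply be left out of the sum.
        if (right < count && top < kMaxStack) stack[top++] = right;
        if (left < count && top < kMaxStack) stack[top++] = left;
    }
    return total;
}
```

The loop does the same work a recursive version would, but its memory use is fixed and known at compile time, which is the property the restrictions are protecting.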

A first look at the shape of things

Here is a fragment shader that colors every pixel a solid red:

#include <metal_stdlib>
using namespace metal;

fragment float4 solid_red(
    float4 position [[position]]
) {
    return float4(1.0, 0.0, 0.0, 1.0);
}

The fragment qualifier tells the Metal compiler this function runs in the fragment stage of the render pipeline. The return type float4 is a four-component vector of 32-bit floats, here representing an RGBA color. The [[position]] attribute on the input tells the hardware to provide the screen-space coordinates of the pixel being shaded. The return value is the color written to the render target, after any blending the pipeline is configured to apply.

This function runs for every pixel the rasterizer determines is covered by your geometry. If you draw a triangle that covers ten thousand pixels, this function executes ten thousand times, concurrently. Each invocation gets a different position. Each one returns a value. The GPU writes them all to the framebuffer without any of them waiting for any other.

That is the shape of shader programming. One function. Many simultaneous instances. Independent inputs, independent outputs, no shared mutable state between them.
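
One way to internalize that shape is to emulate it on the CPU. The C++ sketch below calls a stand-in for solid_red once per pixel in a loop. The loop is serial only because this is a CPU: nothing in the body depends on any other iteration, which is the property that lets a GPU run the invocations at the same time. Float4 and shade_all are hypothetical, and the half-pixel offset assumes positions are reported at pixel centers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MSL's float4.
struct Float4 {
    float x, y, z, w;
};

// CPU stand-in for the fragment function: same inputs, same output,
// no access to anything outside its own invocation.
Float4 solid_red(Float4 /*position*/) {
    return Float4{1.0f, 0.0f, 0.0f, 1.0f};
}

// Plays the role of the rasterizer plus render target for a fully
// covered width x height region.
std::vector<Float4> shade_all(std::size_t width, std::size_t height) {
    std::vector<Float4> framebuffer(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // ASSUMPTION: position is given at the pixel center.
            Float4 position{static_cast<float>(x) + 0.5f,
                            static_cast<float>(y) + 0.5f,
                            0.0f, 1.0f};
            // Each iteration writes exactly one slot and reads no other.
            framebuffer[y * width + x] = solid_red(position);
        }
    }
    return framebuffer;
}
```

Because every iteration is independent, the two loops could run in any order, or all at once, and the framebuffer would come out the same.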

The pipeline context

No shader runs in isolation. Metal's render pipeline is a sequence of programmable stages connected by fixed function hardware, and your shader code lives at specific positions in that sequence.

A full render pipeline, stripped to its skeleton:

[Vertex buffer]
      |
      v
[Vertex shader]   ← your code: runs once per vertex
      |
      v
[Rasterizer]      ← fixed function: converts triangles to fragments
      |
      v
[Fragment shader] ← your code: runs once per fragment (pixel)
      |
      v
[Render target]

A compute pipeline is simpler:

[Buffers / Textures]
         |
         v
[Compute kernel]  ← your code: runs once per thread in the grid
         |
         v
[Output buffers / Textures]

The CPU never touches the data between stages. It encodes commands into a command buffer, submits that buffer to the GPU, and then the GPU runs the entire pipeline asynchronously. Your Swift code and your Metal code execute on different processors and at different times. On Apple silicon the two share the same physical memory, but a shader sees only the resources you explicitly bind to it. The command buffer is the message you send across that boundary.
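
The split between encoding work and running it can be modeled with a toy class in C++. This is a deliberately tiny sketch and not the Metal API: ToyCommandBuffer, encode, commit, and pending are hypothetical names, and where a real command buffer hands its work to the GPU and returns immediately, this one simply runs the recorded commands in order when committed.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy model. Real Metal command buffers are created from a
// command queue and execute asynchronously on the GPU.
class ToyCommandBuffer {
public:
    // Recording a command does not run it.
    void encode(std::function<void()> command) {
        commands_.push_back(std::move(command));
    }

    // Runs every recorded command in the order it was encoded.
    void commit() {
        for (auto& command : commands_) command();
        commands_.clear();
    }

    std::size_t pending() const { return commands_.size(); }

private:
    std::vector<std::function<void()>> commands_;
};
```

The point of the model is the gap between encode and commit: after encoding, nothing has happened yet, and any state the commands touch is still exactly as it was.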

Before the next part

The concepts in this post are prerequisites for everything that follows. The series continues with the mechanics of the pipeline in more detail, then dives into MSL's type system, its address spaces, compute kernels, the texture and sampler model, and the standard library functions you reach for constantly when writing real shaders.

Each part assumes you have read the parts before it. The confusion that strikes GPU programmers usually comes not from the language syntax, which is simple, but from carrying CPU assumptions into a machine built differently. The unlearning is the work.

The code comes after.