Metal Shading Language, Part 2: The Pipeline and the Three Functions

Part 1: The Machine That Thinks in Parallel established what the GPU is and why shader code does not behave like the code you ordinarily write. This part is about structure: the specific stages where your Metal code lives, what the hardware hands you at each one, and what it expects back.

Three function qualifiers do most of the work in MSL: vertex, fragment, and kernel. Each designates a function for a different position in the GPU's execution model. They are not interchangeable, and confusing them produces either a compiler error or a runtime crash with a validation message that tells you exactly nothing useful.

The render pipeline

Rendering geometry to a screen follows a fixed sequence. Some stages you control with code. Others are handled entirely by fixed function hardware that you configure but do not program.

The full sequence for a draw call:

Vertex buffer (your geometry data, on GPU)
       |
       v
  [Vertex Function]      ← you write this
       |
       v
  [Primitive Assembly]   ← fixed function
       |
       v
  [Rasterization]        ← fixed function
       |
       v
  [Fragment Function]    ← you write this
       |
       v
  [Depth/Stencil Test]   ← fixed function
       |
       v
  [Blending]             ← fixed function
       |
       v
  [Render Target]        ← your framebuffer

You write two shaders. The GPU runs everything between and after them automatically.

The vertex function

The vertex function runs once per vertex in your draw call. Its core responsibility is to produce a position in clip space, the coordinate system the rasterizer understands, typically by transforming the vertex from model space. It can also pass additional per vertex data down the pipeline.

The minimal vertex function:

#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
};

vertex VertexOut basic_vertex(
    uint vid [[vertex_id]],
    constant float2 *positions [[buffer(0)]]
) {
    VertexOut out;
    out.position = float4(positions[vid], 0.0, 1.0);
    return out;
}

Several things are happening here.

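[[vertex_id]] is a built in attribute. The hardware provides the index of the current vertex automatically: for a non indexed draw that starts at vertex 0 it counts up from zero, and for an indexed draw it is the value read from the index buffer (plus any base vertex). You declare it as a function parameter and the GPU fills it in.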

[[buffer(0)]] locates the argument in Metal's resource binding table. Your Swift code binds a buffer to slot 0 when encoding the draw call. The shader declares it needs something at slot 0, and the Metal runtime connects them.

The return type is a struct with a member tagged [[position]]. This attribute is mandatory whenever the vertex function returns data for rasterization. It tells the hardware which field contains the clip space position. The rasterizer reads this field to determine which fragments to generate. The other fields of the struct are passed through to the fragment function as interpolated values.

Clip space uses a four component homogeneous vector. The rasterizer divides xyz by w to produce normalized device coordinates (NDC). In Metal's NDC system, the visible volume runs from -1 to +1 in x and y, 0 to 1 in z. Anything outside that box gets clipped and no fragment is generated for it.
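The divide and the visibility test can be sketched on the CPU. What follows is plain C++ rather than MSL, and Float4, Float3, clip_to_ndc, and inside_clip_volume are hypothetical names invented for illustration, not Metal API. It is a simplified model of what the fixed function hardware does, not a description of its actual implementation.

```cpp
#include <cassert>

// Hypothetical CPU-side stand-ins for MSL's float4 and float3.
struct Float4 { float x, y, z, w; };
struct Float3 { float x, y, z; };

// Perspective divide: clip space -> normalized device coordinates.
Float3 clip_to_ndc(Float4 clip) {
    assert(clip.w != 0.0f);  // the divide is undefined for w == 0
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}

// True when a clip space point lies inside Metal's visible volume.
// Expressed before the divide: -w <= x <= w, -w <= y <= w, 0 <= z <= w.
bool inside_clip_volume(Float4 c) {
    return c.w > 0.0f &&
           -c.w <= c.x && c.x <= c.w &&
           -c.w <= c.y && c.y <= c.w &&
           0.0f <= c.z && c.z <= c.w;
}
```

A clip space position of (2, -4, 1, 4) divides down to (0.5, -1, 0.25) in NDC, which sits on the bottom edge of the visible box. The same point with x = 5 would fall outside it.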

What the vertex function must not do

It must not write to shared state that other vertex invocations read. Each vertex runs independently. There is no mechanism for one vertex shader to communicate with another during the same draw call. Every vertex you process is processed in isolation.

It must return a [[position]] whenever it feeds the rasterizer. Without it, the rasterizer has no geometry to process. The compiler enforces this for any vertex function with a non void return type.

Between vertex and fragment: what the rasterizer does

The rasterizer is not code you write, but it determines what your fragment function receives.

Given the clip space positions output by the vertex function, the rasterizer:

Assembles triangles (or lines, or points) from consecutive vertices
Clips geometry that extends outside the NDC box
Computes screen space coordinates for all fragments inside the triangle
Interpolates the non position outputs of the vertex function across the triangle's surface

That last step is important. Every field in your vertex output struct, aside from [[position]], gets interpolated across the triangle before the fragment function receives it (unless you opt out with an attribute such as [[flat]]). If your vertex shader outputs a color, the fragment shader receives a smoothly blended color at each pixel's position on the triangle, not the raw vertex values.

This interpolation is the rasterizer doing perspective correct barycentric interpolation. You do not calculate it. The hardware does. You just declare the fields you want interpolated, and they arrive at the fragment function already blended.
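To make the idea concrete, here is a minimal CPU sketch of perspective correct interpolation for a single scalar attribute, written in plain C++ rather than MSL. The function name and parameter layout are assumptions made for illustration. The formula is the standard one: divide each vertex value by that vertex's clip space w, blend with the screen space barycentric weights, then renormalize.

```cpp
#include <cassert>

// Hypothetical helper: interpolate one scalar attribute at a fragment.
// b0, b1, b2: screen space barycentric weights of the fragment (sum to 1).
// a0, a1, a2: the attribute's value at each vertex.
// w0, w1, w2: each vertex's clip space w.
float interpolate_perspective(float b0, float b1, float b2,
                              float a0, float a1, float a2,
                              float w0, float w1, float w2) {
    assert(w0 > 0.0f && w1 > 0.0f && w2 > 0.0f);
    // Blend attribute / w linearly in screen space...
    float num = b0 * a0 / w0 + b1 * a1 / w1 + b2 * a2 / w2;
    // ...and blend 1 / w the same way, then divide to undo the scaling.
    float den = b0 / w0 + b1 / w1 + b2 / w2;
    return num / den;
}
```

When all three w values are equal, this reduces to ordinary linear blending. When they differ, the result is pulled toward the nearer vertex: halfway on screen between a vertex with value 0 at w = 1 and a vertex with value 1 at w = 3, the interpolated value is 0.25 rather than 0.5.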

The fragment function

The fragment function runs once per fragment, where a fragment is roughly equivalent to a pixel covered by the geometry you are drawing. It receives the interpolated outputs of the vertex function and returns a color (or colors, if you are drawing to multiple render targets simultaneously).

fragment float4 basic_fragment(
    VertexOut in [[stage_in]]
) {
    return float4(1.0, 0.5, 0.0, 1.0); // solid orange
}

[[stage_in]] is the attribute that marks the struct receiving interpolated vertex outputs. Its fields must be compatible with the vertex function's return type, and the simplest way to guarantee that is to use the same struct for both. Metal validates this matching at pipeline creation time.

A fragment function that uses the interpolated data:

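// Extended VertexOut: replaces the minimal definition shown earlier
// (one .metal file can hold only one struct with this name).
struct VertexOut {
    float4 position [[position]];  // clip space out, screen space in
    float2 uv;                     // interpolated texture coordinate
    float3 color;                  // interpolated per vertex color
};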

fragment float4 textured_fragment(
    VertexOut in [[stage_in]],
    texture2d<float> albedo [[texture(0)]],
    sampler textureSampler [[sampler(0)]]
) {
    float4 texColor = albedo.sample(textureSampler, in.uv);
    return float4(texColor.rgb * in.color, texColor.a);
}

The in.uv and in.color fields arrived interpolated across the triangle. The fragment function samples a texture at the interpolated UV coordinate and modulates it by the interpolated per vertex color. This is the basis of virtually all real time 3D rendering.

What the fragment function can do that the vertex function cannot

Fragment functions can use derivative functions: dfdx(), dfdy(), fwidth(). These compute the rate of change of any value across adjacent fragments, which is how the GPU chooses which mip level to sample from a texture and how shader based anti aliasing estimates how fast a value changes near an edge. Derivatives work because fragments are shaded in 2×2 quads whose threads execute together in a SIMD group, so the hardware can compute differences between neighboring thread values directly.
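The neighbor differencing idea can be modeled on the CPU. This sketch is plain C++, not MSL, and Quad, quad_dfdx, quad_dfdy, and quad_fwidth are hypothetical names. It assumes a 2×2 block of fragments and simple differences between neighbors. Real hardware may use coarser or finer variants, so treat it as one plausible model.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model of one 2x2 fragment quad. v[row][col] holds the value
// of some expression (a UV component, say) in each of the four fragments.
struct Quad { float v[2][2]; };

// Difference between horizontal neighbors in a row: the idea behind dfdx().
float quad_dfdx(const Quad& q, int row) {
    assert(row == 0 || row == 1);
    return q.v[row][1] - q.v[row][0];
}

// Difference between vertical neighbors in a column: the idea behind dfdy().
float quad_dfdy(const Quad& q, int col) {
    assert(col == 0 || col == 1);
    return q.v[1][col] - q.v[0][col];
}

// fwidth(p) is the sum of the absolute derivatives: |dfdx(p)| + |dfdy(p)|.
float quad_fwidth(const Quad& q, int row, int col) {
    return std::fabs(quad_dfdx(q, row)) + std::fabs(quad_dfdy(q, col));
}
```

A large fwidth on a UV coordinate means the texture is changing quickly from one pixel to the next, which is the signal that pushes sampling toward a smaller mip level.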

Fragment functions can discard a fragment entirely with discard_fragment(). When called, the hardware does not write the fragment's output to the render target. This implements effects like punch through alpha transparency, where pixels whose alpha falls below a threshold are simply skipped.
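The control flow of punch through alpha can be sketched as a CPU model in plain C++. Color and shade_alpha_tested are hypothetical names, and the threshold is an assumed parameter. In a real fragment function the early return would be a call to discard_fragment() and the write would be the function's return value.

```cpp
#include <cassert>

struct Color { float r, g, b, a; };  // hypothetical stand-in for float4

// Models one fragment invocation with an alpha test. Returns false when the
// fragment is discarded, in which case target_pixel is left untouched.
bool shade_alpha_tested(Color texel, float threshold, Color& target_pixel) {
    if (texel.a < threshold) {
        return false;        // analogous to calling discard_fragment()
    }
    target_pixel = texel;    // analogous to the color write
    return true;
}
```

The point of the model is the untouched pixel: a discarded fragment does not write transparent black, it writes nothing, so whatever was already in the render target stays visible.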

Fragment functions can output depth explicitly with [[depth(any)]], overriding the hardware's depth interpolation. This supports techniques like parallax occlusion mapping, where the apparent depth of a surface differs from its geometric depth.

The compute function (kernel)

The kernel is the most general purpose of the three. It does not belong to a render pipeline at all. It runs in a compute pipeline, outside the vertex fragment rasterizer sequence entirely.

kernel void fill_buffer(
    device float *output [[buffer(0)]],
    uint tid [[thread_position_in_grid]]
) {
    output[tid] = float(tid) * 0.01;
}

A kernel function returns void. Always. The output is not a return value but a write into a buffer or texture that the kernel receives as an argument. The device qualifier marks memory in the GPU's device address space (more on address spaces in Part 4).

[[thread_position_in_grid]] is the kernel's equivalent of [[vertex_id]]: it tells each thread its unique position within the dispatch grid. Unlike vertex and fragment functions, which are implicitly sized to the geometry being drawn, a compute kernel's grid size is specified explicitly from Swift when you encode the dispatch:

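let gridSize = MTLSize(width: 1024, height: 1, depth: 1)        // total threads
let threadGroupSize = MTLSize(width: 64, height: 1, depth: 1)   // threads per group
// dispatchThreads takes the total thread count. dispatchThreadgroups would
// instead take the number of groups (here 1024 / 64 = 16).
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadGroupSize)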

The kernel will execute with 1024 * 1 * 1 total threads, arranged in groups of 64. Each thread receives a unique thread_position_in_grid from 0 to 1023.
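The arithmetic linking total threads, group size, and group count is worth seeing once. This is a CPU sketch in plain C++ with hypothetical function names, restricted to one dimension. It assumes the usual relationship between a thread's position in the grid, its threadgroup's position, and its position inside that threadgroup.

```cpp
#include <cassert>
#include <cstdint>

// Number of threadgroups needed to cover `total` threads, rounding up so a
// partial final group is still dispatched.
uint32_t threadgroups_for(uint32_t total, uint32_t per_group) {
    assert(per_group > 0);
    return (total + per_group - 1) / per_group;
}

// 1D relationship between the kernel's position values:
//   thread_position_in_grid =
//       threadgroup_position_in_grid * threads_per_threadgroup
//       + thread_position_in_threadgroup
uint32_t position_in_grid(uint32_t group, uint32_t per_group, uint32_t local) {
    assert(local < per_group);
    return group * per_group + local;
}
```

For 1024 threads in groups of 64 this gives 16 groups, and the last thread of the last group lands on index 1023. When the total is not a multiple of the group size, rounding up launches a few threads past the end of the data, so a kernel dispatched by group count generally needs a bounds check before it writes.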

Kernels are what you use for physics simulation, image processing, machine learning inference, and any computation that maps cleanly to "run this function on every element in a collection."

The PixelWave simulation from an earlier post on this site used a kernel for exactly this: the same wave equation update applied to every cell in a 2D height grid, 120 times per second, all cells simultaneously.

Linking vertex and fragment functions

Metal does not assume your vertex and fragment functions go together. You compose them into a render pipeline state object from Swift, and that composition is where the compiler validates their compatibility.

let descriptor = MTLRenderPipelineDescriptor()
descriptor.vertexFunction = library.makeFunction(name: "basic_vertex")
descriptor.fragmentFunction = library.makeFunction(name: "basic_fragment")
descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm

let pipelineState = try device.makeRenderPipelineState(descriptor: descriptor)

At the moment makeRenderPipelineState runs, Metal validates:

The vertex function's output struct fields match the fragment function's [[stage_in]] struct
The pixel formats are compatible with the return types
The resource bindings are consistent

Mismatches surface as errors here, at pipeline creation, not during the draw call. This is by design. You want to catch incompatibilities during initialization, not mid frame.

What each function type sees and cannot see

| | Vertex | Fragment | Kernel |
| --- | --- | --- | --- |
| Runs for | Each vertex | Each fragment (pixel) | Each thread in grid |
| Receives | Vertex buffers, textures, samplers, constants | Interpolated vertex outputs, textures, samplers, constants | Buffers, textures, samplers, threadgroup memory |
| Returns | Clip space position + interpolated data | Color(s), optional depth | void (writes to output buffers) |
| Can use derivatives | No | Yes | No |
| Can discard | No | Yes | No |
| Threadgroup sync | No | Limited | Yes |

The restrictions follow from the execution stage. Derivatives need every thread to have fixed screen space neighbors, which the 2×2 fragment quad provides and which vertex and kernel invocations lack. Discard is a framebuffer operation, so it only makes sense where a framebuffer write is pending. Threadgroup barriers block threads in the same group from proceeding until all threads reach the barrier. That requires knowing who is in the group, which is well defined in compute but not in fragment, where SIMD group boundaries are hardware determined.

The attribute syntax

The [[attribute_name]] syntax appears throughout MSL and annotates function parameters, return values, and struct members. Attributes serve as the communication channel between your code and the GPU pipeline stages.

You have already seen several:

[[position]]: clip space position output from vertex, screen space position input to fragment
[[vertex_id]]: current vertex index in the draw call
[[stage_in]]: marks the struct receiving interpolated vertex outputs in the fragment function
[[buffer(n)]]: binds the parameter to buffer slot n
[[texture(n)]]: binds to texture slot n
[[sampler(n)]]: binds to sampler slot n
[[thread_position_in_grid]]: absolute thread position in a compute dispatch

There are many more: intersection function inputs, mesh shader outputs, object function attributes. The full list is in the Metal Shading Language Specification and grows with each Metal version. The pattern is consistent: declare what you need, annotate it with the appropriate attribute, and the GPU fills it in.

The next parts

The pipeline gives you the structure. Now you need the language that runs inside it.

Part 3 covers MSL's type system: the scalar types, the vector and matrix types that are the language's real working tools, and the swizzle syntax that makes operating on components intuitive. Part 4 goes into address spaces, the mechanism that determines where data lives on the GPU and who can access it.

The pattern that emerges across all of this is consistent: Metal makes explicit what other programming models leave implicit. Where does this data live? Who can read it? Which execution stage am I in? That explicitness is the GPU telling you the truth about itself.