Metal Shading Language, Part 3: Vectors, Matrices, and the Art of Swizzling

A CPU program works primarily with scalars. You have an Int, a Double, a Bool. Composite data structures exist but the language treats them as constructions built from scalar primitives. The fundamental unit of computation is one value being operated on by one instruction.

GPU hardware is different at the arithmetic level. The execution units are designed to operate on groups of values simultaneously. A single line of source can add two float4 vectors, expressing four floating point additions as one operation, and the compiler schedules that work onto hardware built for parallel arithmetic. MSL's type system reflects this directly. Vectors and matrices are first class primitive types with dedicated syntax, dedicated operators, and built-in functions that the compiler lowers efficiently to the hardware.

Scalar types

MSL's scalar types are a restricted version of C++'s:

Type	Width	Notes
`bool`	1 byte	`true` / `false`
`char` / `int8_t`	1 byte	Signed 8-bit
`uchar` / `uint8_t`	1 byte	Unsigned 8-bit
`short` / `int16_t`	2 bytes	Signed 16-bit
`ushort` / `uint16_t`	2 bytes	Unsigned 16-bit
`int` / `int32_t`	4 bytes	Signed 32-bit
`uint` / `uint32_t`	4 bytes	Unsigned 32-bit
`long` / `int64_t`	8 bytes	Signed 64-bit, Metal 2.2+
`ulong` / `uint64_t`	8 bytes	Unsigned 64-bit, Metal 2.2+
`half`	2 bytes	IEEE 754 binary16
`float`	4 bytes	IEEE 754 single precision
`bfloat`	2 bytes	Brain float, Metal 3.1+

Notice what is absent. No double. No long double. The 64 bit floating point types that CPU programmers reach for reflexively are not supported anywhere in MSL. Apple GPUs have no native 64 bit float support in their execution units. The language does not pretend otherwise.

half deserves particular attention. The 16-bit float trades precision for throughput: on Apple silicon, many operations on half values run at twice the rate of the same operations on float. When you are processing millions of pixels per frame and precision beyond a few decimal places is visually indistinguishable, the tradeoff is almost always worth taking. The Apple GPU Performance Guidelines recommend preferring half in fragment shaders wherever the precision is sufficient.

Literal suffixes:

float a = 1.5f;   // or 1.5F
half  b = 1.5h;   // or 1.5H

Without the suffix, a floating point literal is float. Not double, unlike in C++. The default is appropriate for the hardware.
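As a small illustration of why the suffix matters, here is a sketch of a helper that stays in half precision from start to finish. The function name and its parameters are made up for this example.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: darkens a color while staying entirely in half precision.
half4 dim_color(half4 color, half amount)
{
    // The h suffix keeps this scalar expression in half. Writing 0.5f here
    // would promote the scalar arithmetic to float before the multiply.
    half scale = 1.0h - 0.5h * amount;

    half3 rgb = color.rgb * scale;
    return half4(rgb, color.a);
}
```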

Vector types

For every scalar type, MSL provides 2-, 3-, and 4-component vector variants:

float2  // two floats
float3  // three floats
float4  // four floats

half4   // four halves
int2    // two ints
uint3   // three unsigned ints
bool4   // four bools

These types are the daily currency of graphics programming. A position in space is a float3 or float4. A color is a float4 (RGBA). A texture coordinate is a float2. A screen space pixel coordinate is a uint2.

Constructing vectors

float4 a = float4(1.0, 2.0, 3.0, 4.0);  // component by component
float4 b = float4(0.0);                   // all components set to 0.0
float2 xy = float2(1.0, 2.0);
float4 c = float4(xy, 3.0, 4.0);         // combine smaller vectors with scalars
float4 d = float4(xy, float2(3.0, 4.0)); // combine two float2s

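Metal consumes constructor arguments left to right, filling components in order. The total number of scalar values across all arguments must exactly match the vector's component count; supplying too few or too many is a compile error. The one exception is the single scalar form, as in float4(0.0), which broadcasts that value to every component.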
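The counting rule is easiest to see next to a couple of constructors that break it. The following is a sketch only; the wrapper function exists just to hold the examples, and the rejected lines are left commented out so the rest compiles.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper that exists only to hold the examples.
float4 constructor_examples()
{
    float3 rgb = float3(0.2, 0.4, 0.6);

    float4 a = float4(rgb, 1.0);              // 3 + 1 = 4 components
    float4 b = float4(1.0, rgb);              // 1 + 3 = 4 components
    float4 c = float4(rgb.xy, 0.0, 1.0);      // 2 + 1 + 1 = 4 components

    // float4 tooFew  = float4(rgb);          // error: only 3 of 4 components
    // float4 tooMany = float4(rgb, rgb.xy);  // error: 5 values for 4 components

    return a + b + c;
}
```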

Accessing components

Three notations exist. The first two are interchangeable sets of component names, and the third is array indexing:

float4 v = float4(1.0, 2.0, 3.0, 4.0);

// Coordinate notation
float x = v.x;  // 1.0
float y = v.y;  // 2.0
float z = v.z;  // 3.0
float w = v.w;  // 4.0

// Color notation
float r = v.r;  // 1.0
float g = v.g;  // 2.0
float b = v.b;  // 3.0
float a = v.a;  // 4.0

// Array notation
float first = v[0];  // 1.0

You cannot mix .xyzw and .rgba in a single access. The compiler rejects it. Pick one notation per expression and stay consistent.

Swizzling

Swizzling is the ability to select and rearrange multiple components in a single expression. It is not a function call. It is syntax built into the language, and it maps to efficient hardware shuffle instructions.

float4 v = float4(1.0, 2.0, 3.0, 4.0);

float2 xy  = v.xy;    // (1.0, 2.0)
float3 zyx = v.zyx;   // (3.0, 2.0, 1.0), reversed
float4 xxzz = v.xxzz; // (1.0, 1.0, 3.0, 3.0), duplicated components
float3 www = v.www;   // (4.0, 4.0, 4.0), broadcast one component

You can read the same component multiple times in a swizzle (useful for broadcasting one component across a vector). You cannot write to the same component twice in an assignment (ambiguous; the compiler rejects it):

// Legal: read x four times
float4 broadcast = v.xxxx;  // (1.0, 1.0, 1.0, 1.0)

// Illegal: write x twice
v.xx = float2(5.0, 6.0);  // compile error

Swizzles work as lvalues for assignment:

float4 pos = float4(1.0, 2.0, 3.0, 4.0);
pos.xw = float2(9.0, 8.0);  // pos is now (9.0, 2.0, 3.0, 8.0)
pos.yz = float2(7.0, 6.0);  // pos is now (9.0, 7.0, 6.0, 8.0)

The components do not have to be in order:

pos.wx = float2(0.0, 1.0);  // w gets 0.0, x gets 1.0

This is used constantly in real shader code. Converting between color formats, extracting depth from a depth stencil texture, constructing a normal vector from separate channels, packing multiple values into a single vector for output, all of it uses swizzle syntax. A shader that avoids swizzling is a shader that is fighting the language.
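Two of those everyday uses can be sketched in a few lines. Both helper names below are hypothetical and are not part of the standard library.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: reorders a BGRA-ordered value into RGBA.
float4 bgra_to_rgba(float4 bgra)
{
    return bgra.zyxw;   // swap the first and third components, keep alpha
}

// Hypothetical helper: premultiplies color by alpha using an lvalue swizzle.
float4 premultiply(float4 color)
{
    color.rgb *= color.a;   // writes only r, g, and b; a is untouched
    return color;
}
```

The first function is a pure read swizzle, the second a write swizzle. Neither needs a temporary or a per-component assignment.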

Arithmetic on vectors

The arithmetic operators work component wise on vectors of the same type:

float3 a = float3(1.0, 2.0, 3.0);
float3 b = float3(4.0, 5.0, 6.0);

float3 sum = a + b;   // (5.0, 7.0, 9.0)
float3 diff = a - b;  // (-3.0, -3.0, -3.0)
float3 prod = a * b;  // (4.0, 10.0, 18.0), component wise, not dot product
float3 quot = a / b;  // (0.25, 0.4, 0.5)

Scalar vector arithmetic broadcasts the scalar across all components:

float3 scaled = a * 2.0;  // (2.0, 4.0, 6.0)
float3 offset = a + 1.0;  // (2.0, 3.0, 4.0)

Note that a * b for vectors is elementwise multiplication, not dot product. The standard library provides dot(a, b) for that. The distinction matters. Using * when you meant dot() produces wrong geometry that compiles and runs without errors, and the bug will appear as subtly incorrect lighting or projection.
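A diffuse lighting term is the classic place where this distinction shows up. The sketch below assumes a hypothetical helper named lambert and a toLight vector pointing from the surface toward the light.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: Lambertian diffuse term for a surface normal and
// a direction pointing toward the light.
float lambert(float3 normal, float3 toLight)
{
    float3 n = normalize(normal);
    float3 l = normalize(toLight);

    // Correct: dot() collapses the three products into one scalar.
    float nDotL = dot(n, l);

    // The bug the text describes: n * l is still a float3 of per-component
    // products. Assigning it to a float3 compiles without complaint.
    // float3 notADot = n * l;

    return saturate(nDotL);   // clamp to [0, 1] so light from behind contributes nothing
}
```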

The standard library: math built into the language

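MSL's standard library is included with #include <metal_stdlib>. Its functions live in the metal namespace, so most shaders follow the include with using namespace metal;. It provides the functions you use in practically every shader: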

Geometric functions:

float d = dot(a, b);         // dot product: sum of component products
float3 n = cross(a, b);      // cross product (float3 only)
float len = length(a);       // Euclidean length
float3 unit = normalize(a);  // unit vector in direction of a
float dist = distance(a, b); // equivalent to length(a - b)

Interpolation:

float3 lerp = mix(a, b, 0.5);             // linear interpolation: a + (b - a) * t
float s = smoothstep(0.0, 1.0, t);        // smooth Hermite interpolation
float c = clamp(value, 0.0, 1.0);         // clamp to range
float sat = saturate(value);              // clamp to [0, 1], equivalent to clamp(value, 0.0, 1.0)

Component wise functions:

float3 m = min(a, b);    // component wise minimum
float3 x = max(a, b);    // component wise maximum
float3 abs_a = abs(a);   // component wise absolute value
float3 f = floor(a);     // component wise floor
float3 c = ceil(a);      // component wise ceiling
float3 fr = fract(a);    // fractional part: a - floor(a)

Math:

float s = sin(angle);
float c = cos(angle);
float p = pow(base, exponent);  // named exponent so the variable does not shadow exp()
float sq = sqrt(value);
float r = rsqrt(value);  // reciprocal square root: 1/sqrt(value), often faster
float e = exp(x);
float l = log(x);

rsqrt is worth dwelling on. Normalizing a vector requires dividing by its length, which means a square root followed by a division. Division is expensive. The reciprocal square root, computed directly from the magnitude squared, replaces both with one operation and a multiply:

// Mathematically equivalent, but the second is typically faster
float3 slow = a / length(a);
float3 fast = a * rsqrt(dot(a, a));

The hardware has a native reciprocal square root instruction. rsqrt maps to it directly.
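One caveat with the fast form is a zero length input, where the reciprocal square root is not finite. A guarded variant might look like the sketch below; the name normalize_or_zero and the choice to return a zero vector are assumptions for illustration.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: normalizes v, returning zero for a zero-length input.
float3 normalize_or_zero(float3 v)
{
    float lenSq = dot(v, v);   // squared length, no square root needed yet

    // Without the guard, a zero vector would multiply 0 by a non-finite
    // reciprocal and produce NaN components.
    return lenSq > 0.0f ? v * rsqrt(lenSq) : float3(0.0f);
}
```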

Matrix types

MSL provides matrix types with notation floatNxM, where N is the number of columns and M is the number of rows:

float2x2  // 2 columns, 2 rows
float3x3  // 3 columns, 3 rows
float4x4  // 4 columns, 4 rows
float3x4  // 3 columns, 4 rows (less common)

Matrices are constructed column by column:

// A 2x2 identity matrix
float2x2 identity = float2x2(
    float2(1.0, 0.0),  // first column
    float2(0.0, 1.0)   // second column
);

// A 4x4 translation matrix
float4x4 translation = float4x4(
    float4(1.0, 0.0, 0.0, 0.0),  // column 0
    float4(0.0, 1.0, 0.0, 0.0),  // column 1
    float4(0.0, 0.0, 1.0, 0.0),  // column 2
    float4(tx,  ty,  tz,  1.0)   // column 3
);

Matrix vector multiplication uses the * operator, but the convention matters. MSL matrices are column major, and the standard operation for transforming a column vector puts the matrix on the left:

float4 position = modelViewProjection * float4(modelPos, 1.0);

This is the form you will see in almost every vertex shader. The float4(modelPos, 1.0) appends a 1.0 to convert a 3D model space position into a homogeneous vector, then multiplies by the combined MVP matrix to produce a clip space position.

Matrix matrix multiplication:

float4x4 mvp = projection * view * model;

The transpose() and determinant() functions are in the standard library. inverse() is notably absent: MSL provides no built-in matrix inverse. For normal matrix computation, the inverse transpose of the model matrix is usually precomputed on the CPU and passed as a uniform, not computed per vertex in the shader.
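Put together, a vertex function that consumes a precomputed normal matrix could look roughly like this. The struct layouts, field names, attribute indices, and buffer index are all assumptions for the sketch, not a required convention.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical uniform layout; the CPU fills both matrices once per draw.
struct Uniforms {
    float4x4 modelViewProjection;
    float3x3 normalMatrix;   // inverse transpose of the model matrix's upper 3x3
};

struct VertexIn {
    float3 position [[attribute(0)]];
    float3 normal   [[attribute(1)]];
};

struct VertexOut {
    float4 position [[position]];
    float3 normal;
};

vertex VertexOut transform_vertex(VertexIn in [[stage_in]],
                                  constant Uniforms &uniforms [[buffer(1)]])
{
    VertexOut out;
    // Matrix on the left, column vector on the right.
    out.position = uniforms.modelViewProjection * float4(in.position, 1.0);
    // No inverse in the shader: the normal matrix arrives ready to use.
    out.normal = normalize(uniforms.normalMatrix * in.normal);
    return out;
}
```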

Column access:

float4x4 m = /* ... */;
float4 col0 = m[0];  // first column
float  element = m[1][2];  // column 1, row 2
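Column indexing combines naturally with swizzles. As a sketch, assuming a transform laid out like the translation matrix above, with the offset stored in column 3, two hypothetical helpers could read and replace it:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: reads the translation stored in column 3.
float3 translation_of(float4x4 m)
{
    return m[3].xyz;
}

// Hypothetical helper: columns are assignable as well as readable.
float4x4 with_translation(float4x4 m, float3 t)
{
    m[3] = float4(t, m[3].w);   // replace x, y, z of column 3, keep w
    return m;
}
```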

Packed vector types

Standard vector types are aligned to their size. A float3 is 16-byte aligned, occupying the same memory as a float4 with its fourth component unused. This alignment is required for the hardware but wastes memory when you pack vertex data tightly.

When vertex data arrives from a CPU side buffer where positions, normals, and UV coordinates are packed contiguously without padding, you use packed vector types:

packed_float2  // 8 bytes, 4-byte aligned
packed_float3  // 12 bytes, 4-byte aligned (no padding)
packed_float4  // 16 bytes, 4-byte aligned

struct Vertex {
    packed_float3 position;
    packed_float3 normal;
    packed_float2 uv;
};

A packed vertex structure matches the memory layout the CPU side produces naturally. The shader reads the packed representation and can convert to the aligned type for arithmetic:

float3 pos = float3(in.position);  // packed_float3 to float3; in is a Vertex (vertex itself is a reserved word)

Arithmetic on packed types is slower than on aligned types on some hardware. The pattern is to read from packed and compute with aligned.
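A vertex function that follows this pattern might be sketched as follows. The struct repeats the Vertex layout above under a different name; the function name and the buffer indices 0 and 1 are assumptions.

```metal
#include <metal_stdlib>
using namespace metal;

// Same layout as the Vertex struct above: 12 + 12 + 8 = 32 bytes, no padding.
struct PackedVertex {
    packed_float3 position;
    packed_float3 normal;
    packed_float2 uv;
};

struct VertexOut {
    float4 position [[position]];
    float3 normal;
    float2 uv;
};

// Hypothetical vertex function; buffer indices 0 and 1 are assumptions.
vertex VertexOut packed_vertex_main(const device PackedVertex *vertices [[buffer(0)]],
                                    constant float4x4 &mvp [[buffer(1)]],
                                    uint vid [[vertex_id]])
{
    // Read from packed, convert once, then compute with aligned types.
    float3 position = float3(vertices[vid].position);
    float3 normal   = float3(vertices[vid].normal);
    float2 uv       = float2(vertices[vid].uv);

    VertexOut out;
    out.position = mvp * float4(position, 1.0);
    out.normal   = normal;
    out.uv       = uv;
    return out;
}
```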

Type conversions

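Implicit conversions between numeric types exist but follow narrower rules than in C++. Between scalars they behave as C++ programmers expect: int to float is implicit, and so is float to int, which discards the fractional part, so an explicit cast is the clearer way to show that the truncation is intended. Vectors are stricter. There is no implicit conversion from one vector type to another; changing the element type requires an explicit constructor, and changing the width requires a swizzle or a constructor.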

int i = 5;
float f = float(i);    // explicit, fine
float g = i;           // implicit promotion, also fine

float4 v4 = float4(1.0, 2.0, 3.0, 4.0);
float3 v3 = float3(v4.xyz);  // explicit construction from swizzle
float3 v3b = v4.xyz;         // also fine, swizzle result is already float3
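The same rules apply when the element type changes, not only the width. A short sketch, with a wrapper function that exists only to hold the examples:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper that exists only to hold the examples.
float4 vector_conversion_examples()
{
    int4 counts = int4(1, 2, 3, 4);

    float4 asFloat = float4(counts);       // explicit: converts each component
    // float4 bad = counts;                // error: no implicit vector-to-vector conversion

    half4 lowPrecision = half4(asFloat);   // explicit narrowing to half
    float4 splat = 2.0f;                   // a scalar converts implicitly and fills every component

    return asFloat + float4(lowPrecision) + splat;
}
```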

Reinterpreting the bit pattern of one type as another uses as_type<T>():

float f = 1.0f;
uint bits = as_type<uint>(f);  // IEEE 754 bits of 1.0f, which is 0x3F800000

This is useful for packing data, computing float based hashing tricks, and debugging bit patterns. It does not perform any arithmetic conversion. The bits transfer unchanged, which is why the source and destination types must be the same size.
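Two small uses, sketched with hypothetical helper names: storing a pair of half values in one 32-bit integer, and testing a float's sign bit directly.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helpers. half2 and uint are both 4 bytes, so the sizes match.
uint pack_half2(half2 v)
{
    return as_type<uint>(v);
}

half2 unpack_half2(uint bits)
{
    return as_type<half2>(bits);
}

// Tests the IEEE 754 sign bit; true for negative values, including -0.0.
bool sign_bit_set(float f)
{
    return (as_type<uint>(f) & 0x80000000u) != 0u;
}
```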

Boolean vectors and comparison

Comparison operators on vectors return bool vectors:

float4 a = float4(1.0, 2.0, 3.0, 4.0);
float4 b = float4(1.0, 3.0, 2.0, 4.0);

bool4 equal = a == b;  // (true, false, false, true)
bool4 greater = a > b; // (false, false, true, false)

The any() and all() functions reduce a bool vector to a scalar:

bool anyTrue = any(equal);   // true, because at least one component is true
bool allTrue = all(equal);   // false, not all components are true

The select() function is the vector equivalent of the ternary operator:

// select(false_value, true_value, condition)
float4 result = select(a, b, greater);  // (a.x, a.y, b.z, a.w)
// where greater is (false, false, true, false):
// select picks b where the condition is true, a where it is false
// result = (1.0, 2.0, 2.0, 4.0), the same as min(a, b)

select is often preferable to branching in shaders because it avoids thread divergence. Both values are computed, and the condition selects which result to keep, without splitting the SIMD group. Because both sides are always evaluated, it pays off most when each side is cheap to compute.
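As a sketch of the idea, here is a hypothetical threshold helper that zeroes every component at or below a cutoff without a branch.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical helper: keeps components above the threshold, zeroes the rest.
float4 apply_threshold(float4 color, float threshold)
{
    // Per-component comparison produces a bool4 mask.
    bool4 bright = color > float4(threshold);

    // select(false_value, true_value, condition):
    // color where bright is true, 0.0 where it is false.
    return select(float4(0.0), color, bright);
}
```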

What this type system is for

Every type in this chapter serves the same purpose: enabling the GPU to operate on groups of values in single instructions. When you write float4 color = albedo * lighting, four multiplications happen at once in the same number of cycles as one. When you write float3 n = normalize(normal), the dot product, square root, and division happen in vector instructions across all three components simultaneously.

The type system is the direct interface to the parallelism that makes the hardware fast.

The programmer who reaches for a float when they need a float3 and then computes three separate multiplications has written correct code that runs at a fraction of the available throughput. The programmer who understands the type system writes the same operation in one line.

Part 4 covers address spaces: the explicit annotations that tell the GPU where data lives, which determines how fast you can access it and who else can touch it.