A CPU program works primarily with scalars. You have an Int, a Double, a Bool. Composite data structures exist but the language treats them as constructions built from scalar primitives. The fundamental unit of computation is one value being operated on by one instruction.
GPU hardware is different at the arithmetic level. The execution units are designed to operate on groups of values simultaneously. A single instruction can add two float4 vectors, performing four floating point additions in one cycle, because the silicon was built with that width. MSL's type system reflects this directly. Vectors and matrices are first class primitive types with dedicated syntax, dedicated operators, and hardware instructions that correspond to them one to one.
Scalar types
MSL's scalar types are a restricted version of C++'s:
| Type | Width | Notes |
|---|---|---|
| bool | 1 byte | true / false |
| char / int8_t | 1 byte | Signed 8-bit |
| uchar / uint8_t | 1 byte | Unsigned 8-bit |
| short / int16_t | 2 bytes | Signed 16-bit |
| ushort / uint16_t | 2 bytes | Unsigned 16-bit |
| int / int32_t | 4 bytes | Signed 32-bit |
| uint / uint32_t | 4 bytes | Unsigned 32-bit |
| long / int64_t | 8 bytes | Signed 64-bit, Metal 2.2+ |
| half | 2 bytes | IEEE 754 binary16 |
| float | 4 bytes | IEEE 754 single precision |
| bfloat | 2 bytes | Brain float, Metal 3.1+ |
Notice what is absent. No double. No long double. The 64 bit floating point types that CPU programmers reach for reflexively are not part of MSL at all. GPU hardware historically lacks native 64 bit float support in its execution units. The language does not pretend otherwise.
half deserves particular attention. The 16-bit float trades precision for throughput: on Apple silicon, many operations on half values run at twice the rate of the same operations on float. When you are processing millions of pixels per frame and precision beyond a few decimal places is visually indistinguishable, the tradeoff is almost always worth taking. The Apple GPU Performance Guidelines recommend preferring half in fragment shaders wherever the precision is sufficient.
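To get a feel for how coarse "a few decimal places" is, the sketch below rounds a float to the 11 significant bits that binary16 keeps. It is plain C++ running on the CPU, not MSL. The helper name `round_to_half_precision` is hypothetical, and the sketch deliberately ignores half's limited exponent range (overflow past 65504 and subnormals).

```cpp
#include <cmath>

// Hypothetical helper: round x to 11 significant bits (1 implicit + 10 stored),
// the precision of IEEE 754 binary16. Exponent range limits are NOT modelled.
float round_to_half_precision(float x) {
    if (x == 0.0f || !std::isfinite(x)) return x;
    int exponent = 0;
    float mantissa = std::frexp(x, &exponent);  // x = mantissa * 2^exponent, 0.5 <= |mantissa| < 1
    float scaled = std::ldexp(mantissa, 11);    // move 11 bits to the left of the binary point
    float rounded = std::nearbyint(scaled);     // round to nearest (ties to even by default)
    return std::ldexp(rounded, exponent - 11);  // scale back to the original magnitude
}
```

Under these assumptions 0.1f comes back as 0.0999755859375 and 2049.0f comes back as 2048.0f. That is roughly three decimal digits: plenty for a color channel, not enough for large world space coordinates.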
Literal suffixes:
float a = 1.5f; // or 1.5F
half b = 1.5h; // or 1.5H
Without the suffix, a floating point literal is float. Not double, unlike in C++. The default is appropriate for the hardware.
Vector types
For every scalar type, MSL provides 2-, 3-, and 4-component vector variants:
float2 // two floats
float3 // three floats
float4 // four floats
half4 // four halves
int2 // two ints
uint3 // three unsigned ints
bool4 // four bools
These types are the daily currency of graphics programming. A position in space is a float3 or float4. A color is a float4 (RGBA). A texture coordinate is a float2. A screen space pixel coordinate is a uint2.
Constructing vectors
float4 a = float4(1.0, 2.0, 3.0, 4.0); // component by component
float4 b = float4(0.0); // all components set to 0.0
float2 xy = float2(1.0, 2.0);
float4 c = float4(xy, 3.0, 4.0); // combine smaller vectors with scalars
float4 d = float4(xy, float2(3.0, 4.0)); // combine two float2s
Metal consumes constructor arguments left to right, filling components in order. Apart from the single scalar form, which replicates one value across every component, the total number of scalar values across all arguments must exactly match the vector's component count. Supplying too few or too many is a compile error.
Accessing components
Three notations exist. The two named forms are interchangeable, and array indexing works alongside them:
float4 v = float4(1.0, 2.0, 3.0, 4.0);
// Coordinate notation
float x = v.x; // 1.0
float y = v.y; // 2.0
float z = v.z; // 3.0
float w = v.w; // 4.0
// Color notation
float r = v.r; // 1.0
float g = v.g; // 2.0
float b = v.b; // 3.0
float a = v.a; // 4.0
// Array notation
float first = v[0]; // 1.0
You cannot mix .xyzw and .rgba in a single access. The compiler rejects it. Pick one notation per expression and stay consistent.
Swizzling
Swizzling is the ability to select and rearrange multiple components in a single expression. It is not a function call. It is syntax built into the language, and it maps to efficient hardware shuffle instructions.
float4 v = float4(1.0, 2.0, 3.0, 4.0);
float2 xy = v.xy; // (1.0, 2.0)
float3 zyx = v.zyx; // (3.0, 2.0, 1.0), reversed
float4 xxzz = v.xxzz; // (1.0, 1.0, 3.0, 3.0), duplicated components
float3 www = v.www; // (4.0, 4.0, 4.0), broadcast one component
You can read the same component multiple times in a swizzle (useful for broadcasting a scalar into a vector). You cannot write to the same component twice in an assignment (ambiguous; the compiler rejects it):
// Legal: read x more than once
float4 broadcast = v.xxxx; // (1.0, 1.0, 1.0, 1.0)
// Illegal: write x twice
v.xx = float2(5.0, 6.0); // compile error
Swizzles work as lvalues for assignment:
float4 pos = float4(1.0, 2.0, 3.0, 4.0);
pos.xw = float2(9.0, 8.0); // pos is now (9.0, 2.0, 3.0, 8.0)
pos.yz = float2(7.0, 6.0); // pos is now (9.0, 7.0, 6.0, 8.0)
The components do not have to be in order:
pos.wx = float2(0.0, 1.0); // w gets 0.0, x gets 1.0
This is used constantly in real shader code. Converting between color formats, extracting depth from a depth stencil texture, constructing a normal vector from separate channels, packing multiple values into a single vector for output, all of it uses swizzle syntax. A shader that avoids swizzling is a shader that is fighting the language.
Arithmetic on vectors
The arithmetic operators work component wise on vectors of the same type:
float3 a = float3(1.0, 2.0, 3.0);
float3 b = float3(4.0, 5.0, 6.0);
float3 sum = a + b; // (5.0, 7.0, 9.0)
float3 diff = a - b; // (-3.0, -3.0, -3.0)
float3 prod = a * b; // (4.0, 10.0, 18.0), component wise, not dot product
float3 quot = a / b; // (0.25, 0.4, 0.5)
Scalar vector arithmetic broadcasts the scalar across all components:
float3 scaled = a * 2.0; // (2.0, 4.0, 6.0)
float3 offset = a + 1.0; // (2.0, 3.0, 4.0)
Note that a * b for vectors is elementwise multiplication, not dot product. The standard library provides dot(a, b) for that. The distinction matters. Using * when you meant dot() produces wrong geometry that compiles and runs without errors, and the bug will appear as subtly incorrect lighting or projection.
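The difference is easy to see when both operations are written out by hand. This is a CPU-side C++ sketch, not MSL: `Float3` is a stand-in for float3, and `componentwise_mul` is a hypothetical name for what the * operator does on vectors.

```cpp
#include <array>

using Float3 = std::array<float, 3>;  // stand-in for MSL's float3

// Models MSL's a * b on vectors: one product per component, result is a vector.
Float3 componentwise_mul(const Float3& a, const Float3& b) {
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2]};
}

// Models MSL's dot(a, b): the same products summed into a single scalar.
float dot(const Float3& a, const Float3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

For a = (1, 2, 3) and b = (4, 5, 6), the first returns the vector (4, 10, 18) and the second returns the scalar 32. The types differ, which is why the mistake usually hides behind a later swizzle or an implicit broadcast.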
The standard library: math built into the language
MSL's standard library is included with #include <metal_stdlib>. It provides the functions you use in practically every shader:
Geometric functions:
float d = dot(a, b); // dot product: sum of component products
float3 n = cross(a, b); // cross product (float3 only)
float len = length(a); // Euclidean length
float3 unit = normalize(a); // unit vector in direction of a
float dist = distance(a, b); // equivalent to length(a - b)
Interpolation:
float3 blended = mix(a, b, 0.5); // linear interpolation: a + (b - a) * t
float eased = smoothstep(0.0, 1.0, t); // smooth Hermite interpolation between the two edges
float clamped = clamp(value, 0.0, 1.0); // clamp to range
float sat = saturate(value); // clamp to [0, 1], equivalent to clamp(value, 0.0, 1.0)
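The interpolation functions are small enough to write out. The sketch below is scalar C++ on the CPU that follows the formulas in the comments above. It is a model of the behaviour, not the library's implementation, and it assumes edge0 < edge1.

```cpp
#include <algorithm>

// Models mix(a, b, t) for scalars: linear interpolation from a to b.
float mix(float a, float b, float t) {
    return a + (b - a) * t;
}

// Models saturate(x): clamp to [0, 1].
float saturate(float x) {
    return std::min(std::max(x, 0.0f), 1.0f);
}

// Models smoothstep(edge0, edge1, x): 0 below edge0, 1 above edge1,
// and the Hermite curve 3t^2 - 2t^3 in between. Assumes edge0 < edge1.
float smoothstep(float edge0, float edge1, float x) {
    float t = saturate((x - edge0) / (edge1 - edge0));
    return t * t * (3.0f - 2.0f * t);
}
```

The Hermite curve has zero slope at both edges, which is why smoothstep produces softer transitions than a clamped mix.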
Component wise selection:
float3 m = min(a, b); // component wise minimum
float3 x = max(a, b); // component wise maximum
float3 abs_a = abs(a); // component wise absolute value
float3 f = floor(a); // component wise floor
float3 c = ceil(a); // component wise ceiling
float3 fr = fract(a); // fractional part: a - floor(a)
Math:
float s = sin(angle);
float c = cos(angle);
float p = pow(base, exponent); // named exponent so it does not shadow exp()
float sq = sqrt(value);
float r = rsqrt(value); // reciprocal square root: 1/sqrt(value), often faster
float e = exp(x);
float l = log(x);
rsqrt is worth dwelling on. Normalizing a vector requires dividing by its length, which costs a square root followed by a division. Division is expensive. The reciprocal square root, computed directly from the magnitude squared, replaces both with one operation and a multiply:
// Same result up to rounding, but the second is often faster
float3 slow = a / length(a);
float3 fast = a * rsqrt(dot(a, a));
The hardware has a native reciprocal square root instruction. rsqrt maps to it directly.
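The two forms can be compared on the CPU. This is a C++ sketch, not MSL: `1.0f / std::sqrt(...)` stands in for the GPU's rsqrt instruction, and the function names are hypothetical. Both versions produce non-finite values for a zero vector, so guard against that case in real code.

```cpp
#include <array>
#include <cmath>

using Float3 = std::array<float, 3>;  // stand-in for MSL's float3

float dot(const Float3& a, const Float3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// a / length(a): one square root, then three divisions.
Float3 normalize_with_division(const Float3& a) {
    float len = std::sqrt(dot(a, a));
    return {a[0] / len, a[1] / len, a[2] / len};
}

// a * rsqrt(dot(a, a)): one reciprocal square root, then three multiplies.
Float3 normalize_with_rsqrt(const Float3& a) {
    float inv_len = 1.0f / std::sqrt(dot(a, a));  // stand-in for rsqrt
    return {a[0] * inv_len, a[1] * inv_len, a[2] * inv_len};
}
```

The results can differ in the last bit because the rounding happens in a different place, which is invisible in a lighting calculation.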
Matrix types
MSL provides matrix types with notation floatNxM, where N is the number of columns and M is the number of rows:
float2x2 // 2 columns, 2 rows
float3x3 // 3 columns, 3 rows
float4x4 // 4 columns, 4 rows
float3x4 // 3 columns, 4 rows (less common)
Matrices are constructed column by column:
// A 2x2 identity matrix
float2x2 identity = float2x2(
float2(1.0, 0.0), // first column
float2(0.0, 1.0) // second column
);
// A 4x4 translation matrix
float4x4 translation = float4x4(
float4(1.0, 0.0, 0.0, 0.0), // column 0
float4(0.0, 1.0, 0.0, 0.0), // column 1
float4(0.0, 0.0, 1.0, 0.0), // column 2
float4(tx, ty, tz, 1.0) // column 3
);
Matrix vector multiplication uses the * operator, but the convention matters. MSL uses column major matrices, and the standard operation for transforming a column vector is:
float4 position = modelViewProjection * float4(worldPos, 1.0);
This is the form you will see in almost every vertex shader. The float4(worldPos, 1.0) appends a 1.0 to convert a 3D position into a homogeneous vector, then multiplies by the combined MVP matrix to produce a clip space position.
Matrix matrix multiplication:
float4x4 mvp = projection * view * model;
The transpose() and determinant() functions are in the standard library. inverse() is not. For normal matrix computation, the inverse transpose of the model matrix is usually precomputed on the CPU and passed as a uniform, not computed per vertex in the shader.
Column access:
float4x4 m = /* ... */;
float4 col0 = m[0]; // first column
float element = m[1][2]; // column 1, row 2
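Column major storage and the m[column][row] indexing are easier to trust once you have seen the multiplication spelled out. The sketch below is CPU-side C++ with stand-in types (`Float4`, `Float4x4`) and hypothetical helper names (`mul`, `translation`). It models what the * operator does for a matrix times a column vector.

```cpp
#include <array>

using Float4 = std::array<float, 4>;     // stand-in for MSL's float4
using Float4x4 = std::array<Float4, 4>;  // four columns, so m[col][row] as in MSL

// Models m * v for a column vector v: a weighted sum of the columns of m.
Float4 mul(const Float4x4& m, const Float4& v) {
    Float4 out{0.0f, 0.0f, 0.0f, 0.0f};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            out[row] += m[col][row] * v[col];
        }
    }
    return out;
}

// Builds the translation matrix from the text: the offsets sit in column 3.
Float4x4 translation(float tx, float ty, float tz) {
    return {{
        {1.0f, 0.0f, 0.0f, 0.0f},  // column 0
        {0.0f, 1.0f, 0.0f, 0.0f},  // column 1
        {0.0f, 0.0f, 1.0f, 0.0f},  // column 2
        {tx,   ty,   tz,   1.0f}   // column 3
    }};
}
```

Multiplying translation(10, 20, 30) by the point (1, 2, 3, 1) gives (11, 22, 33, 1). The same matrix leaves the direction (1, 2, 3, 0) unchanged, because a w of 0 zeroes out column 3.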
Packed vector types
Standard vector types are aligned to their size. A float3 is 16-byte aligned, occupying the same memory as a float4 with its fourth component unused. This alignment is required for the hardware but wastes memory when you pack vertex data tightly.
When vertex data arrives from a CPU side buffer where positions, normals, and UV coordinates are packed contiguously without padding, you use packed vector types:
packed_float2 // 8 bytes, 4-byte aligned
packed_float3 // 12 bytes, 4-byte aligned (no padding)
packed_float4 // 16 bytes, 4-byte aligned
struct Vertex {
packed_float3 position;
packed_float3 normal;
packed_float2 uv;
};
A packed vertex structure matches the memory layout the CPU side produces naturally. The shader reads the packed representation and can convert to the aligned type for arithmetic:
float3 pos = float3(vert.position); // packed_float3 to float3; vert is a Vertex (vertex itself is a reserved word in MSL)
Arithmetic on packed types is slower than on aligned types on some hardware. The pattern is to read from packed and compute with aligned.
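The layout difference can be checked from the CPU side, where the vertex buffer is filled. This is a C++ sketch: `PackedVertex` mirrors the packed struct above, while `AlignedFloat3`, `AlignedFloat2`, and `AlignedVertex` are hypothetical stand-ins that imitate the size and alignment of float3 and float2 for comparison.

```cpp
#include <cstddef>

// CPU-side mirror of the packed Vertex: plain float arrays have 4-byte
// alignment and no padding, like packed_float3 and packed_float2.
struct PackedVertex {
    float position[3];  // matches packed_float3
    float normal[3];    // matches packed_float3
    float uv[2];        // matches packed_float2
};

// Hypothetical stand-ins for the aligned MSL types, for comparison only.
struct alignas(16) AlignedFloat3 { float x, y, z; };  // float3: 16 bytes, 16-byte aligned
struct alignas(8)  AlignedFloat2 { float x, y; };     // float2: 8 bytes, 8-byte aligned

struct AlignedVertex {
    AlignedFloat3 position;
    AlignedFloat3 normal;
    AlignedFloat2 uv;
};

static_assert(sizeof(PackedVertex) == 32, "8 floats, tightly packed");
static_assert(offsetof(PackedVertex, normal) == 12, "normal follows position directly");
static_assert(sizeof(AlignedVertex) == 48, "two 16-byte float3 slots plus tail padding");
```

The packed layout is 32 bytes per vertex and the aligned one is 48. If the shader struct used float3 while the CPU wrote the 32-byte layout, every vertex after the first would be read from the wrong offset.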
Type conversions
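Implicit conversions between scalar numeric types follow the usual C++ rules. Converting from int to float is implicit. Converting from float to int is also implicit and truncates toward zero, so an explicit cast is worth writing to make the truncation visible. Vectors are stricter: there is no implicit conversion from one vector type to another (an int4 does not become a float4 on assignment), so converting between vector types requires explicit construction.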
int i = 5;
float f = float(i); // explicit, fine
float g = i; // implicit promotion, also fine
float4 v4 = float4(1.0, 2.0, 3.0, 4.0);
float3 v3 = float3(v4.xyz); // explicit construction from swizzle
float3 v3b = v4.xyz; // also fine, swizzle result is already float3
Reinterpreting the bit pattern of one type as another uses as_type<T>():
float f = 1.0f;
uint bits = as_type<uint>(f); // IEEE 754 bits of 1.0f, which is 0x3F800000
This is useful for packing data, computing float based hashing tricks, and debugging bit patterns. It does not perform any arithmetic conversion. The bits transfer unchanged.
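The difference between reinterpreting and converting is easy to demonstrate on the CPU. In the C++ sketch below, `std::memcpy` plays the role of as_type. The helper names `float_bits` and `bits_to_float` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Models as_type<uint>(f): copy the 32 bits of a float into an unsigned integer.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "both types must be the same size");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof(bits));
    return bits;
}

// Models as_type<float>(bits): the reverse reinterpretation.
float bits_to_float(std::uint32_t bits) {
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

float_bits(1.0f) is 0x3F800000, while the arithmetic conversion static_cast<std::uint32_t>(1.0f) is simply 1. Same input, two unrelated answers.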
Boolean vectors and comparison
Comparison operators on vectors return bool vectors:
float4 a = float4(1.0, 2.0, 3.0, 4.0);
float4 b = float4(1.0, 3.0, 2.0, 4.0);
bool4 equal = a == b; // (true, false, false, true)
bool4 greater = a > b; // (false, false, true, false)
The any() and all() functions reduce a bool vector to a scalar:
bool anyTrue = any(equal); // true, because at least one component is true
bool allTrue = all(equal); // false, not all components are true
The select() function is the vector equivalent of the ternary operator:
// select(a, b, c) returns b where c is true and a where c is false
float4 result = select(a, b, equal);
// equal is (true, false, false, true), so:
// result = (b.x, a.y, a.z, b.w) = (1.0, 2.0, 3.0, 4.0)
select is preferable to branching in shaders because it avoids thread divergence. Both branches compute their values, and the condition selects which result to keep, all within a single instruction without splitting the SIMD group.
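The argument order of select trips people up, so it helps to see the semantics written out per component. This is a CPU-side C++ model, not MSL. The `vec_` prefixed names and the `Float4` and `Bool4` aliases are hypothetical stand-ins for the built-ins.

```cpp
#include <array>

using Float4 = std::array<float, 4>;  // stand-in for MSL's float4
using Bool4 = std::array<bool, 4>;    // stand-in for MSL's bool4

// Models a == b on vectors: one bool per component.
Bool4 vec_equal(const Float4& a, const Float4& b) {
    return {a[0] == b[0], a[1] == b[1], a[2] == b[2], a[3] == b[3]};
}

// Models select(a, b, c): b where c is true, a where c is false.
Float4 vec_select(const Float4& a, const Float4& b, const Bool4& c) {
    return {c[0] ? b[0] : a[0], c[1] ? b[1] : a[1],
            c[2] ? b[2] : a[2], c[3] ? b[3] : a[3]};
}

// Model any() and all(): reduce a bool vector to a single bool.
bool vec_any(const Bool4& c) { return c[0] || c[1] || c[2] || c[3]; }
bool vec_all(const Bool4& c) { return c[0] && c[1] && c[2] && c[3]; }
```

With a = (1, 2, 3, 4) and b = (1, 3, 2, 4), selecting on a == b gives (1, 2, 3, 4), and selecting on the mask (false, false, true, false) from a > b gives (1, 2, 2, 4).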
What this type system is for
Every type in this chapter serves the same purpose: enabling the GPU to operate on groups of values in single instructions. When you write float4 color = albedo * lighting, four multiplications happen at once in the same number of cycles as one. When you write float3 n = normalize(normal), the dot product, square root, and division happen in vector instructions across all three components simultaneously.
The type system is the direct interface to the parallelism that makes the hardware fast.
The programmer who reaches for a float when they need a float3 and then computes three separate multiplications has written correct code that runs at a fraction of the available throughput. The programmer who understands the type system writes the same operation in one line.
Part 4 covers address spaces: the explicit annotations that tell the GPU where data lives, which determines how fast you can access it and who else can touch it.
Read the rest of the series
- Part 1: The Machine That Thinks in Parallel
- Part 2: The Pipeline and the Three Functions
- Part 3: Vectors, Matrices, and the Art of Swizzling
- Part 4: Address Spaces and Where Data Lives
- Part 5: Threads, Threadgroups, and the Dispatch Model
- Part 6: Textures, Samplers, and Reading Image Data
- Part 7: The Standard Library and Writing Real Shaders