We got the original Speech framework back in 2016. For nearly a decade, it ran on exactly one abstraction: SFSpeechRecognizer. You gave it audio, it sent that audio somewhere into the void, and eventually it fired a callback with a text string. The API was tiny. The behavior was completely opaque. You just asked for permission, attached a handler, and hoped it worked.
The new API in iOS 26 throws that entire model in the trash.
The new center of gravity is SpeechAnalyzer. The old recognizer was a black box that only did one thing. The analyzer is just a traffic cop. It accepts an array of distinct modules, and it routes your audio buffers through each of them. Transcription is one module. Voice activity detection is a completely separate module. Every module exposes its own dedicated result stream.
You configure exactly what you need, and you monitor exactly what you asked for.
How the modules actually work
Every capability in this new framework conforms to the SpeechModule protocol. The actual contract is very simple. A module declares which AVAudioFormat instances it understands, and it exposes an asynchronous stream of results. Modules that care about language will also conform to a locale protocol that helps resolve the system settings into something the framework actually understands.
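As an illustration of that contract, the shape is something like this. This is a sketch of the idea, not the literal declaration from the SDK, so treat the member names as assumptions:

protocol SpeechModule {
    // The stream type this module publishes its results on.
    associatedtype Results: AsyncSequence
    // The audio formats this module is able to consume.
    var availableCompatibleAudioFormats: [AVAudioFormat] { get }
    // Results arrive asynchronously as the analyzer routes audio through.
    var results: Results { get }
}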
Apple ships three concrete modules out of the box.
SpeechTranscriber is the main engine. It converts audio to text. You give it a locale and some reporting options, and it spits back chunks of AttributedString text. The amazing part is that every single recognized word can carry an embedded time range attribute marking exactly where it appeared in the audio buffer. It also carries a confidence score attribute. Both are baked straight into the text result, so you can pull them out using standard attributed string APIs.
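Reading those attributes looks roughly like this, given one result from the transcriber's stream. A sketch: the run property names audioTimeRange and transcriptionConfidence are assumptions here, so check the speech attribute scope in the SDK for the exact spelling.

// Walk the runs of the attributed result and pull per-word metadata.
for run in result.text.runs {
    if let timeRange = run.audioTimeRange {           // where this word sits in the audio (assumed name)
        print("word spans \(timeRange.start.seconds)s to \(timeRange.end.seconds)s")
    }
    if let confidence = run.transcriptionConfidence { // recognizer certainty (assumed name)
        print("confidence \(confidence)")
    }
}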
DictationTranscriber is heavily tuned for human input. You use this when a user is actively tapping a microphone button to talk to your app. You can pass it environmental hints so the engine knows how to behave. You can tell it the speech is atypical, or that the user is holding the phone far away from their mouth. It even supports automatic emoji and punctuation insertion based strictly on the user's vocal inflections.
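Construction looks roughly like this. A sketch assuming hint case names along the lines of .farField and .atypicalSpeech; verify the exact ContentHint values against the SDK before relying on them.

let dictation = DictationTranscriber(
    locale: Locale(identifier: "en-US"),
    contentHints: [.farField, .atypicalSpeech],  // assumed case names for the environmental hints
    transcriptionOptions: [],
    reportingOptions: [],
    attributeOptions: []
)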
SpeechDetector is completely different. It literally only tells you if a human is speaking right now. It returns a boolean. There is no transcription, no language modeling, and absolutely zero network calls. You can dial its sensitivity to low, medium, or high. This is what you should use to gate your heavy transcription logic. Running the detector first stops you from feeding dead silence into the actual transcriber.
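The gating setup looks something like this, assuming the detectionOptions initializer and sensitivity cases sketched here, and a transcriber like the one built in the next section:

// Run a cheap detector alongside the expensive transcriber.
let detector = SpeechDetector(
    detectionOptions: .init(sensitivityLevel: .medium),  // .low, .medium, or .high
    reportResults: true
)
let analyzer = SpeechAnalyzer(modules: [detector, transcriber])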
Creating an analyzer
The basic setup just takes an array of modules.
let transcriber = SpeechTranscriber(
    locale: Locale(identifier: "en-US"),
    preset: .progressiveTranscription
)
let analyzer = SpeechAnalyzer(modules: [transcriber])
The analyzer also takes an options parameter that controls the underlying task priority and the model retention policy. The retention policy dictates how long the massive ML model stays in RAM after the analyzer finishes its job. A policy of .whileInUse aggressively dumps the model from memory as soon as the work ends. A policy of .lingering keeps it hot for a few seconds just in case the user taps the microphone button again. Keeping it lingering avoids a giant latency spike on the next request. The right choice completely depends on how often your users hit the feature.
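Passing those options explicitly looks like this. A sketch, assuming an Options type with priority and modelRetention parameters:

let analyzer = SpeechAnalyzer(
    modules: [transcriber],
    options: SpeechAnalyzer.Options(
        priority: .userInitiated,   // task priority for the analysis work
        modelRetention: .lingering  // keep the model warm for a follow-up request
    )
)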
Audio flows into the system through an AnalyzerInput struct. This is basically just an AVAudioPCMBuffer with an optional start time attached to it. You feed these into the analyzer using an asynchronous sequence. The analyzer pulls them in order and pushes the frames to your configured modules.
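Once the analyzer is prepared (covered next), the feeding flow looks roughly like this. A minimal sketch: pcmBuffer stands in for whatever your AVAudioEngine tap delivers, and the makeStream bridge plus start(inputSequence:) entry point should be verified against the SDK.

// Make a stream the analyzer can pull from, plus a continuation we push into.
let (inputSequence, inputBuilder) = AsyncStream<AnalyzerInput>.makeStream()

// The analyzer starts pulling inputs in order.
try await analyzer.start(inputSequence: inputSequence)

// From the audio callback, wrap each buffer and push it in.
inputBuilder.yield(AnalyzerInput(buffer: pcmBuffer))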
Before you start pumping audio, you have to call prepareToAnalyze(in:). This warms up the ML model and suspends your code until the engine is actually ready. You have to tell it exactly what audio format you plan to provide. If you guess the format wrong here, the analyzer will throw an error the second you feed it live microphone data.
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
do {
    try await analyzer.prepareToAnalyze(in: format)
} catch {
    // Handle missing or incompatible models here.
}
Volatile results and finalization
The biggest departure from the old API is how this framework handles partial results.
When you process live microphone audio, the recognizer's early guesses are volatile. It only commits to a block of text after it hears enough trailing context to be mathematically confident. Until that moment, it just emits a stream of volatile guesses. It will constantly change its mind about what the user just said as more audio arrives.
The analyzer makes this explicit through the volatileRange property. You can literally subscribe to this range and watch it grow and shrink as the engine tries to figure out the sentence structure.
This forces an architectural discipline we really lacked before. The old recognizer just dumped random partial text updates on our delegate method. Most of us just shoved that exact text directly into a UI label. That is exactly why users saw the text constantly flickering and overwriting itself. Now, we actually have the data required to visually separate the safely committed text from the temporary guesses.
Ending a session is deliberately explicit. When the user stops talking, you call finalizeAndFinishThroughEndOfInput(). That tells the system to commit all the pending volatile guesses, close the stream, and release the memory.
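Pulling it together, the consumption side looks roughly like this. A sketch, assuming each result exposes its text plus an isFinal flag splitting committed output from volatile guesses:

var committedText = AttributedString("")
var volatileText = AttributedString("")

// Route finalized and volatile results to different parts of the UI.
for try await result in transcriber.results {
    if result.isFinal {
        committedText.append(result.text)   // safe to persist
        volatileText = AttributedString("")
    } else {
        volatileText = result.text          // render dimmed; it will change
    }
}

// Elsewhere, when the user taps stop:
try await analyzer.finalizeAndFinishThroughEndOfInput()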
Configuring presets
Building a transcriber requires setting up reporting options and attribute flags.
The standard SpeechTranscriber has a preset called progressiveTranscription that gives you the volatile streams we just talked about. If you want timing data, you add the timeIndexed variant to force every word to report its exact millisecond offset. If the audio quality is terrible, you can request alternative transcriptions. That forces the engine to return an array of backup guesses whenever it feels unconfident about its primary result.
DictationTranscriber behaves a little differently. It adds a frequentFinalization reporting option. Normal transcription waits until it is highly confident before committing a sentence. Dictation optimizes for raw visual speed because the user is staring at the screen waiting for their words to appear. Frequent finalization forces the engine to commit text instantly, accepting that the grammar might be slightly worse.
Watch out for .fastResults. It forces the model to return text way earlier than its internal safety threshold normally allows. You get incredible speed, but you absolutely slaughter your accuracy. It is perfect if you just want to make the UI look immediately responsive, but it is a terrible idea if you are actually saving that transcript to a database.
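When a preset doesn't fit, you spell the option sets out yourself. A sketch, assuming the longer initializer and option names like .volatileResults and .audioTimeRange; verify the exact cases against the SDK:

let transcriber = SpeechTranscriber(
    locale: Locale(identifier: "en-US"),
    transcriptionOptions: [],
    reportingOptions: [.volatileResults, .alternativeTranscriptions], // add .fastResults at your own risk
    attributeOptions: [.audioTimeRange, .transcriptionConfidence]     // per-word timing and confidence
)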
Locale and asset management
We have always needed massive local models for offline recognition. The new API brings that asset lifecycle front and center using AssetInventory.
Checking the status of a module returns a clear state. It is either installed, supported but not downloaded, currently downloading, or completely unsupported by the current hardware.
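A sketch of that check, assuming a status(forModules:) entry point returning the four states just listed:

switch await AssetInventory.status(forModules: [transcriber]) {
case .installed:   break                                    // ready to go
case .supported:   print("model needs a download")          // trigger the request below
case .downloading: print("download already in flight")
case .unsupported: print("not available on this hardware")
@unknown default:  break
}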
When a language is supported but missing, you can trigger an explicit download request.
if let request = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
    try await request.downloadAndInstall()
}
If you are building an app with heavy multilingual support, the framework lets you explicitly reserve up to the maximum allowed number of locales. Reserving them pins the ML models to the local storage. That guarantees that when a user swaps their app language in the settings menu, they do not have to wait for a massive background download before they can dictate a message.
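Reservation itself is one call per locale. A sketch, assuming a reserve(locale:) method plus reservedLocales and maximumReservedLocales accessors; treat those names as assumptions:

// Pin the German model so a language switch never blocks on a download.
if await AssetInventory.reservedLocales.count < AssetInventory.maximumReservedLocales {
    try await AssetInventory.reserve(locale: Locale(identifier: "de-DE"))
}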
Feeding context to the model
AnalysisContext lets you feed domain specific hints to the recognizer while it runs. You do this by passing a dictionary of contextual strings.
If you are building a restaurant app, you pass the names of local dishes. If you are building a messaging app, you pass the names in the user's address book. If you are building an IDE, you pass the variables from the current file.
These context strings just add mathematical weight to specific words in the scoring engine. They do not override the actual phonetic dictionary. They just tip the scales when the audio sounds ambiguous. The effect is massive when you have users speaking extremely specific technical vocabulary that the base model would normally get wrong.
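A sketch of the shape described above, for the restaurant case. The property name, dictionary key, and hand-off call here are assumptions, not confirmed API:

// Hypothetical names throughout: a dictionary of contextual strings,
// handed to the running analyzer.
let context = AnalysisContext()
context.contextualStrings = [.general: ["cacio e pepe", "bucatini all'amatriciana"]]
try await analyzer.setContext(context)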
Custom language models
Sometimes adding keyword weight is not enough. When that happens, you can actually build a completely custom language model and bake it directly into the recognizer.
You build this custom model using a dedicated Swift result builder. You feed it a specific locale and version string, and then you just dump hundreds of weighted phrase counts into it.
let customData = SFCustomLanguageModelData(
    locale: Locale(identifier: "en-US"),
    identifier: "com.yourapp.domain",
    version: "1.0"
) {
    SFCustomLanguageModelData.PhraseCount(phrase: "dark mode toggle", count: 100)
    SFCustomLanguageModelData.PhraseCount(phrase: "increase font size", count: 80)
    SFCustomLanguageModelData.PhraseCount(phrase: "open settings", count: 120)
}

let exportURL = URL(fileURLWithPath: "...")
try await customData.export(to: exportURL)
You export that data, wrap it in a configuration, and hand it to the transcriber. Doing this gives your custom vocabulary a massive boost without tearing down the foundational English model underneath it. It is absolutely essential for medical phrasing or highly specialized commercial applications.
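The wrapping step reuses types from the classic API. A sketch, assuming the prepareCustomLanguageModel entry point that shipped alongside SFCustomLanguageModelData:

// Point a configuration at the exported container, then let the system
// compile it for the recognizer.
let lmConfiguration = SFSpeechLanguageModel.Configuration(languageModel: exportURL)
try await SFSpeechLanguageModel.prepareCustomLanguageModel(
    for: exportURL,
    clientIdentifier: "com.yourapp.domain",
    configuration: lmConfiguration
)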
Voice analytics
The framework actually spits out raw voice analytics alongside the transcription string.
You get four distinct acoustic features mapped to every single frame of audio. The voicing feature tells you the probability that the current frame contains a human voice. The pitch feature tracks the vocal frequency. The jitter feature tracks tiny cycle-to-cycle variations in that pitch, and the shimmer feature tracks the same kind of variation in amplitude.
The clinical uses for this are obvious. But for standard apps, the use cases are pretty narrow. You could use it to measure user frustration during long sessions or distinguish between multiple speakers in a crowded room. Deciding what to do with that raw floating point data is entirely your problem. Apple gives you the signals and nothing else.
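Reading the signals looks roughly like this, assuming the per-frame SFVoiceAnalytics shape carries over; analytics stands in for whatever the result surface hands you:

// Each SFAcousticFeature holds one value per audio frame.
let voicingPerFrame = analytics.voicing.acousticFeatureValuePerFrame
let pitchPerFrame = analytics.pitch.acousticFeatureValuePerFrame
print("frames: \(voicingPerFrame.count), frame duration: \(analytics.voicing.frameDuration)s")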
The actual cost of adoption
Let's be honest. Setting up a complete SpeechAnalyzer pipeline requires significantly more boilerplate than the old recognizer ever did.
You have to handle format negotiation, model warmup, asynchronous buffers, and explicit cleanup sequences. Every single decision the old API hid behind opaque callbacks is now a mandatory manual step. That extra friction is the price you pay for predictability. The more your assumptions are encoded into the actual pipeline logic rather than discovered through trial and error, the more reliably your feature will perform across different devices.