Real-Time Transcription: What 'Instant' Actually Means
The difference between instant voice typing and almost instant is about 400 milliseconds. Here is where every one of those milliseconds goes.
TL;DR
A deep dive into voice transcription latency and response times: how streaming speech to text works, and why sub-700 millisecond transcription makes voice typing feel instant on Mac.
When people describe voice typing as fast or slow, they are usually talking about the perceived delay between speaking and seeing their words appear on screen. This perceived latency is the single most important factor in whether voice input feels magical or frustrating. At Air, we obsess over every millisecond in our transcription pipeline because we know that crossing certain thresholds completely changes the user experience.
Understanding Voice Transcription Latency
Latency in voice transcription refers to the time between when you finish speaking a word and when that word appears on your screen. This is different from processing time, which is how long the actual computation takes. Users care about latency because it affects how natural the interaction feels.
Research in human-computer interaction has identified several important thresholds for perceived responsiveness. Delays under 100 milliseconds feel instantaneous to most people. Between 100 and 300 milliseconds, users notice a slight lag but still perceive the system as responsive. Above 500 milliseconds, the delay becomes distracting. Above 1 second, users lose confidence that the system is working correctly.
Our goal with Air is to keep transcription latency under 700 milliseconds in the worst case and closer to 400 milliseconds in typical conditions. This puts us well within the range where voice input feels responsive and natural.
Breaking Down the Voice Transcription Pipeline
To understand where latency comes from, you need to understand the complete pipeline that your voice goes through from speaking to seeing text on screen. Each stage adds some delay, and optimizing each one is crucial for achieving fast perceived response times.
The first stage is audio buffering. When you speak into your microphone, the raw audio data arrives as a continuous stream. We need to collect enough audio data to send a meaningful chunk to our transcription service. If we send chunks that are too small, we waste bandwidth on overhead. If we send chunks that are too large, we add unnecessary delay. We have tuned our buffer size to approximately 85 milliseconds, which provides enough audio for accurate transcription while minimizing delay.
The second stage is network transmission. Your audio data needs to travel from your Mac to our transcription servers and the results need to come back. For a typical internet connection, this round trip takes about 50 milliseconds. We have invested in server infrastructure that minimizes geographic distance to our users and uses optimized network paths to reduce this latency as much as possible.
The third stage is transcription processing. This is where the actual speech recognition happens. Our transcription service analyzes the audio waveform, identifies speech patterns, and converts them to text. Even with highly optimized machine learning models running on powerful hardware, this step takes around 200 milliseconds per chunk of audio.
The fourth stage is formatting and display. Once we receive the transcribed text, we need to process it for display, handle punctuation and capitalization, and update the user interface. This adds approximately 100 milliseconds in typical conditions.
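Adding the typical figures together, roughly 85 milliseconds of buffering, 50 milliseconds on the network, 200 milliseconds of recognition, and 100 milliseconds of formatting and display comes to about 435 milliseconds, which is why typical end-to-end latency lands near the 400 millisecond mark described above.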
Streaming Transcription vs Batch Processing
Many voice typing applications use batch processing, which means they wait until you stop speaking and then process your entire utterance at once. This approach is simpler to implement but results in noticeably longer perceived latency. You speak an entire sentence, pause, wait, and then see all the words appear together.
We use streaming transcription instead. Rather than waiting for you to finish speaking, we continuously send small chunks of audio to our transcription service and display partial results as they arrive. You see words appearing on screen while you are still speaking.
This streaming approach requires more sophisticated engineering. We need to handle partial results that might change as more context becomes available. The word "there" might initially appear as "the" before the rest of the word has been processed. We need to gracefully update the display as the transcription refines itself without creating jarring visual jumps.
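As a rough illustration of how a display layer can absorb those refinements, the sketch below keeps a committed prefix separate from a volatile tail that may still change. The PartialTranscript type and its apply method are hypothetical names for this example, not Air's implementation.

```swift
// Illustrative sketch: keep committed (stable) text separate from a volatile tail
// that may still change as the recognizer refines its hypothesis.
struct PartialTranscript {
    private(set) var committed: String = ""   // text the recognizer has finalized
    private(set) var volatile: String = ""    // current best guess, may be revised

    // Called for every partial result from the streaming recognizer.
    mutating func apply(partial: String, isFinal: Bool) {
        if isFinal {
            // Promote the refined hypothesis into the stable prefix.
            committed += partial + " "
            volatile = ""
        } else {
            // Replace only the unstable tail, e.g. "the" -> "there",
            // so the stable prefix never jumps around on screen.
            volatile = partial
        }
    }

    var displayText: String { committed + volatile }
}

// Usage: feed partial hypotheses as they stream in.
var transcript = PartialTranscript()
transcript.apply(partial: "the", isFinal: false)
transcript.apply(partial: "there", isFinal: false)
transcript.apply(partial: "there", isFinal: true)
print(transcript.displayText)   // "there "
```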
The benefit is that perceived latency drops dramatically. Even though the final accurate transcription of a word might take 400 milliseconds from when you spoke it, you see something appearing on screen almost immediately. The visual feedback confirms that the system is listening and working, which makes the whole experience feel much more responsive.
Optimizing Every Stage of the Pipeline
We have invested significant engineering effort in optimizing each stage of our transcription pipeline. Every millisecond we save translates directly into a better user experience.
For audio buffering, we use a dynamic buffer size that adapts to network conditions. If your connection is fast and stable, we use smaller buffers for lower latency. If the network is congested, we use slightly larger buffers to reduce overhead and improve reliability.
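A policy like this can be expressed in a few lines. The sketch below picks a chunk duration from a recent round trip measurement; the thresholds and durations are illustrative assumptions, not Air's actual tuning.

```swift
// Illustrative policy only: choose a chunk duration from a recent
// round-trip-time measurement (milliseconds).
func chunkDuration(forRoundTripMs rtt: Double) -> Double {
    switch rtt {
    case ..<40.0:   return 0.060   // fast, stable link: smaller chunks, lower latency
    case ..<120.0:  return 0.085   // typical link: the ~85 ms default
    default:        return 0.150   // congested link: larger chunks, less per-chunk overhead
    }
}
```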
For network transmission, we maintain persistent connections to our transcription servers so there is no connection setup delay for each audio chunk. We use modern protocols optimized for low latency rather than high throughput. We have deployed servers in multiple geographic regions so your audio does not have to travel far.
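As an illustration of what a persistent, low latency connection might look like, here is a minimal sketch built on URLSessionWebSocketTask. The endpoint URL and message shapes are placeholders, and Air's real transport protocol may differ.

```swift
import Foundation

// Minimal sketch of a persistent streaming connection, assuming a WebSocket endpoint.
final class TranscriptionStream {
    private let task: URLSessionWebSocketTask

    init(url: URL) {
        task = URLSession.shared.webSocketTask(with: url)
        task.resume()   // connection stays open; no per-chunk handshake
    }

    // Send one audio chunk over the already-open connection.
    func send(chunk: Data) async throws {
        try await task.send(.data(chunk))
    }

    // Receive partial transcription results as they arrive.
    func nextResult() async throws -> String? {
        switch try await task.receive() {
        case .string(let text): return text
        case .data:             return nil
        @unknown default:       return nil
        }
    }

    func close() {
        task.cancel(with: .normalClosure, reason: nil)
    }
}
```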
For transcription processing, we use the latest speech recognition models optimized for speed as well as accuracy. We run multiple models in parallel and return the first result that meets our confidence threshold. We cache common words and phrases to speed up recognition.
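Racing several recognizers and taking the first sufficiently confident answer can be sketched with a task group, as below. The Recognizer protocol, Hypothesis type, and 0.9 threshold are assumptions made for illustration, not a description of Air's models.

```swift
// Illustrative sketch: run multiple recognizers in parallel and return the
// first hypothesis that clears a confidence threshold.
struct Hypothesis {
    let text: String
    let confidence: Double
}

protocol Recognizer: Sendable {
    func transcribe(_ chunk: Data) async throws -> Hypothesis
}

func firstConfidentResult(for chunk: Data,
                          from recognizers: [any Recognizer],
                          threshold: Double = 0.9) async throws -> Hypothesis? {
    return try await withThrowingTaskGroup(of: Hypothesis.self) { group -> Hypothesis? in
        for recognizer in recognizers {
            group.addTask { try await recognizer.transcribe(chunk) }
        }
        var best: Hypothesis?
        for try await hypothesis in group {
            if hypothesis.confidence >= threshold {
                group.cancelAll()   // stop the slower recognizers
                return hypothesis
            }
            if hypothesis.confidence > (best?.confidence ?? -1) {
                best = hypothesis
            }
        }
        return best   // nothing cleared the bar; return the most confident fallback
    }
}
```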
For display, we use careful interface design that updates smoothly without visual flicker or jarring changes. Partial results appear in a way that naturally flows into the final transcription.
Why Latency Matters More Than Raw Speed
You might wonder why we focus so much on latency rather than raw transcription speed. After all, modern speech recognition can process audio faster than real time. The answer is that what matters for user experience is not how fast we can process your voice, but how quickly you see feedback.
Imagine two voice typing systems. The first one processes your speech instantly but only shows results in batches after you pause. The second one has the same total processing time but streams results as you speak. The second system will feel dramatically more responsive even though the actual computation is identical.
This is why we prioritize streaming architecture over batch processing. This is why we invest in reducing latency at every stage rather than just optimizing total throughput. The goal is not to process speech as fast as possible. The goal is to make voice input feel natural and immediate.
The Psychology of Responsive Voice Input
When voice input feels instant, it changes how people use it. Users speak more naturally without pausing to check if the system is keeping up. They attempt longer utterances without worrying about the system losing their place. They integrate voice into their workflow rather than treating it as a separate mode.
When voice input feels slow, even by a few hundred milliseconds, users adapt their behavior in unproductive ways. They speak more slowly and deliberately. They pause frequently to verify the transcription. They avoid voice input for anything complex or time sensitive.
Our target of sub-700 millisecond latency is not arbitrary. It is based on research and user testing that identified the threshold where voice input stops feeling like a separate mode and starts feeling like a natural extension of your typing. Below this threshold, you can speak at your normal pace and trust that your words will appear correctly. Above this threshold, you start to second-guess the system and slow down.
Every optimization we make is in service of staying below this threshold as consistently as possible. Fast voice input is not just about convenience. It is about making voice a tool you can rely on without thinking about it.