
Voice AI in 2026: Conversation Is a Prediction Problem

The voice AI industry spent years optimising the wrong object.


For most of the last decade, the engineering frame was simple: recognise speech, pass the transcript to a language model, generate a response, synthesise the answer, and play it back. Faster recogniser. Faster model. Faster synthesiser. Reduce the gaps between the boxes and natural conversation would supposedly emerge.

It has not.

That pipeline can produce agents that sound polished, answer correctly, and still fail the moment the interaction becomes human: when the user pauses mid-thought, restarts a sentence, talks over the agent, sighs, hesitates, changes their mind, or says something that requires the agent to not respond. The failure is not merely latency. It is not merely accuracy. It is the absence of a real-time conversational control system.

Conversation is not a sequence of API calls. It is a coupled prediction loop. Each participant is listening, forecasting, preparing, inhibiting, repairing, and timing behaviour against the other person’s unfolding signal. Human conversation is not input → process → output. It is two agents continuously predicting each other at speech rate.

That is the right frame for voice AI in 2026.

The systems that now matter — full-duplex speech models such as Kyutai’s Moshi, controllable variants such as NVIDIA’s PersonaPlex, OpenAI’s current Realtime family around GPT‑Realtime‑2, and the better streaming enterprise cascades — are not important simply because they are “speech-to-speech”. They are important because they move the control loop closer to the grain of conversation. Listening, reasoning, tool use, interruption handling, and speech generation increasingly happen inside a live session rather than across a chain of batch calls.

The important distinction in 2026 is not cascade versus end-to-end. Cascades are not dead. In regulated environments, they remain useful because they produce transcripts, tool traces, component-level observability, and clearer compliance boundaries. The real distinction is batch-oriented versus streaming-native. A cascade can work if it behaves like a real-time system. A speech-native model can fail if it hides too much state and gives operators no way to inspect, govern, or recover the interaction.

The technical question is no longer: can the system talk?

It is: can the system participate?

The human benchmark leaves no room for a reactive system

Human turn-taking is brutally fast. Levinson and Torreira describe the central timing puzzle: gaps between conversational turns are often around 200 ms, while the latencies involved in planning speech are typically much longer, often above 600 ms. A purely reactive system cannot make that arithmetic work. People respond quickly because they start preparing before the other speaker has finished. They project where the turn is going, when it will end, and what kind of action it is performing.  

That fact should be treated as an architectural constraint.

If a voice agent waits for acoustic silence, finalises ASR, sends a transcript to an LLM, waits for the first token, streams text to TTS, buffers audio, and then speaks, the conversational budget has already gone. The model may still answer correctly. It will still feel late.
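To make that arithmetic concrete, here is a back-of-envelope sketch in Python, using illustrative mid-range figures rather than measurements from any particular system:

```python
# Back-of-envelope budget (ms) for a strictly reactive pipeline.
# All figures are illustrative assumptions, not vendor benchmarks.
reactive_stages_ms = {
    "silence timeout before endpointing": 500,
    "ASR finalisation": 150,
    "LLM first token": 300,
    "TTS first audio chunk": 150,
    "transport and playback buffering": 100,
}

human_gap_ms = 200  # typical inter-turn gap reported by Levinson and Torreira

total_ms = sum(reactive_stages_ms.values())
print(f"reactive response: ~{total_ms} ms vs human gap: ~{human_gap_ms} ms")
# The only way to close that gap is to start recognising, planning, and
# synthesising before the user's turn ends -- that is, to predict the turn end.
```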

But the answer is not “always respond faster”. A system that speaks before the user has yielded the floor is not perceived as intelligent. It is perceived as impatient. The target is appropriate timing: knowing when to speak, when to wait, when to backchannel, when to ask a repair question, and when silence is the correct response.

That is why “300 ms latency” is no longer an adequate claim. The useful question is what the system is doing during those 300 ms, and whether it began doing it while the user was still speaking.

Voice is the access layer for non-coders

Fast real-time conversation will be one of the largest practical improvements to LLM usability, especially for people who are not interested in coding.

Programmers already have high-leverage interfaces to models. They can express intent through precise text, scripts, APIs, diffs, tests, stack traces, and structured workflows. They are used to turning vague goals into formal instructions.

Most people do not want to work that way.

For non-coders, the barrier to using LLMs well is often not intelligence. It is interface discipline. Written prompting requires the user to pre-structure intent, provide context, anticipate constraints, evaluate output, and iterate through text. That is tolerable for expert users. It is not how most people naturally reason through ambiguous problems.

Real-time voice lowers that interface tax. It lets users think aloud, revise mid-sentence, interrupt, clarify, abandon a line of thought, and build an answer through ordinary conversational repair. The interaction becomes less like operating a tool and more like externalising thought into a responsive partner.

That matters commercially. The next wave of LLM adoption will not be driven only by developers automating code. It will be driven by people using models to reason through work they cannot or do not want to formalise: planning, tutoring, negotiation, customer support, operations, coaching, administration, decision support, and research triage.

For that group, voice is not “chat with audio”. It is the first interface that makes LLMs feel cognitively native.

Latency is behaviour, not a benchmark number

Most voice AI demos quote a single latency number. Users do not experience a single number. They experience a distribution, and the tail of that distribution is where trust dies.

The most important responsiveness metric is not time-to-first-token. It is time-to-first-audio: the interval from the end of the user’s relevant speech to the first audible frame from the agent. Even that is incomplete. A better metric is time-to-first-useful-audio: the interval until the system says something that actually advances the interaction.
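As a sketch of how those two metrics differ, assuming a simple per-turn event log (the field names are illustrative, not taken from any platform):

```python
from dataclasses import dataclass

@dataclass
class TurnEvents:
    """Timestamps in seconds on a monotonic clock, captured for one agent turn."""
    user_speech_end: float    # end of the user's relevant speech
    first_agent_audio: float  # first audible frame from the agent
    first_useful_audio: float # first audio that actually advances the interaction

def time_to_first_audio(t: TurnEvents) -> float:
    return t.first_agent_audio - t.user_speech_end

def time_to_first_useful_audio(t: TurnEvents) -> float:
    return t.first_useful_audio - t.user_speech_end

# Example: the agent fills the silence quickly but says nothing useful for ~2 s.
turn = TurnEvents(user_speech_end=10.0, first_agent_audio=10.35, first_useful_audio=12.1)
print(time_to_first_audio(turn))         # ~0.35 s -- looks fast on a dashboard
print(time_to_first_useful_audio(turn))  # ~2.1 s  -- what the user actually waited
```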

A production latency budget has many moving parts:

| Stage | Healthy target | Common source of delay |
| --- | --- | --- |
| Audio capture and packetisation | 20–40 ms | Microphone stack, browser buffering, codec frame size |
| Network transport | 20–100 ms | Region, routing, WebRTC/WebSocket/SIP path |
| Streaming ASR or speech encoding | 40–200 ms | Chunk size, lookahead, partial stability |
| Turn detection | 50–500 ms | VAD, semantic endpointing, policy threshold |
| First model action | 50–400 ms | Prefill, context length, routing, tool planning |
| Tool path | 100 ms–seconds | CRM, search, RAG, payments, cold caches |
| TTS first chunk | 40–300 ms | Model family, prosody planning, chunking |
| Playback and jitter buffer | 20–100 ms | Client buffering, packet loss, scheduling |

The model is rarely the whole story. P95 and P99 are often dominated by infrastructure: distant regions, cold GPU allocation, overloaded tool services, slow retrieval, poor cache discipline, telephony ingress, or conservative endpointing that adds dead air to every turn.

OpenAI’s own 2026 infrastructure writing makes the same point from the systems side. Its WebRTC post frames natural voice as requiring fast connection setup, low and stable media round-trip time, low jitter, and low packet loss; it also describes WebRTC as standardising the hard parts of interactive media, including NAT traversal, encrypted transport, codec negotiation, echo cancellation, and jitter buffering.  

Even codec settings matter. Opus can encode 2.5, 5, 10, 20, 40, or 60 ms frames; larger frames reduce packet overhead but increase latency and packet-loss sensitivity. The RFC explicitly notes 20 ms as a good choice for most applications. That is a low-level detail, but real-time products live or die in low-level details.  
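A rough sketch of that trade-off, assuming plain RTP over UDP and IPv4 (roughly 40 bytes of headers per packet) and ignoring FEC and DTX:

```python
# Header overhead vs added latency for the Opus frame sizes allowed by RFC 6716.
HEADER_BYTES = 40  # ~ IPv4 (20) + UDP (8) + RTP (12), no header extensions

for frame_ms in (2.5, 5, 10, 20, 40, 60):
    packets_per_second = 1000 / frame_ms
    overhead_kbps = packets_per_second * HEADER_BYTES * 8 / 1000
    # A frame must be fully captured before it can be sent, so the frame
    # duration is a hard lower bound on the latency this stage adds.
    print(f"{frame_ms:>4} ms frames: {overhead_kbps:6.1f} kbps of headers, "
          f">= {frame_ms} ms capture latency")
```

At 20 ms the header overhead is about 16 kbps; at 2.5 ms it is 128 kbps, which is why shaving frame size is an expensive way to buy latency.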

This is the first hard lesson of production voice AI: latency is not a property of a model. It is an emergent property of the whole media, inference, tool, and playback path.

Endpointing is four problems, not one

Endpointing is where many voice agents become visibly clumsy. Most teams still treat it as a single problem. It is not.

First, there is voice activity detection: is there speech energy in this frame? This is acoustic, fast, local, and necessary. It cannot determine whether the user has finished their thought.

Second, there is utterance completion: is the current phrase or clause complete? “I need to fly from Sydney to…” is incomplete even if followed by silence. “That’s all, thanks” is complete even if the pause is short.

Third, there is floor yielding: does the user want the agent to take the turn? This is not the same as finishing a sentence. People complete clauses and continue. They pause to think. They hold the floor with intonation. They trail off because they are uncertain. A system that treats every silence as permission to speak behaves like a bad IVR.

Fourth, there is endpointing policy: what should the system do given noisy evidence? A tutor should often be patient. A command-and-control interface should be brisk. A healthcare intake agent should tolerate hesitation. A collections bot may need different interruption rules. There is no universal threshold. There is only an operating point on a false-interruption versus dead-air trade-off curve.

This is why silence-threshold endpointing is a dead end. Silence is not a conversational act. It can mean completion, hesitation, breath, uncertainty, emotion, lexical retrieval, a device artefact, or a network problem.

In a streaming-native system, endpointing is not the moment inference begins. Inference may already be running. Endpointing is the moment provisional interpretation becomes speakable action.
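A minimal sketch of what treating these as four separate problems looks like, with every signal and threshold assumed for illustration rather than drawn from a production stack:

```python
from dataclasses import dataclass

@dataclass
class EndpointSignals:
    speech_energy: bool        # 1. VAD: speech energy in the current frame?
    silence_ms: float          #    duration of the current silence
    utterance_complete: float  # 2. probability the phrase or clause is complete
    floor_yielded: float       # 3. probability the user wants the agent to speak

@dataclass
class EndpointPolicy:
    # 4. policy: an operating point on the false-interruption vs dead-air curve
    min_silence_ms: float = 300.0
    completion_threshold: float = 0.7
    yield_threshold: float = 0.6

def should_take_turn(s: EndpointSignals, p: EndpointPolicy) -> bool:
    if s.speech_energy:
        return False                                # the user is still speaking
    if s.silence_ms < p.min_silence_ms:
        return False                                # too early to judge the silence
    if s.utterance_complete < p.completion_threshold:
        return False                                # "I need to fly from Sydney to..."
    return s.floor_yielded >= p.yield_threshold     # a complete clause is not a yielded floor

# Different products differ mostly in policy, not in the signals:
patient_tutor = EndpointPolicy(min_silence_ms=700, completion_threshold=0.8)
brisk_command = EndpointPolicy(min_silence_ms=200, completion_threshold=0.6, yield_threshold=0.5)
```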

The architecture market is stratifying

Voice AI will not converge on one architecture. It is stratifying into at least four useful categories.

The first is the streaming cascade. This remains the practical enterprise architecture: ASR, LLM, tools, memory, and TTS stay separable, but every stage runs incrementally. A good cascade emits partial transcripts, tracks their stability, prepares provisional responses, starts safe tool prefetch, streams model output into TTS, and cancels or revises when the user continues.

This is not glamorous, but it is often correct. Banks, insurers, hospitals, public-sector services, and contact centres need transcripts, redaction, audit trails, tool logs, human escalation, vendor substitutability, and testable policy boundaries. A monolithic speech-to-speech model may sound more natural, but it can be harder to govern.
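As a sketch of what "behaves like a real-time system" means in code, here is a hypothetical cascade control loop; asr_stream, plan_response, and speak are assumed helpers, not real library calls:

```python
import asyncio

async def streaming_cascade(asr_stream, plan_response, speak):
    """Hypothetical control loop: every stage is incremental and cancellable."""
    provisional = None
    async for partial in asr_stream:               # partial transcripts as they stabilise
        if provisional and not partial.extends_previous:
            provisional.cancel()                   # the user continued or revised: discard work
            provisional = None
        if partial.stable and provisional is None:
            # Prepare a provisional response (and safe tool prefetch) before endpointing.
            provisional = asyncio.create_task(plan_response(partial.text))
        if partial.endpoint and provisional is not None:
            response = await provisional           # endpointing: provisional becomes speakable
            await speak(response)                  # stream model output into TTS
            provisional = None
```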

The second category is the discrete codec-token model. Moshi is the canonical public example. Moshi casts spoken dialogue as speech-to-speech generation, models user and assistant speech as parallel streams, and uses audio codec tokens rather than forcing the interaction through text. Its paper reports full-duplex dialogue, no explicit speaker-turn segmentation, Inner Monologue text scaffolding, theoretical latency of 160 ms, and practical latency around 200 ms.  

That architecture matters because it makes overlap, backchannels, silence, interruption, hesitation, laughter, and prosody native. In a transcript-first system, those behaviours are either discarded or reconstructed later. In a codec-token system, they are part of the sequence.

The third category is controllable full duplex. PersonaPlex is important here. NVIDIA describes it as a real-time, full-duplex speech-to-speech model that listens and speaks at the same time, learns behaviours such as pausing, interruption, and backchannelling, and adds role and voice control. Its repository describes PersonaPlex as based on Moshi architecture and weights, with persona control through text-based role prompts and audio-based voice conditioning.  

That is the next commercial axis. Full-duplex naturalism alone is not enough. Enterprise systems need role adherence, tool discipline, brand tone, escalation rules, and safety constraints. A voice model that sounds natural but cannot be governed is a demo, not a deployment.

The fourth category is the real-time session platform. OpenAI is the clearest example. Its current Realtime model surface includes GPT‑Realtime‑2 for real-time voice interactions, GPT‑Realtime‑Translate for streaming speech-to-speech translation, and GPT‑Realtime‑Whisper for streaming transcription. OpenAI introduced GPT‑Realtime‑2 as a voice model with GPT‑5-class reasoning, alongside live translation and streaming transcription models.  

This is not merely a model taxonomy. It is a deployment taxonomy. A voice agent, a live interpreter, and a transcription system do not share the same failure modes. OpenAI’s segmentation reflects where the industry is going: real-time voice is becoming a platform layer, not a single feature.

The older question was “which model is best?” The better question is: which representation and runtime match the failure modes of this use case?

OpenAI’s current voice stack changes the reference point

OpenAI’s newest voice models deserve careful treatment here because they materially update the industry picture.

As of May 2026, GPT‑Realtime‑2 is the relevant OpenAI frontier reference for real-time voice agents. OpenAI describes the model as having GPT‑5-class reasoning for harder requests, better context handling, and more natural conversations. GPT‑Realtime‑Translate targets live speech-to-speech translation, and GPT‑Realtime‑Whisper targets streaming speech-to-text while someone is speaking.  

That matters because OpenAI is no longer presenting voice as a single generic audio mode. It is separating the workload into three surfaces:

Real-time voice agents, where speech, reasoning, tools, interruption handling, and session state interact.

Live translation, where the goal is to keep pace with a speaker across languages.

Streaming transcription, where the key trade-off is latency versus transcript quality. OpenAI’s Realtime guide explicitly notes that GPT‑Realtime‑Whisper gives developers controllable latency: lower delay settings produce earlier partial text, while higher delay settings can improve transcript quality.  

Pricing also signals the split. OpenAI’s public pricing page lists GPT‑Realtime‑2 across audio, text, and image tokens, while GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper are priced by real-time audio duration.  

That distinction is important. A full voice agent is a multimodal, stateful reasoning service. A streaming transcription or translation model is closer to media processing. Buyers will experience both as “voice AI”, but the cost structure, observability requirements, and product risks are different.

The strategic point is clear: voice AI is moving from speech interface to real-time operating layer.

Production failures are mostly unsexy

The gap between demo and deployment is usually not caused by the thing in the keynote. It is caused by echo, tools, transport, session state, and long-tail latency.

Acoustic echo cancellation is the first example. If the agent speaks while listening, the microphone will capture some of the agent’s own playback. Without AEC, the system hears itself, transcribes itself, triggers its own VAD, interrupts itself, or enters a feedback loop. This is not an edge case. It is the baseline physics of speakers and microphones.

This is why WebRTC is often the pragmatic default for browser and mobile voice agents. It brings media machinery that raw WebSocket audio does not. WebSocket can work, but then the application owns more of the audio reliability problem. SIP is necessary for telephony, but phone networks are not low-latency conversational media.

Tool calls are the second failure point. A voice agent that responds in 300 ms until it touches a CRM is not a 300 ms agent. It is a 300 ms agent with a multi-second cliff. The product question is not whether the model is fast. It is what the agent does while a tool is slow. Does it stall? Fill? Explain? Continue with safe context? Ask a useful clarifying question? Abort? Retry? Escalate?
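One way to make that concrete is to race the tool call against a filler budget, as in this sketch; the thresholds, filler text, and fallback are assumptions, not a prescribed design:

```python
import asyncio

async def call_tool_keeping_the_floor(tool_call, speak,
                                      filler_after_s=1.2, abort_after_s=8.0):
    """Hold the conversational floor while a backend tool is slow."""
    task = asyncio.ensure_future(tool_call())
    try:
        # Fast path: the tool returns before the user notices the pause.
        return await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        await speak("Give me a second, I'm pulling that up now.")  # honest filler
    try:
        remaining = abort_after_s - filler_after_s
        return await asyncio.wait_for(asyncio.shield(task), timeout=remaining)
    except asyncio.TimeoutError:
        task.cancel()
        await speak("That system is slow right now. I can keep trying, "
                    "or take your details and follow up.")
        return None  # the caller decides: retry, continue with safe context, or escalate
```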

Session economics are the third failure point. A voice interaction is not just a longer chat. It is a live service-level commitment. The system must maintain state, listen continuously, handle barge-in, preserve context, route audio, manage caches, monitor tools, and keep the user engaged while infrastructure does unpredictable things.

This is why voice economics do not reduce neatly to tokens. They include audio tokens, text tokens, cached context, media transport, telephony, retrieval, tool calls, observability, redaction, post-call analysis, escalation, and retries. The operational unit that matters is not the token or the minute. It is the successfully resolved interaction.

Long sessions fail by losing goal salience

Short demos flatter voice agents. Long calls expose them.

A five-turn voice demo can feel excellent. A twenty-minute support call often reveals a different pattern: the agent remains locally fluent but becomes globally less directed. It repeats acknowledgements. It loses the original objective. It answers an earlier version of the problem. It keeps the conversation moving while gradually losing control of the task.

The closest cognitive analogy is goal neglect: failing to enact a task requirement despite understanding it. Duncan’s work ties goal neglect to conditions where task requirements are known but fail to guide behaviour. Later formulations define it as failure to enact task requirements despite being able to report them.  

The analogy should not be overextended. LLMs are not frontal-lobe patients. But the structural resemblance is useful. The failure is often not forgetting. It is failure to keep the goal active enough to control moment-to-moment behaviour.

Long voice sessions create exactly this pressure. The conversation history grows. Recent turns compete with earlier commitments. Tool results, user corrections, policy constraints, emotional tone, and procedural steps accumulate. The model may still contain the relevant information somewhere in context, but the information no longer has enough control weight.

The answer is not simply a larger context window. Larger context can make the problem worse if it increases competition without improving salience.

Production voice agents need explicit task-state architecture: current goal, confirmed facts, unresolved slots, promises made, identity status, policy constraints, next required action, escalation criteria, and forbidden actions. In complex domains, the conversation should be event-sourced. Every user correction, consent statement, tool call, and decision should become a structured event, not just another line in a transcript.
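A minimal sketch of that case-file idea, with field and event names that are illustrative rather than any standard schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TaskState:
    """Explicit task state kept outside the transcript and rebuilt from events."""
    current_goal: str = ""
    confirmed_facts: dict[str, Any] = field(default_factory=dict)
    unresolved_slots: list[str] = field(default_factory=list)
    promises_made: list[str] = field(default_factory=list)
    identity_verified: bool = False
    next_required_action: str = ""
    forbidden_actions: set[str] = field(default_factory=set)

@dataclass
class Event:
    """Event-sourced record: corrections, consent, tool calls, decisions."""
    kind: str               # e.g. "user_correction", "consent_given", "slot_filled"
    payload: dict[str, Any]

def apply(state: TaskState, event: Event) -> TaskState:
    # Every event updates the case file; the transcript is not the source of truth.
    if event.kind == "user_correction":
        state.confirmed_facts.update(event.payload)
    elif event.kind == "slot_filled":
        slot = event.payload["slot"]
        state.confirmed_facts[slot] = event.payload["value"]
        if slot in state.unresolved_slots:
            state.unresolved_slots.remove(slot)
    elif event.kind == "consent_given":
        state.confirmed_facts["consent"] = event.payload
    return state
```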

The best long-session voice agents will behave less like chatbots with memory and more like operators with a live case file.

Evaluation has to move from components to interaction

Most voice AI evaluation is still component-centric. ASR gets word error rate. TTS gets mean opinion score. The LLM gets task accuracy or tool-call success. These are useful metrics, but they do not measure the thing users experience: a timed interaction.

A serious voice evaluation suite should measure six categories.

Latency distribution: P50, P90, P95, and P99 time-to-first-audio, time-to-first-useful-audio, tool-call latency, barge-in latency, and interruption-recovery latency.

Recognition behaviour: WER, entity error rate, speaker attribution, language-switch handling, partial hypothesis churn, and semantic reversal rate.

Turn-taking behaviour: false interruptions, missed responses, dead-air duration, backchannel appropriateness, overlap handling, and repair after cross-talk.

Task behaviour: completion rate, escalation rate, hallucinated action rate, policy-violation rate, repeated-question rate, and turns to resolution.

Long-session behaviour: goal retention, contradiction rate, context drift, duplicate actions, summarisation errors, and degradation after ten, twenty, and forty minutes.

Operational behaviour: concurrency, queue depth, regional variance, cold-start rate, AEC failure rate, packet-loss tolerance, transcript completeness, and redaction accuracy.

The research community is starting to evaluate turn-taking directly. The Talking Turns benchmark proposes evaluating whether audio foundation models can understand, predict, and perform turn-taking events, rather than merely generate fluent audio.  

That is the right direction. Spoken interaction is not just language plus sound. It is timing, interruption, silence, repair, and control.

The two metrics I would prioritise in most production programmes are barge-in latency and tool-call P99. Barge-in latency is the clearest behavioural signal that the agent is actually listening while speaking. Tool-call P99 is where most “fast model” stories collapse under real customer load.
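A sketch of how those two numbers might be computed from per-call logs, with the sample values and record fields assumed for illustration:

```python
import math

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples_ms)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

def barge_in_latency_ms(overlap_speech_start_s: float, agent_audio_stopped_s: float) -> float:
    """From the user starting to talk over the agent to the agent's audio actually stopping."""
    return (agent_audio_stopped_s - overlap_speech_start_s) * 1000

# Illustrative per-call samples; the long tail is where the story usually collapses.
tool_call_durations_ms = [180, 220, 240, 310, 450, 900, 4200]
barge_in_samples_ms = [barge_in_latency_ms(3.10, 3.22), barge_in_latency_ms(7.40, 8.05)]

print("tool-call P99:", p99(tool_call_durations_ms), "ms")   # dominated by the 4.2 s outlier
print("worst barge-in:", max(barge_in_samples_ms), "ms")
```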

The procurement questions that matter

Most buyer conversations still start with model names. That is the wrong anchor.

The useful questions are operational:

What is your P95 and P99 time-to-first-useful-audio under production load, in our users’ regions?

How do you separate VAD, utterance completion, floor-yielding, and endpointing policy?

What is barge-in latency from overlapping user speech to actual audio stop?

What happens when acoustic echo cancellation fails?

Where do ASR, LLM, TTS, turn detection, and tool services run?

What happens when a tool call takes four seconds?

How does the system behave after twenty minutes of real conversation?

What transcript, redaction, consent, and tool-call audit trail can compliance inspect?

If the system is speech-native, how are transcripts generated, verified, and reconciled against the original audio?

If the system uses custom voices or voice cloning, what consent, watermarking, and abuse controls exist?

A vendor that can answer these crisply is operating at the level this technology now requires. A vendor that returns to demo clips and generic latency claims is not.

The frame that matters

The mistake that creates brittle voice products is treating voice AI as a chatbot with a microphone. That frame makes latency look like the central problem, ASR accuracy look like the bottleneck, and TTS quality look like the differentiator.

The better frame is real-time social interface.

A voice agent is a system that participates in a temporally coordinated interaction. Content correctness is necessary, but not sufficient. The agent must manage timing, interruption, hesitation, emotion, repair, memory, tools, and task state under live acoustic and infrastructure constraints.

That is why the architecture is changing.

Full-duplex models matter because they make overlap, silence, backchannels, and interruption native.

Streaming cascades matter because enterprises need auditability and control without reverting to batch latency.

Realtime APIs matter because voice agents require stateful sessions, telephony, WebRTC, tools, and event-level orchestration.

Learned turn detectors matter because silence is not the same as conversational completion.

AEC matters because full duplex is impossible if the agent hears itself.

Structured memory matters because long calls fail when goals lose salience.

OpenAI’s 2026 voice releases are a useful marker of the direction. GPT‑Realtime‑2 pushes real-time speech towards agentic reasoning and tool use. GPT‑Realtime‑Translate treats live multilingual conversation as a first-class product surface. GPT‑Realtime‑Whisper treats low-latency transcription as a streaming primitive rather than an offline batch task.  

The deeper implication is access. Fast conversational AI gives non-coders a way to use frontier models without learning the grammar of prompts, APIs, or software workflows. When the model can keep up with natural speech, interruption, uncertainty, and revision, it stops being a tool you operate and becomes a system you can think with.

The next generation of voice AI will not be judged mainly by whether it can produce realistic speech. That problem is commoditising. It will be judged by whether it can sustain a useful, safe, well-timed interaction under real production conditions.

Conversation is a prediction problem.

The systems that win will be the ones that predict the conversation while they are still inside it.
