Full-duplex voice agents fail at the systems level before they fail at the model level. The dominant failure mode is not insufficient intelligence — it is cumulative pipeline latency that breaches the 400ms human-perceptibility threshold, combined with barge-in logic that either fires too aggressively or not at all. This article dissects the architectural decisions that determine whether a voice agent feels like a conversation or a walkie-talkie.
The implementation stack covered here: FastAPI for async WebSocket orchestration, Deepgram Nova-2 for streaming ASR, Gemini 2.0 Flash for reasoning, a standard TTS provider, and Redis for cross-instance state persistence on Google Cloud Run.
## Deconstructing the Streaming-First Voice Architecture
Streaming ASR models reduce transcription latency from 200–400ms (batch mode) to 100–200ms when partial result emission is enabled. That 100–200ms saving is not marginal: it directly determines whether the LLM reasoning phase can begin before the user finishes speaking, which is the core mechanic enabling human-like response timing.
The pipeline is a directed acyclic graph of streaming stages, not a sequential request-response chain. Audio frames flow in at 20ms intervals, VAD runs per-frame, partial transcripts feed the LLM incrementally, and TTS synthesis begins on the first coherent token cluster — not after the full LLM response is complete.
```mermaid
sequenceDiagram
    participant Client as WebRTC Client
    participant GW as FastAPI WebSocket Gateway
    participant VAD as VAD Engine
    participant STT as Deepgram Nova-2 (STT)
    participant LLM as Gemini 2.0 Flash
    participant TTS as TTS Engine
    participant Redis as Redis State Store
    Client->>GW: Audio Frame (20ms chunks)
    GW->>VAD: Forward frame for endpointing
    VAD-->>GW: Speech / Silence / Interrupt signal
    GW->>STT: Stream audio frames (WebSocket)
    STT-->>GW: Partial transcript tokens
    GW->>Redis: Load conversation context
    Redis-->>GW: Serialized session state
    GW->>LLM: Incremental prompt + context
    LLM-->>GW: Streaming token chunks
    GW->>TTS: Token clusters → synthesis request
    TTS-->>GW: Audio frame chunks (streaming)
    GW->>Client: PCM audio stream
    Note over VAD,GW: Interrupt path: VAD fires → GW flushes TTS buffer immediately
```
WebSocket connections in this architecture require explicit heartbeat signals — typically a ping/pong frame every 15–20 seconds — to prevent load balancer timeout termination on GCP and AWS. This is non-negotiable in stateless container environments; a dropped connection mid-utterance destroys session context unless Redis backup is in place.
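One way to keep the connection warm is a sibling asyncio task that fires a ping at a fixed interval and is cancelled when the socket closes. A minimal sketch, with illustrative names: the `send_ping` callable is an assumption you wire to your gateway's ping frame or an application-level JSON message.

```python
import asyncio


async def heartbeat(send_ping, interval_s: float = 15.0) -> None:
    """Invoke send_ping every interval_s seconds until cancelled.

    Run this as a sibling task of the main WebSocket receive loop and
    cancel it when the connection closes. A 15s interval stays safely
    under the 30-60s idle timeouts common on cloud load balancers.
    """
    try:
        while True:
            await asyncio.sleep(interval_s)
            await send_ping()
    except asyncio.CancelledError:
        pass  # normal shutdown path when the connection closes
```

In a FastAPI handler this would be `asyncio.create_task(heartbeat(...))` right after `websocket.accept()`, with `task.cancel()` in the disconnect path.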
## Establishing the Latency Budget for Multimodal Pipelines
Every component in the pipeline owns a slice of a 300–500ms total budget. The allocation is not arbitrary — it reflects the minimum achievable latency floor for each component class given current model architectures and network physics.
| Component | Target Budget | Minimum Achievable | Failure Mode if Exceeded |
|---|---|---|---|
| STT (Deepgram Nova-2) | 100–150ms | ~80ms (streaming) | LLM reasoning starts late; response feels sluggish |
| LLM Reasoning (Gemini 2.0 Flash) | 100–200ms | ~80ms (TTFB) | Most expensive slice; controls perceived intelligence |
| TTS Synthesis | 100–150ms | ~60ms (streaming first chunk) | Audio onset delay is perceptually salient |
| Network RTT (WebSocket) | 20–40ms | ~10ms (co-located) | Compounded across both audio directions |
| VAD + Orchestration | <15ms | ~5ms | Barge-in detection degrades below 50ms response |
| Redis State I/O | <5ms | ~1–2ms | Negligible; Redis adds <2ms per operation |
| Total Target | ≤300ms TTFB | ~250ms | Perceptible lag above 400ms; agent feels broken |
Deepgram Nova-2 achieves streaming transcription latency as low as 150ms under real-world network conditions, with partial result emission beginning within the first 100ms of voiced audio. This is possible because Nova-2's streaming API uses a persistent WebSocket connection with incremental hypothesis updates rather than waiting for endpointing confirmation.
The LLM reasoning budget is the hardest to control. Gemini 2.0 Flash is selected specifically because its Time To First Token (TTFT) is substantially lower than frontier-class models; the target is receiving the first meaningful token cluster within 100ms of sending the prompt. Engineers must configure the LLM call to stream tokens, not await a complete response. Any blocking `.generate()`-style call here blows the entire budget.
Technical Warning: The 400ms perceptibility threshold applies to audio onset, the moment the user hears the first synthesized syllable. TTFB from the TTS engine must account for audio encoding, WebSocket framing, and client-side jitter buffer playback delay. Measure end-to-end with `time.perf_counter()` at the client receive event, not at the server send event.
## Orchestrating High-Speed Bidirectional Streams
Synchronous calls to TTS or LLM APIs inside a FastAPI WebSocket handler will cause WebSocket backpressure and frame drops. The handler must operate purely on async I/O, with all blocking operations dispatched via `asyncio.TaskGroup` or `asyncio.gather` patterns.
The following WebSocket manager handles concurrent inbound audio frames, dispatches to STT, and streams synthesized audio back — all within a single async event loop:
```python
import asyncio
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
from google import genai
import redis.asyncio as aioredis

app = FastAPI()

# Initialize clients at module level to reuse connections across requests
deepgram_client = DeepgramClient(api_key="DG_API_KEY")
gemini_client = genai.Client(api_key="GEMINI_API_KEY")
redis_client = aioredis.from_url("redis://localhost:6379", decode_responses=True)


class VoiceSessionManager:
    def __init__(self, session_id: str, websocket: WebSocket):
        self.session_id = session_id
        self.ws = websocket
        self.tts_flush_event = asyncio.Event()  # signals barge-in to abort TTS stream
        self.audio_queue: asyncio.Queue = asyncio.Queue(maxsize=50)

    async def handle_transcript(self, transcript: str):
        """Fires when Deepgram emits a final transcript; starts LLM + TTS pipeline."""
        if not transcript.strip():
            return

        # Load context from Redis before every LLM call
        context_raw = await redis_client.get(f"ctx:{self.session_id}")
        history = json.loads(context_raw) if context_raw else []
        history.append({"role": "user", "content": transcript})

        # Stream Gemini 2.0 Flash tokens; do not await full completion.
        # The genai SDK expects role "user" | "model" and a "parts" list,
        # so the stored history is converted at the call site.
        response_text = ""
        token_buffer = ""
        async for chunk in await gemini_client.aio.models.generate_content_stream(
            model="gemini-2.0-flash",
            contents=[
                {"role": "user" if m["role"] == "user" else "model",
                 "parts": [{"text": m["content"]}]}
                for m in history
            ],
        ):
            if self.tts_flush_event.is_set():
                break  # barge-in detected mid-generation; abort immediately
            token = chunk.text or ""  # guard: some chunks carry no text part
            token_buffer += token
            response_text += token
            # Dispatch TTS when buffer hits a sentence boundary for natural chunking
            if any(token_buffer.rstrip().endswith(p) for p in (".", "?", "!", ",")):
                await self._synthesize_and_send(token_buffer.strip())
                token_buffer = ""

        # Flush any remaining buffer content
        if token_buffer.strip() and not self.tts_flush_event.is_set():
            await self._synthesize_and_send(token_buffer.strip())

        # Persist updated history to Redis
        history.append({"role": "assistant", "content": response_text})
        await redis_client.setex(
            f"ctx:{self.session_id}",
            3600,  # 1-hour TTL; expired sessions are evicted automatically
            json.dumps(history[-20:]),  # cap at last 20 turns to control token count
        )

    async def _synthesize_and_send(self, text: str):
        """Streams TTS audio chunks directly back over the WebSocket."""
        # Placeholder: replace with your TTS provider's async streaming call,
        # e.g., ElevenLabs, Cartesia, or Google Cloud TTS streaming API
        async for audio_chunk in tts_stream(text):
            if self.tts_flush_event.is_set():
                break  # honor barge-in signal mid-synthesis
            await self.ws.send_bytes(audio_chunk)


@app.websocket("/voice/{session_id}")
async def voice_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()
    manager = VoiceSessionManager(session_id, websocket)

    # Configure Deepgram streaming with endpointing enabled
    dg_connection = deepgram_client.listen.asynclive.v("1")
    options = LiveOptions(
        model="nova-2",
        language="en-US",
        smart_format=True,
        endpointing=300,  # ms of silence to confirm utterance end
        interim_results=True,  # emit partial results for low-latency LLM priming
    )

    # Deepgram's async client passes the connection as the first argument
    async def on_transcript(_conn, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if result.is_final:
            manager.tts_flush_event.clear()  # reset barge-in flag on new utterance
            await manager.handle_transcript(sentence)

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await dg_connection.start(options)

    try:
        async for message in websocket.iter_bytes():
            # Forward raw PCM audio frames to Deepgram without blocking
            await dg_connection.send(message)
    except WebSocketDisconnect:
        await dg_connection.finish()
```
Pro-Tip: Set `asyncio.Queue(maxsize=50)` on the audio ingestion path. A bounded queue provides natural backpressure signaling: if the STT WebSocket falls behind, frames are dropped rather than accumulated, which is preferable to building a multi-second lag buffer that destroys real-time behavior.
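The drop-on-full behavior can be made explicit rather than left to whoever hits `QueueFull`. A sketch of a non-blocking enqueue that evicts the oldest frame when the bounded queue is saturated (the function name is illustrative):

```python
import asyncio


def enqueue_frame_drop_oldest(queue: asyncio.Queue, frame: bytes) -> int:
    """Enqueue an audio frame without blocking.

    If the bounded queue is full, discard the stalest frame(s) so the
    consumer stays near real time instead of accumulating lag.
    Returns the number of frames dropped.
    """
    dropped = 0
    while True:
        try:
            queue.put_nowait(frame)
            return dropped
        except asyncio.QueueFull:
            try:
                queue.get_nowait()  # evict the oldest queued frame
                dropped += 1
            except asyncio.QueueEmpty:
                return dropped  # race: a consumer drained it concurrently
```

Dropping the oldest frame (rather than the newest) keeps the freshest audio flowing to STT, which matters more for endpointing accuracy than replaying stale speech.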
## Deterministic State Management in Stateless Cloud Environments
Google Cloud Run instances are ephemeral. Any in-process session state — conversation history, speaker diarization context, barge-in counters — evaporates when a request routes to a different instance. Redis solves this, and at <2ms per read/write operation, it adds negligible overhead to the pipeline.
```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

import redis.asyncio as aioredis

redis_client = aioredis.from_url("redis://redis-host:6379", decode_responses=True)


@dataclass
class VoiceSessionState:
    session_id: str
    conversation_history: list  # last N turns for LLM context window
    speaker_turn: str  # "user" | "agent"; tracks whose floor it is
    barge_in_count: int  # monotonically increasing; used for UX analytics
    last_activity_ts: float  # Unix timestamp; drives TTL refresh logic
    vad_sensitivity: float  # adaptive threshold, updated per session


async def load_state(session_id: str) -> Optional[VoiceSessionState]:
    """Fetch and deserialize session state; returns None for new sessions."""
    raw = await redis_client.get(f"voice:state:{session_id}")
    if raw is None:
        return None
    data = json.loads(raw)
    return VoiceSessionState(**data)


async def persist_state(state: VoiceSessionState, ttl_seconds: int = 3600) -> None:
    """Serialize and write state with TTL; <2ms overhead with co-located Redis."""
    state.last_activity_ts = time.time()  # refresh timestamp on every write
    await redis_client.setex(
        f"voice:state:{state.session_id}",
        ttl_seconds,
        json.dumps(asdict(state)),
    )


async def get_or_create_state(session_id: str) -> VoiceSessionState:
    """Idempotent state bootstrap; safe to call on every WebSocket connection."""
    existing = await load_state(session_id)
    if existing:
        return existing
    new_state = VoiceSessionState(
        session_id=session_id,
        conversation_history=[],
        speaker_turn="user",
        barge_in_count=0,
        last_activity_ts=time.time(),
        vad_sensitivity=0.5,  # start at midpoint; adapt based on session noise floor
    )
    await persist_state(new_state)
    return new_state
```
The `speaker_turn` field is the critical concurrency guard. When set to `"agent"`, the orchestrator knows the agent is mid-utterance and barge-in detection must be armed. This state is written to Redis before TTS begins and flipped back to `"user"` immediately upon a confirmed barge-in VAD signal, ensuring any Cloud Run instance handling the next audio frame reads the correct system state.
## Precision Barge-In and Turn-Taking Logic
VAD sensitivity thresholding requires 100–200ms of audio buffer to confirm speech endpointing with acceptable false-positive rates. This creates a fundamental tension: arm barge-in detection too eagerly and the agent interrupts itself on background noise; arm it too conservatively and the user must wait for the agent to finish speaking before the floor can be reclaimed.
The resolution is a two-phase VAD gate combined with immediate buffer flushing:
```mermaid
flowchart TD
    A[Inbound Audio Frame\n20ms chunk] --> B{VAD Score\n> threshold?}
    B -- No --> C[Increment silence counter]
    C --> D{Silence > endpointing\nthreshold 300ms?}
    D -- No --> A
    D -- Yes --> E[Emit utterance_end signal]
    B -- Yes --> F{Agent currently\nspeaking?}
    F -- No --> G[Normal STT forwarding]
    G --> A
    F -- Yes --> H{VAD confidence\n> 0.85 AND\nduration > 80ms?}
    H -- No --> I[Suppress: treat as\nbackground noise]
    I --> A
    H -- Yes --> J[BARGE-IN CONFIRMED]
    J --> K[Set tts_flush_event]
    J --> L[Cancel active TTS stream]
    J --> M[Clear audio playback buffer]
    J --> N[Increment barge_in_count\nin Redis state]
    K & L & M & N --> O[Reset speaker_turn\nto 'user' in Redis]
    O --> P[Resume STT forwarding\nfor new utterance]
    P --> A
```
The agent must implement immediate buffer flushing upon a confirmed VAD interrupt signal. Any audio frames already queued for TTS synthesis must be discarded, not played, to prevent the client from hearing the agent's words overlapping with the user's new utterance. This is implemented via the `tts_flush_event` asyncio `Event` in the code above: every TTS synthesis coroutine checks this flag before sending each chunk.
Technical Warning: Do not rely solely on WebRTC's built-in Discontinuous Transmission (DTX) for barge-in detection. DTX suppresses silence at the codec level but does not emit application-layer signals your orchestrator can act on. Implement VAD as an independent server-side process that runs on the decoded PCM stream, not on the RTP/SRTP layer.
Turn-taking correctness also depends on tracking who holds the conversational floor via the `speaker_turn` Redis field. A barge-in event is only valid when `speaker_turn == "agent"`. When the agent is silent and the user speaks, that is normal turn-taking, not an interrupt, and should not trigger the flush pipeline.
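The two-phase gate in the flowchart reduces to a few lines of per-frame state. A sketch, assuming 20ms frames and the thresholds shown above (0.85 confidence, more than 80ms sustained speech); the class name is illustrative:

```python
from dataclasses import dataclass


@dataclass
class BargeInGate:
    """Two-phase barge-in gate.

    High-confidence speech must persist for more than min_ms while the
    agent holds the floor before a barge-in is confirmed; user speech
    while the agent is silent is normal turn-taking and never fires.
    """
    frame_ms: int = 20
    confidence_threshold: float = 0.85
    min_ms: int = 80
    _voiced_ms: int = 0

    def update(self, vad_confidence: float, agent_speaking: bool) -> bool:
        """Feed one frame's VAD score; returns True on confirmed barge-in."""
        if not agent_speaking:
            self._voiced_ms = 0
            return False  # agent silent: forward to STT, no interrupt
        if vad_confidence > self.confidence_threshold:
            self._voiced_ms += self.frame_ms
        else:
            self._voiced_ms = 0  # noise blip below threshold resets the run
        return self._voiced_ms > self.min_ms
```

With 20ms frames, five consecutive high-confidence frames (100ms) confirm the interrupt; any sub-threshold frame resets the accumulator, which is what suppresses keyboard clicks and background chatter.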
## Mitigating Audio Echo and Jitter in Full-Duplex Loops
Full-duplex audio loops create acoustic echo because the microphone captures speaker output. Software-based Acoustic Echo Cancellation (AEC) — available natively in WebRTC's audio processing stack — handles this at the client, but server-side echo suppression is required for environments where raw PCM is streamed from non-browser clients (embedded devices, SIP gateways).
Server-side AEC operates by maintaining a reference signal of what the agent is playing and subtracting it from the incoming microphone stream using an adaptive filter. The computational overhead of a standard NLMS (Normalized Least Mean Squares) filter on 20ms frames at 16kHz is under 1ms on modern hardware — effectively free relative to the latency budget.
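For reference, the NLMS update is only a few lines over a tap-delay vector of the reference signal. A NumPy sketch; the filter length, step size `mu`, and regularizer `eps` are illustrative, and a production AEC additionally needs delay estimation and double-talk detection, which this omits:

```python
import numpy as np


def nlms_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 64,
                mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Subtract an adaptively filtered copy of the reference signal
    (what the agent is playing) from the mic signal.

    Returns the error signal, i.e., the echo-reduced microphone audio.
    """
    w = np.zeros(taps)                 # adaptive FIR estimate of the echo path
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = ref[max(0, n - taps + 1): n + 1][::-1]  # newest sample first
        x = np.pad(x, (0, taps - len(x)))           # zero-pad during warm-up
        y = w @ x                                    # current echo estimate
        e = mic[n] - y                               # residual after cancellation
        w += (mu / (x @ x + eps)) * e * x            # normalized LMS update
        out[n] = e
    return out
```

The normalization by the input power `x @ x` is what keeps the step size stable across loud and quiet passages, which is why NLMS (rather than plain LMS) is the standard choice for speech.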
Jitter is the more insidious problem. Packet arrival variance in WebSocket-over-TCP streams causes discontinuous audio rendering. The adaptive jitter buffer target delay is calculated as:
$$J = \bar{T} + K \cdot \sigma_T$$
Where:

- $J$ = target jitter buffer size (in milliseconds)
- $\bar{T}$ = mean packet inter-arrival time over a sliding window (typically the last 50 packets)
- $\sigma_T$ = standard deviation of packet inter-arrival times
- $K$ = scaling factor (typically 2–3); higher values trade latency for stability
Set $K = 2$ for co-located datacenter deployments. Set $K = 3$ for consumer internet conditions. The buffer must be dynamically resized every 500ms based on updated $\bar{T}$ and $\sigma_T$ — a static jitter buffer sized for worst-case conditions adds 40–80ms of unnecessary fixed latency.
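The formula maps directly onto a sliding-window estimator that the playback path polls every 500ms. A sketch using the definitions above (the class name is illustrative; `pstdev` supplies $\sigma_T$):

```python
import statistics
from collections import deque
from typing import Deque, Optional


class JitterBufferSizer:
    """Sliding-window estimator for the target jitter buffer delay,
    J = mean(T) + K * stddev(T), over packet inter-arrival times (ms)."""

    def __init__(self, k: float = 2.0, window: int = 50):
        self.k = k  # 2 for co-located datacenter, 3 for consumer internet
        self.intervals_ms: Deque[float] = deque(maxlen=window)
        self._last_arrival_ms: Optional[float] = None

    def on_packet(self, arrival_ms: float) -> None:
        """Record one packet arrival timestamp in milliseconds."""
        if self._last_arrival_ms is not None:
            self.intervals_ms.append(arrival_ms - self._last_arrival_ms)
        self._last_arrival_ms = arrival_ms

    def target_ms(self) -> float:
        """Current target buffer delay; poll every ~500ms to resize."""
        if len(self.intervals_ms) < 2:
            return 0.0  # not enough samples to estimate variance yet
        mean = statistics.fmean(self.intervals_ms)
        sigma = statistics.pstdev(self.intervals_ms)
        return mean + self.k * sigma
```

With perfectly regular 20ms arrivals the target collapses to the frame interval itself; any arrival variance widens it proportionally, which is exactly the adaptive behavior a static buffer cannot provide.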
Pro-Tip: Run jitter buffer calculations on the client side (browser or native app) using the WebRTC `RTCRtpReceiver.getStats()` API. The `jitterBufferDelay` and `jitterBufferEmittedCount` fields provide the raw data needed to compute $\sigma_T$ without custom instrumentation.
## Scaling the Intelligent Edge
Gemini 2.0 Flash demonstrated measurably improved multimodal reasoning throughput compared to Gemini 1.5 Pro in developer benchmarks, with lower TTFT enabling the sub-100ms reasoning budget target. Production voice agents on Kubernetes or Cloud Run must also account for per-user WebSocket overhead: each persistent connection consumes a file descriptor, an async task, and a Redis connection pool slot.
The following benchmark harness measures end-to-end pipeline latency across all components under concurrent load:
```python
import asyncio
import json
import statistics
import time
from dataclasses import dataclass, field
from typing import List

import websockets


@dataclass
class LatencyRecord:
    session_id: str
    stt_ms: float = 0.0
    llm_ttft_ms: float = 0.0
    tts_first_chunk_ms: float = 0.0
    e2e_ms: float = 0.0


@dataclass
class BenchmarkReport:
    records: List[LatencyRecord] = field(default_factory=list)

    def summary(self) -> dict:
        e2e = [r.e2e_ms for r in self.records]
        stt = [r.stt_ms for r in self.records]
        llm = [r.llm_ttft_ms for r in self.records]
        tts = [r.tts_first_chunk_ms for r in self.records]
        return {
            "p50_e2e_ms": statistics.median(e2e),
            "p95_e2e_ms": statistics.quantiles(e2e, n=20)[18],  # 95th percentile
            "p99_e2e_ms": statistics.quantiles(e2e, n=100)[98],  # 99th percentile
            "avg_stt_ms": statistics.mean(stt),
            "avg_llm_ttft_ms": statistics.mean(llm),
            "avg_tts_first_chunk_ms": statistics.mean(tts),
        }


async def run_single_probe(
    endpoint: str, session_id: str, audio_payload: bytes
) -> LatencyRecord:
    """Sends a single audio payload and records per-component timing."""
    record = LatencyRecord(session_id=session_id)
    t_start = time.perf_counter()
    async with websockets.connect(endpoint) as ws:
        await ws.send(audio_payload)
        stt_received = False
        tts_received = False
        async for message in ws:
            elapsed_ms = (time.perf_counter() - t_start) * 1000
            if isinstance(message, str):
                data = json.loads(message)
                if data.get("type") == "transcript_final" and not stt_received:
                    record.stt_ms = elapsed_ms
                    stt_received = True
                elif data.get("type") == "llm_token_first" and stt_received:
                    record.llm_ttft_ms = elapsed_ms - record.stt_ms
            elif isinstance(message, bytes) and not tts_received:
                # First binary frame = first TTS audio chunk
                record.tts_first_chunk_ms = (
                    elapsed_ms - record.stt_ms - record.llm_ttft_ms
                )
                record.e2e_ms = elapsed_ms
                tts_received = True
                break  # probe complete; close connection
    return record


async def run_benchmark(
    endpoint: str, concurrency: int, iterations: int, audio_payload: bytes
) -> BenchmarkReport:
    report = BenchmarkReport()
    semaphore = asyncio.Semaphore(concurrency)  # cap parallel WebSocket connections

    async def probe_with_limit(i: int) -> None:
        async with semaphore:
            record = await run_single_probe(endpoint, f"bench-{i}", audio_payload)
            report.records.append(record)

    await asyncio.gather(*[probe_with_limit(i) for i in range(iterations)])
    return report


# Entry point: run with concurrency=50, 200 total probes
if __name__ == "__main__":
    with open("test_utterance_16khz_pcm.raw", "rb") as f:
        AUDIO = f.read()
    report = asyncio.run(
        run_benchmark("wss://your-agent.example.com/voice/test", 50, 200, AUDIO)
    )
    print(report.summary())
```
API rate limits for Gemini Flash are enforced per project, typically on the order of 1,000–2,000 RPM depending on tier. At 50 concurrent sessions with an average 3-second turn duration, that translates to approximately 1,000 RPM, right at the limit boundary. Engineers must implement exponential backoff with jitter on Gemini API calls and design the orchestrator to gracefully degrade (e.g., return a filler phrase via TTS while retrying) rather than surface a raw 429 error to the voice session.
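A sketch of full-jitter backoff with a hook for the filler-phrase fallback; the `RateLimitError` stand-in and `on_wait` hook are assumptions you would map to your SDK's 429 exception type and a TTS filler call:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for the SDK's 429 / RESOURCE_EXHAUSTED error type."""


async def call_with_backoff(call, *, max_retries: int = 4,
                            base_delay_s: float = 0.25, on_wait=None):
    """Retry an async call on rate-limit errors with full-jitter backoff.

    on_wait(attempt), if provided, is awaited before each retry sleep;
    it is the hook where a filler phrase can be sent to TTS so the user
    hears something other than dead air while the retry is pending.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface to caller
            if on_wait is not None:
                await on_wait(attempt)
            # Full jitter: sleep uniformly in [0, base * 2^attempt]
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

Full jitter (uniform over the whole backoff window, rather than a fixed delay plus small jitter) is the variant that best de-correlates retry storms across many concurrent sessions hitting the same quota.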
## Future-Proofing Voice Agent UX
Human-like voice interaction is a systems property, not a model property. The agent's perceived naturalness is determined by the consistency of sub-400ms response onset, the accuracy of barge-in detection across noisy environments, and the coherence of context across sessions — none of which are solved by a better LLM alone.
The ASR and TTS providers must be hot-swappable via standardized WebSocket interfaces. Deepgram Nova-2 today, a self-hosted Whisper variant tomorrow — the orchestration layer should not require rewriting when provider economics shift. Define a protocol-level contract for what a transcript_final event looks like, and enforce it via an adapter pattern in the gateway.
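One way to pin down that contract is a `Protocol` plus per-vendor adapters. A sketch: the `TranscriptEvent` fields and the `EchoSTTAdapter` toy are illustrative, and a real `DeepgramAdapter` or Whisper adapter would wrap the vendor SDK behind the same two methods.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Protocol, runtime_checkable


@dataclass(frozen=True)
class TranscriptEvent:
    """Protocol-level contract: every STT adapter emits exactly this shape."""
    text: str
    is_final: bool
    confidence: float


@runtime_checkable
class STTAdapter(Protocol):
    """The orchestrator depends only on this interface, never on an SDK."""

    async def send_audio(self, frame: bytes) -> None: ...

    def events(self) -> AsyncIterator[TranscriptEvent]: ...


class EchoSTTAdapter:
    """Toy adapter that 'transcribes' frames by decoding them as UTF-8.

    A DeepgramAdapter or WhisperAdapter would buffer frames into the
    vendor SDK here and translate its callbacks into TranscriptEvents.
    """

    def __init__(self) -> None:
        self._frames: list = []

    async def send_audio(self, frame: bytes) -> None:
        self._frames.append(frame)

    async def events(self) -> AsyncIterator[TranscriptEvent]:
        for frame in self._frames:
            yield TranscriptEvent(text=frame.decode(), is_final=True,
                                  confidence=1.0)
```

Swapping providers then means writing one new adapter class; the gateway's orchestration loop, which consumes only `TranscriptEvent`s, never changes.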
Architectural Bottleneck Evaluation Checklist:
- [ ] **STT Latency:** Are partial/interim results enabled? Is the Deepgram WebSocket connection persistent (not re-established per utterance)?
- [ ] **LLM Streaming:** Is the Gemini call using async streaming (`generate_content_stream`)? Is TTFT measured independently from full response completion?
- [ ] **TTS Chunking:** Does TTS synthesis start on sentence-boundary token clusters, not on full LLM response completion?
- [ ] **Barge-In Correctness:** Is `tts_flush_event` checked in every TTS send loop iteration? Is `speaker_turn` persisted to Redis before the first TTS audio byte is sent?
- [ ] **Redis Latency:** Is Redis co-located in the same region as Cloud Run? Are connection pools pre-warmed (not opened per request)?
- [ ] **Jitter Buffer:** Is the jitter buffer target recalculated every 500ms using a sliding window? Is $K$ configurable per deployment environment?
- [ ] **Echo Cancellation:** Is AEC running client-side via WebRTC for browser clients? Is server-side NLMS AEC implemented for non-browser ingestion paths?
- [ ] **Rate Limit Headroom:** Is Gemini Flash RPM utilization monitored? Is exponential backoff with filler-phrase fallback implemented?
- [ ] **Provider Decoupling:** Can the STT or TTS provider be swapped by changing an adapter class without modifying orchestration logic?
- [ ] **Load Testing:** Have p95 and p99 end-to-end latency been measured under target concurrency using a harness similar to the benchmark above?
The systems that will define conversational AI infrastructure over the next two years are not the ones with the most capable models — they are the ones that have solved the plumbing. Sub-300ms TTFB is achievable today with the stack described here. The ceiling is now UX nuance: prosody-aware TTS, emotionally adaptive response pacing, and turn-taking models trained on real conversation dynamics rather than text corpora.
Keywords: Voice Activity Detection (VAD), Time To First Byte (TTFB), WebRTC, WebSocket-based orchestration, Echo suppression, Jitter buffering, Deepgram Nova-2, Gemini 2.0 Flash SDK, LangGraph, Endpointing logic, Deterministic state machine, Asynchronous I/O, FastAPI, Redis