Full-duplex voice agents fail at the systems level before they fail at the model level. The dominant failure mode is not insufficient intelligence — it is cumulative pipeline latency that breaches the 400ms human-perceptibility threshold, combined with barge-in logic that either fires too aggressively or not at all. This article dissects the architectural decisions that determine whether a voice agent feels like a conversation or a walkie-talkie.
The implementation stack covered here: FastAPI for async WebSocket orchestration, Deepgram Nova-2 for streaming ASR, Gemini 2.0 Flash for reasoning, a standard TTS provider, and Redis for cross-instance state persistence on Google Cloud Run.
## Deconstructing the Streaming-First Voice Architecture
Streaming ASR models reduce transcription latency from 200–400ms (batch mode) to 100–200ms when partial result emission is enabled. That 100–200ms saving is not marginal: it directly determines whether the LLM reasoning phase can begin before the user finishes speaking, which is the core mechanic enabling human-like response timing.
The pipeline is a directed acyclic graph of streaming stages, not a sequential request-response chain. Audio frames flow in at 20ms intervals, VAD runs per-frame, partial transcripts feed the LLM incrementally, and TTS synthesis begins on the first coherent token cluster — not after the full LLM response is complete.
```mermaid
sequenceDiagram
    participant Client as WebRTC Client
    participant GW as FastAPI WebSocket Gateway
    participant VAD as VAD Engine
    participant STT as Deepgram Nova-2 (STT)
    participant LLM as Gemini 2.0 Flash
    participant TTS as TTS Engine
    participant Redis as Redis State Store
    Client->>GW: Audio Frame (20ms chunks)
    GW->>VAD: Forward frame for endpointing
    VAD-->>GW: Speech / Silence / Interrupt signal
    GW->>STT: Stream audio frames (WebSocket)
    STT-->>GW: Partial transcript tokens
    GW->>Redis: Load conversation context
    Redis-->>GW: Serialized session state
    GW->>LLM: Incremental prompt + context
    LLM-->>GW: Streaming token chunks
    GW->>TTS: Token clusters → synthesis request
    TTS-->>GW: Audio frame chunks (streaming)
    GW->>Client: PCM audio stream
    Note over VAD,GW: Interrupt path: VAD fires → GW flushes TTS buffer immediately
```
WebSocket connections in this architecture require explicit heartbeat signals — typically a ping/pong frame every 15–20 seconds — to prevent load balancer timeout termination on GCP and AWS. This is non-negotiable in stateless container environments; a dropped connection mid-utterance destroys session context unless Redis backup is in place.
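One way to keep the connection warm is a sibling asyncio task that fires a ping at a fixed interval and is cancelled when the socket closes. A minimal sketch, with illustrative names: the `send_ping` callable is an assumption you wire to your gateway's ping frame or an application-level JSON message.

```python
import asyncio


async def heartbeat(send_ping, interval_s: float = 15.0) -> None:
    """Invoke send_ping every interval_s seconds until cancelled.

    Run this as a sibling task of the main WebSocket receive loop and
    cancel it when the connection closes. A 15s interval stays safely
    under the 30-60s idle timeouts common on cloud load balancers.
    """
    try:
        while True:
            await asyncio.sleep(interval_s)
            await send_ping()
    except asyncio.CancelledError:
        pass  # normal shutdown path when the connection closes
```

In a FastAPI handler this would be `asyncio.create_task(heartbeat(...))` right after `websocket.accept()`, with `task.cancel()` in the disconnect path.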
## Establishing the Latency Budget for Multimodal Pipelines
Every component in the pipeline owns a slice of a 300–500ms total budget. The allocation is not arbitrary — it reflects the minimum achievable latency floor for each component class given current model architectures and network physics.
| Component | Target Budget | Minimum Achievable | Failure Mode if Exceeded |
|---|---|---|---|
| STT (Deepgram Nova-2) | 100–150ms | ~80ms (streaming) | LLM reasoning starts late; response feels sluggish |
| LLM Reasoning (Gemini 2.0 Flash) | 100–200ms | ~80ms (TTFB) | Most expensive slice; controls perceived intelligence |
| TTS Synthesis | 100–150ms | ~60ms (streaming first chunk) | Audio onset delay is perceptually salient |
| Network RTT (WebSocket) | 20–40ms | ~10ms (co-located) | Compounded across both audio directions |
| VAD + Orchestration | <15ms | ~5ms | Barge-in detection degrades below 50ms response |
| Redis State I/O | <5ms | ~1–2ms | Negligible; Redis adds <2ms per operation |
| Total Target | ≤300ms TTFB | ~250ms | Perceptible lag above 400ms; agent feels broken |
Deepgram Nova-2 achieves streaming transcription latency as low as 150ms under real-world network conditions, with partial result emission beginning within the first 100ms of voiced audio. This is possible because Nova-2's streaming API uses a persistent WebSocket connection with incremental hypothesis updates rather than waiting for endpointing confirmation.
The LLM reasoning budget is the hardest to control. Gemini 2.0 Flash is selected specifically because its Time To First Token (TTFT) is substantially lower than frontier-class models; the target is receiving the first meaningful token cluster within 100ms of sending the prompt. Engineers must configure the LLM call to stream tokens, not await a complete response. Any blocking `.generate()`-style call here blows the entire budget.
Technical Warning: The 400ms perceptibility threshold applies to audio onset, the moment the user hears the first synthesized syllable. TTFB from the TTS engine must account for audio encoding, WebSocket framing, and client-side jitter buffer playback delay. Measure end-to-end with `time.perf_counter()` at the client receive event, not at the server send event.
## Orchestrating High-Speed Bidirectional Streams
Synchronous calls to TTS or LLM APIs inside a FastAPI WebSocket handler will cause WebSocket backpressure and frame drops. The handler must operate purely on async I/O, with all blocking operations dispatched via `asyncio.TaskGroup` or `asyncio.gather` patterns.
The following WebSocket manager handles concurrent inbound audio frames, dispatches to STT, and streams synthesized audio back — all within a single async event loop:
```python
import asyncio
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
from google import genai
import redis.asyncio as aioredis

app = FastAPI()

# Initialize clients at module level to reuse connections across requests
deepgram_client = DeepgramClient(api_key="DG_API_KEY")
gemini_client = genai.Client(api_key="GEMINI_API_KEY")
redis_client = aioredis.from_url("redis://localhost:6379", decode_responses=True)


class VoiceSessionManager:
    def __init__(self, session_id: str, websocket: WebSocket):
        self.session_id = session_id
        self.ws = websocket
        self.tts_flush_event = asyncio.Event()  # signals barge-in to abort TTS stream
        self.audio_queue: asyncio.Queue = asyncio.Queue(maxsize=50)

    async def handle_transcript(self, transcript: str):
        """Fires when Deepgram emits a final transcript; starts LLM + TTS pipeline."""
        if not transcript.strip():
            return

        # Load context from Redis before every LLM call
        context_raw = await redis_client.get(f"ctx:{self.session_id}")
        history = json.loads(context_raw) if context_raw else []
        history.append({"role": "user", "content": transcript})

        # Stream Gemini 2.0 Flash tokens; do not await full completion.
        # The genai SDK expects role "user" | "model" and a "parts" list,
        # so the stored history is converted at the call site.
        response_text = ""
        token_buffer = ""
        async for chunk in await gemini_client.aio.models.generate_content_stream(
            model="gemini-2.0-flash",
            contents=[
                {"role": "user" if m["role"] == "user" else "model",
                 "parts": [{"text": m["content"]}]}
                for m in history
            ],
        ):
            if self.tts_flush_event.is_set():
                break  # barge-in detected mid-generation; abort immediately
            token = chunk.text or ""  # guard: some chunks carry no text part
            token_buffer += token
            response_text += token
            # Dispatch TTS when buffer hits a sentence boundary for natural chunking
            if any(token_buffer.rstrip().endswith(p) for p in (".", "?", "!", ",")):
                await self._synthesize_and_send(token_buffer.strip())
                token_buffer = ""

        # Flush any remaining buffer content
        if token_buffer.strip() and not self.tts_flush_event.is_set():
            await self._synthesize_and_send(token_buffer.strip())

        # Persist updated history to Redis
        history.append({"role": "assistant", "content": response_text})
        await redis_client.setex(
            f"ctx:{self.session_id}",
            3600,  # 1-hour TTL; expired sessions are evicted automatically
            json.dumps(history[-20:]),  # cap at last 20 turns to control token count
        )

    async def _synthesize_and_send(self, text: str):
        """Streams TTS audio chunks directly back over the WebSocket."""
        # Placeholder: replace with your TTS provider's async streaming call,
        # e.g., ElevenLabs, Cartesia, or Google Cloud TTS streaming API
        async for audio_chunk in tts_stream(text):
            if self.tts_flush_event.is_set():
                break  # honor barge-in signal mid-synthesis
            await self.ws.send_bytes(audio_chunk)


@app.websocket("/voice/{session_id}")
async def voice_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()
    manager = VoiceSessionManager(session_id, websocket)

    # Configure Deepgram streaming with endpointing enabled
    dg_connection = deepgram_client.listen.asynclive.v("1")
    options = LiveOptions(
        model="nova-2",
        language="en-US",
        smart_format=True,
        endpointing=300,  # ms of silence to confirm utterance end
        interim_results=True,  # emit partial results for low-latency LLM priming
    )

    # Deepgram's async client passes the connection as the first argument
    async def on_transcript(_conn, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if result.is_final:
            manager.tts_flush_event.clear()  # reset barge-in flag on new utterance
            await manager.handle_transcript(sentence)

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await dg_connection.start(options)

    try:
        async for message in websocket.iter_bytes():
            # Forward raw PCM audio frames to Deepgram without blocking
            await dg_connection.send(message)
    except WebSocketDisconnect:
        await dg_connection.finish()
```
Pro-Tip: Set `asyncio.Queue(maxsize=50)` on the audio ingestion path. A bounded queue provides natural backpressure signaling: if the STT WebSocket falls behind, frames are dropped rather than accumulated, which is preferable to building a multi-second lag buffer that destroys real-time behavior.
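The drop-on-full behavior can be made explicit rather than left to whoever hits `QueueFull`. A sketch of a non-blocking enqueue that evicts the oldest frame when the bounded queue is saturated (the function name is illustrative):

```python
import asyncio


def enqueue_frame_drop_oldest(queue: asyncio.Queue, frame: bytes) -> int:
    """Enqueue an audio frame without blocking.

    If the bounded queue is full, discard the stalest frame(s) so the
    consumer stays near real time instead of accumulating lag.
    Returns the number of frames dropped.
    """
    dropped = 0
    while True:
        try:
            queue.put_nowait(frame)
            return dropped
        except asyncio.QueueFull:
            try:
                queue.get_nowait()  # evict the oldest queued frame
                dropped += 1
            except asyncio.QueueEmpty:
                return dropped  # race: a consumer drained it concurrently
```

Dropping the oldest frame (rather than the newest) keeps the freshest audio flowing to STT, which matters more for endpointing accuracy than replaying stale speech.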
## Deterministic State Management in Stateless Cloud Environments
Google Cloud Run instances are ephemeral. Any in-process session state — conversation history, speaker diarization context, barge-in counters — evaporates when a request routes to a different instance. Redis solves this, and at <2ms per read/write operation, it adds negligible overhead to the pipeline.
```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

import redis.asyncio as aioredis

redis_client = aioredis.from_url("redis://redis-host:6379", decode_responses=True)


@dataclass
class VoiceSessionState:
    session_id: str
    conversation_history: list  # last N turns for LLM context window
    speaker_turn: str  # "user" | "agent"; tracks whose floor it is
    barge_in_count: int  # monotonically increasing; used for UX analytics
    last_activity_ts: float  # Unix timestamp; drives TTL refresh logic
    vad_sensitivity: float  # adaptive threshold, updated per session


async def load_state(session_id: str) -> Optional[VoiceSessionState]:
    """Fetch and deserialize session state; returns None for new sessions."""
    raw = await redis_client.get(f"voice:state:{session_id}")
    if raw is None:
        return None
    data = json.loads(raw)
    return VoiceSessionState(**data)


async def persist_state(state: VoiceSessionState, ttl_seconds: int = 3600) -> None:
    """Serialize and write state with TTL; <2ms overhead with co-located Redis."""
    state.last_activity_ts = time.time()  # refresh timestamp on every write
    await redis_client.setex(
        f"voice:state:{state.session_id}",
        ttl_seconds,
        json.dumps(asdict(state)),
    )


async def get_or_create_state(session_id: str) -> VoiceSessionState:
    """Idempotent state bootstrap; safe to call on every WebSocket connection."""
    existing = await load_state(session_id)
    if existing:
        return existing
    new_state = VoiceSessionState(
        session_id=session_id,
        conversation_history=[],
        speaker_turn="user",
        barge_in_count=0,
        last_activity_ts=time.time(),
        vad_sensitivity=0.5,  # start at midpoint; adapt based on session noise floor
    )
    await persist_state(new_state)
    return new_state
```
The `speaker_turn` field is the critical concurrency guard. When set to `"agent"`, the orchestrator knows the agent is mid-utterance and barge-in detection must be armed. This state is written to Redis before TTS begins and flipped back to `"user"` immediately upon a confirmed barge-in VAD signal, ensuring any Cloud Run instance handling the next audio frame reads the correct system state.
## Precision Barge-In and Turn-Taking Logic
VAD sensitivity thresholding requires 100–200ms of audio buffer to confirm speech endpointing with acceptable false-positive rates. This creates a fundamental tension: arm barge-in detection too eagerly and the agent interrupts itself on background noise; arm it too conservatively and the user must wait for the agent to finish speaking before the floor can be reclaimed.
The resolution is a two-phase VAD gate combined with immediate buffer flushing:
```mermaid
flowchart TD
    A[Inbound Audio Frame\n20ms chunk] --> B{VAD Score\n> threshold?}
    B -- No --> C[Increment silence counter]
    C --> D{Silence > endpointing\nthreshold 300ms?}
    D -- No --> A
    D -- Yes --> E[Emit utterance_end signal]
    B -- Yes --> F{Agent currently\nspeaking?}
    F -- No --> G[Normal STT forwarding]
    G --> A
    F -- Yes --> H{VAD confidence\n> 0.85 AND\nduration > 80ms?}
    H -- No --> I[Suppress: treat as\nbackground noise]
    I --> A
    H -- Yes --> J[BARGE-IN CONFIRMED]
    J --> K[Set tts_flush_event]
    J --> L[Cancel active TTS stream]
    J --> M[Clear audio playback buffer]
    J --> N[Increment barge_in_count\nin Redis state]
    K & L & M & N --> O[Reset speaker_turn\nto 'user' in Redis]
    O --> P[Resume STT forwarding\nfor new utterance]
    P --> A
```
The agent must implement immediate buffer flushing upon a confirmed VAD interrupt signal. Any audio frames already queued for TTS synthesis must be discarded, not played, to prevent the client from hearing the agent's words overlapping with the user's new utterance. This is implemented via the `tts_flush_event` asyncio `Event` in the code above: every TTS synthesis coroutine checks this flag before sending each chunk.
Technical Warning: Do not rely solely on WebRTC's built-in Discontinuous Transmission (DTX) for barge-in detection. DTX suppresses silence at the codec level but does not emit application-layer signals your orchestrator can act on. Implement VAD as an independent server-side process that runs on the decoded PCM stream, not on the RTP/SRTP layer.
Turn-taking correctness also depends on tracking who holds the conversational floor via the `speaker_turn` Redis field. A barge-in event is only valid when `speaker_turn == "agent"`. When the agent is silent and the user speaks, that is normal turn-taking, not an interrupt, and should not trigger the flush pipeline.
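The two-phase gate in the flowchart reduces to a few lines of per-frame state. A sketch, assuming 20ms frames and the thresholds shown above (0.85 confidence, more than 80ms sustained speech); the class name is illustrative:

```python
from dataclasses import dataclass


@dataclass
class BargeInGate:
    """Two-phase barge-in gate.

    High-confidence speech must persist for more than min_ms while the
    agent holds the floor before a barge-in is confirmed; user speech
    while the agent is silent is normal turn-taking and never fires.
    """
    frame_ms: int = 20
    confidence_threshold: float = 0.85
    min_ms: int = 80
    _voiced_ms: int = 0

    def update(self, vad_confidence: float, agent_speaking: bool) -> bool:
        """Feed one frame's VAD score; returns True on confirmed barge-in."""
        if not agent_speaking:
            self._voiced_ms = 0
            return False  # agent silent: forward to STT, no interrupt
        if vad_confidence > self.confidence_threshold:
            self._voiced_ms += self.frame_ms
        else:
            self._voiced_ms = 0  # noise blip below threshold resets the run
        return self._voiced_ms > self.min_ms
```

With 20ms frames, five consecutive high-confidence frames (100ms) confirm the interrupt; any sub-threshold frame resets the accumulator, which is what suppresses keyboard clicks and background chatter.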
## Mitigating Audio Echo and Jitter in Full-Duplex Loops
Full-duplex audio loops create acoustic echo because the microphone captures speaker output. Software-based Acoustic Echo Cancellation (AEC) — available natively in WebRTC's audio processing stack — handles this at the client, but server-side echo suppression is required for environments where raw PCM is streamed from non-browser clients (embedded devices, SIP gateways).
Server-side AEC operates by maintaining a reference signal of what the agent is playing and subtracting it from the incoming microphone stream using an adaptive filter. The computational overhead of a standard NLMS (Normalized Least Mean Squares) filter on 20ms frames at 16kHz is under 1ms on modern hardware — effectively free relative to the latency budget.
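For reference, the NLMS update is only a few lines over a tap-delay vector of the reference signal. A NumPy sketch; the filter length, step size `mu`, and regularizer `eps` are illustrative, and a production AEC additionally needs delay estimation and double-talk detection, which this omits:

```python
import numpy as np


def nlms_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 64,
                mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Subtract an adaptively filtered copy of the reference signal
    (what the agent is playing) from the mic signal.

    Returns the error signal, i.e., the echo-reduced microphone audio.
    """
    w = np.zeros(taps)                 # adaptive FIR estimate of the echo path
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = ref[max(0, n - taps + 1): n + 1][::-1]  # newest sample first
        x = np.pad(x, (0, taps - len(x)))           # zero-pad during warm-up
        y = w @ x                                    # current echo estimate
        e = mic[n] - y                               # residual after cancellation
        w += (mu / (x @ x + eps)) * e * x            # normalized LMS update
        out[n] = e
    return out
```

The normalization by the input power `x @ x` is what keeps the step size stable across loud and quiet passages, which is why NLMS (rather than plain LMS) is the standard choice for speech.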
Jitter is the more insidious problem. Packet arrival variance in WebSocket-over-TCP streams causes discontinuous audio rendering. The adaptive jitter buffer target delay is calculated as:
$$J = \bar{T} + K \cdot \sigma_T$$
Where:

- $J$ = target jitter buffer size (in milliseconds)
- $\bar{T}$ = mean packet inter-arrival time over a sliding window (typically the last 50 packets)
- $\sigma_T$ = standard deviation of packet inter-arrival times
- $K$ = scaling factor (typically 2–3); higher values trade latency for stability
Set $K = 2$ for co-located datacenter deployments. Set $K = 3$ for consumer internet conditions. The buffer must be dynamically resized every 500ms based on updated $\bar{T}$ and $\sigma_T$ — a static jitter buffer sized for worst-case conditions adds 40–80ms of unnecessary fixed latency.
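The formula maps directly onto a sliding-window estimator that the playback path polls every 500ms. A sketch using the definitions above (the class name is illustrative; `pstdev` supplies $\sigma_T$):

```python
import statistics
from collections import deque
from typing import Deque, Optional


class JitterBufferSizer:
    """Sliding-window estimator for the target jitter buffer delay,
    J = mean(T) + K * stddev(T), over packet inter-arrival times (ms)."""

    def __init__(self, k: float = 2.0, window: int = 50):
        self.k = k  # 2 for co-located datacenter, 3 for consumer internet
        self.intervals_ms: Deque[float] = deque(maxlen=window)
        self._last_arrival_ms: Optional[float] = None

    def on_packet(self, arrival_ms: float) -> None:
        """Record one packet arrival timestamp in milliseconds."""
        if self._last_arrival_ms is not None:
            self.intervals_ms.append(arrival_ms - self._last_arrival_ms)
        self._last_arrival_ms = arrival_ms

    def target_ms(self) -> float:
        """Current target buffer delay; poll every ~500ms to resize."""
        if len(self.intervals_ms) < 2:
            return 0.0  # not enough samples to estimate variance yet
        mean = statistics.fmean(self.intervals_ms)
        sigma = statistics.pstdev(self.intervals_ms)
        return mean + self.k * sigma
```

With perfectly regular 20ms arrivals the target collapses to the frame interval itself; any arrival variance widens it proportionally, which is exactly the adaptive behavior a static buffer cannot provide.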
Pro-Tip: Run jitter buffer calculations on the client side (browser or native app) using the WebRTC `RTCRtpReceiver.getStats()` API. The `jitterBufferDelay` and `jitterBufferEmittedCount` fields provide the raw data needed to compute $\sigma_T$ without custom instrumentation.
## Scaling the Intelligent Edge
Gemini 2.0 Flash demonstrated measurably improved multimodal reasoning throughput compared to Gemini 1.5 Pro in developer benchmarks, with lower TTFT enabling the sub-100ms reasoning budget target. Production voice agents on Kubernetes or Cloud Run must also account for per-user WebSocket overhead: each persistent connection consumes a file descriptor, an async task, and a Redis connection pool slot.
The following benchmark harness measures end-to-end pipeline latency across all components under concurrent load:
```python
import asyncio
import json
import statistics
import time
from dataclasses import dataclass, field
from typing import List

import websockets


@dataclass
class LatencyRecord:
    session_id: str
    stt_ms: float = 0.0
    llm_ttft_ms: float = 0.0
    tts_first_chunk_ms: float = 0.0
    e2e_ms: float = 0.0


@dataclass
class BenchmarkReport:
    records: List[LatencyRecord] = field(default_factory=list)

    def summary(self) -> dict:
        e2e = [r.e2e_ms for r in self.records]
        stt = [r.stt_ms for r in self.records]
        llm = [r.llm_ttft_ms for r in self.records]
        tts = [r.tts_first_chunk_ms for r in self.records]
        return {
            "p50_e2e_ms": statistics.median(e2e),
            "p95_e2e_ms": statistics.quantiles(e2e, n=20)[18],  # 95th percentile
            "p99_e2e_ms": statistics.quantiles(e2e, n=100)[98],  # 99th percentile
            "avg_stt_ms": statistics.mean(stt),
            "avg_llm_ttft_ms": statistics.mean(llm),
            "avg_tts_first_chunk_ms": statistics.mean(tts),
        }


async def run_single_probe(
    endpoint: str, session_id: str, audio_payload: bytes
) -> LatencyRecord:
    """Sends a single audio payload and records per-component timing."""
    record = LatencyRecord(session_id=session_id)
    t_start = time.perf_counter()
    async with websockets.connect(endpoint) as ws:
        await ws.send(audio_payload)
        stt_received = False
        tts_received = False
        async for message in ws:
            elapsed_ms = (time.perf_counter() - t_start) * 1000
            if isinstance(message, str):
                data = json.loads(message)
                if data.get("type") == "transcript_final" and not stt_received:
                    record.stt_ms = elapsed_ms
                    stt_received = True
                elif data.get("type") == "llm_token_first" and stt_received:
                    record.llm_ttft_ms = elapsed_ms - record.stt_ms
            elif isinstance(message, bytes) and not tts_received:
                # First binary frame = first TTS audio chunk
                record.tts_first_chunk_ms = (
                    elapsed_ms - record.stt_ms - record.llm_ttft_ms
                )
                record.e2e_ms = elapsed_ms
                tts_received = True
                break  # probe complete; close connection
    return record


async def run_benchmark(
    endpoint: str, concurrency: int, iterations: int, audio_payload: bytes
) -> BenchmarkReport:
    report = BenchmarkReport()
    semaphore = asyncio.Semaphore(concurrency)  # cap parallel WebSocket connections

    async def probe_with_limit(i: int) -> None:
        async with semaphore:
            record = await run_single_probe(endpoint, f"bench-{i}", audio_payload)
            report.records.append(record)

    await asyncio.gather(*[probe_with_limit(i) for i in range(iterations)])
    return report


# Entry point: run with concurrency=50, 200 total probes
if __name__ == "__main__":
    with open("test_utterance_16khz_pcm.raw", "rb") as f:
        AUDIO = f.read()
    report = asyncio.run(
        run_benchmark("wss://your-agent.example.com/voice/test", 50, 200, AUDIO)
    )
    print(report.summary())
```
API rate limits for Gemini Flash are enforced per project, typically on the order of 1,000–2,000 RPM depending on tier. At 50 concurrent sessions with an average 3-second turn duration, that translates to approximately 1,000 RPM, right at the limit boundary. Engineers must implement exponential backoff with jitter on Gemini API calls and design the orchestrator to gracefully degrade (e.g., return a filler phrase via TTS while retrying) rather than surface a raw 429 error to the voice session.
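A sketch of full-jitter backoff with a hook for the filler-phrase fallback; the `RateLimitError` stand-in and `on_wait` hook are assumptions you would map to your SDK's 429 exception type and a TTS filler call:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for the SDK's 429 / RESOURCE_EXHAUSTED error type."""


async def call_with_backoff(call, *, max_retries: int = 4,
                            base_delay_s: float = 0.25, on_wait=None):
    """Retry an async call on rate-limit errors with full-jitter backoff.

    on_wait(attempt), if provided, is awaited before each retry sleep;
    it is the hook where a filler phrase can be sent to TTS so the user
    hears something other than dead air while the retry is pending.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface to caller
            if on_wait is not None:
                await on_wait(attempt)
            # Full jitter: sleep uniformly in [0, base * 2^attempt]
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

Full jitter (uniform over the whole backoff window, rather than a fixed delay plus small jitter) is the variant that best de-correlates retry storms across many concurrent sessions hitting the same quota.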
## Future-Proofing Voice Agent UX
Human-like voice interaction is a systems property, not a model property. The agent's perceived naturalness is determined by the consistency of sub-400ms response onset, the accuracy of barge-in detection across noisy environments, and the coherence of context across sessions — none of which are solved by a better LLM alone.
The ASR and TTS providers must be hot-swappable via standardized WebSocket interfaces. Deepgram Nova-2 today, a self-hosted Whisper variant tomorrow — the orchestration layer should not require rewriting when provider economics shift. Define a protocol-level contract for what a transcript_final event looks like, and enforce it via an adapter pattern in the gateway.
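One way to pin down that contract is a `Protocol` plus per-vendor adapters. A sketch: the `TranscriptEvent` fields and the `EchoSTTAdapter` toy are illustrative, and a real `DeepgramAdapter` or Whisper adapter would wrap the vendor SDK behind the same two methods.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Protocol, runtime_checkable


@dataclass(frozen=True)
class TranscriptEvent:
    """Protocol-level contract: every STT adapter emits exactly this shape."""
    text: str
    is_final: bool
    confidence: float


@runtime_checkable
class STTAdapter(Protocol):
    """The orchestrator depends only on this interface, never on an SDK."""

    async def send_audio(self, frame: bytes) -> None: ...

    def events(self) -> AsyncIterator[TranscriptEvent]: ...


class EchoSTTAdapter:
    """Toy adapter that 'transcribes' frames by decoding them as UTF-8.

    A DeepgramAdapter or WhisperAdapter would buffer frames into the
    vendor SDK here and translate its callbacks into TranscriptEvents.
    """

    def __init__(self) -> None:
        self._frames: list = []

    async def send_audio(self, frame: bytes) -> None:
        self._frames.append(frame)

    async def events(self) -> AsyncIterator[TranscriptEvent]:
        for frame in self._frames:
            yield TranscriptEvent(text=frame.decode(), is_final=True,
                                  confidence=1.0)
```

Swapping providers then means writing one new adapter class; the gateway's orchestration loop, which consumes only `TranscriptEvent`s, never changes.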
Architectural Bottleneck Evaluation Checklist:
- [ ] **STT Latency:** Are partial/interim results enabled? Is the Deepgram WebSocket connection persistent (not re-established per utterance)?
- [ ] **LLM Streaming:** Is the Gemini call using async streaming (`generate_content_stream`)? Is TTFT measured independently from full response completion?
- [ ] **TTS Chunking:** Does TTS synthesis start on sentence-boundary token clusters, not on full LLM response completion?
- [ ] **Barge-In Correctness:** Is `tts_flush_event` checked in every TTS send loop iteration? Is `speaker_turn` persisted to Redis before the first TTS audio byte is sent?
- [ ] **Redis Latency:** Is Redis co-located in the same region as Cloud Run? Are connection pools pre-warmed (not opened per request)?
- [ ] **Jitter Buffer:** Is the jitter buffer target recalculated every 500ms using a sliding window? Is $K$ configurable per deployment environment?
- [ ] **Echo Cancellation:** Is AEC running client-side via WebRTC for browser clients? Is server-side NLMS AEC implemented for non-browser ingestion paths?
- [ ] **Rate Limit Headroom:** Is Gemini Flash RPM utilization monitored? Is exponential backoff with filler-phrase fallback implemented?
- [ ] **Provider Decoupling:** Can the STT or TTS provider be swapped by changing an adapter class without modifying orchestration logic?
- [ ] **Load Testing:** Have p95 and p99 end-to-end latency been measured under target concurrency using a harness similar to the benchmark above?
The systems that will define conversational AI infrastructure over the next two years are not the ones with the most capable models — they are the ones that have solved the plumbing. Sub-300ms TTFB is achievable today with the stack described here. The ceiling is now UX nuance: prosody-aware TTS, emotionally adaptive response pacing, and turn-taking models trained on real conversation dynamics rather than text corpora.
Keywords: Voice Activity Detection (VAD), Time To First Byte (TTFB), WebRTC, WebSocket-based orchestration, Echo suppression, Jitter buffering, Deepgram Nova-2, Gemini 2.0 Flash SDK, LangGraph, Endpointing logic, Deterministic state machine, Asynchronous I/O, FastAPI, Redis