AI & ML

Implementing Council Mode: A Multi-Agent Consensus Architecture for Reducing LLM Hallucination

By utilizing the Council Mode multi-agent consensus framework, engineers can achieve a 35.9% relative reduction in hallucination rates on the HaluEval benchmark, albeit at the cost of increased latency due to parallel inference across heterogeneous models.

By AxiomLogica Editorial

Apr 22, 202616 min read

Reviewed by Editorial

Single-model inference has a hard ceiling. No matter how aggressively you prompt-engineer a frontier model, its hallucination rate is bounded by its training distribution and its inability to self-verify against independent reasoning paths. Council Mode, formalized in ArXiv:2604.02923, breaks that ceiling by treating factual synthesis as a consensus problem—routing every query through architecturally heterogeneous models and reconciling their outputs via a structured Chairman synthesis step. The result is a 35.9% relative reduction in hallucination rates on HaluEval and a 7.8-point gain on TruthfulQA versus the best-performing individual model in the council. The cost is real: you're running N parallel inference calls. This guide covers everything required to implement it in production.

The Architecture of Consensus: Beyond Simple Prompting

Single-model agentic architectures fail in high-reliability scenarios for a structural reason: any error in the model's internal representation propagates unchecked into the final output. There is no external verification loop. Confidence scores are not calibrated probabilities; they are logit distributions that measure fluency, not factual accuracy.

Council Mode imposes a three-phase verification structure that separates query interpretation, evidence generation, and synthesis into discrete, independently auditable steps.

flowchart TD
    A([Incoming Query]) --> B[Phase 1: Triage Agent]
    B --> C{Task Decomposition}
    C --> D[Expert Prompt A\nClaude Opus]
    C --> E[Expert Prompt B\nGemini Pro]
    C --> F[Expert Prompt C\nGPT-4o]
    D --> G[(Parallel Async Execution)]
    E --> G
    F --> G
    G --> H{Response Collector\nTimeout / Backoff Handler}
    H --> I[Phase 3: Chairman Model\nStructured Synthesis]
    I --> J[Weighted Consensus JSON]
    J --> K([Final Verified Response])
    H -- Timeout/429 --> L[Re-queue with Backoff]
    L --> G

Phase 1 (Triage): A lightweight router model parses the incoming query, classifies its reasoning type (factual, analytical, generative), and constructs per-model prompts tuned to each expert's known strengths.

Phase 2 (Parallel Expert Generation): N architecturally diverse models execute concurrently. Architectural diversity is non-negotiable—querying three GPT-4o instances provides replication, not independent verification.

Phase 3 (Structured Consensus Synthesis): A designated Chairman model receives all N responses and synthesizes a final answer using a structured prompt that forces explicit conflict resolution and confidence attribution.

Technical Warning: The three-phase structure only reduces hallucination when the expert pool is architecturally diverse. Running multiple checkpoints of the same base model (e.g., two Claude variants) shares training distribution artifacts and will not replicate the benchmark gains.

Phase 1: Triage and Query Distribution

The Triage agent's job is not merely to route—it is to reformulate. A single user query often contains multiple implicit sub-claims that benefit from targeted prompt framing per model family. A factual lookup framed as an instruction to Claude should differ structurally from the same lookup sent to Gemini, which excels at multi-step chain-of-thought when prompted with explicit step separators.

import json
from dataclasses import dataclass, field
from typing import Literal

ModelFamily = Literal["claude", "gemini", "gpt4o"]

@dataclass
class ExpertTask:
    model_id: str
    family: ModelFamily
    prompt: str
    task_type: str
    metadata: dict = field(default_factory=dict)

def triage_query(raw_query: str, context: str = "") -> list[ExpertTask]:
    """
    Decomposes a user query into family-specific expert prompts.
    Prompt framing is tuned per model family's documented strengths.
    """
    base_context = f"Context: {context}\n\n" if context else ""

    tasks = [
        ExpertTask(
            model_id="anthropic/claude-opus-4",
            family="claude",
            # Claude performs well with direct factual assertions + caveats
            prompt=(
                f"{base_context}Answer the following with explicit confidence "
                f"qualifications for any claim that is not verifiable from "
                f"widely-established fact:\n\n{raw_query}"
            ),
            task_type="factual_with_hedging",
        ),
        ExpertTask(
            model_id="google/gemini-2.5-pro",
            family="gemini",
            # Gemini handles multi-step decomposition with labeled steps
            prompt=(
                f"{base_context}Step 1: Identify the core factual claims in "
                f"the question. Step 2: Answer each claim separately. "
                f"Step 3: Synthesize a final answer.\n\nQuestion: {raw_query}"
            ),
            task_type="chain_of_thought",
        ),
        ExpertTask(
            model_id="openai/gpt-4o",
            family="gpt4o",
            # GPT-4o benefits from structured output framing upfront
            prompt=(
                f"{base_context}Respond in two sections: "
                f"[REASONING] and [ANSWER]. "
                f"In [REASONING], surface any assumptions you are making. "
                f"In [ANSWER], state your conclusion.\n\nQuery: {raw_query}"
            ),
            task_type="structured_reasoning",
        ),
    ]
    return tasks

The Triage layer also writes task metadata to the .council/ state directory before dispatching. This is covered in the production engineering section, but initiating state write at this phase is critical for recovery.

Phase 2: Parallel Generation and Handling API Bottlenecks

Parallel inference against multiple frontier models requires a robust strategy for maintaining LLM Reliability, especially when production-scale requests hit 429 rate limits. This is not an edge case—it is the default behavior when querying OpenRouter or direct model APIs under any sustained load. The async execution layer must treat 429s and timeouts as first-class control flow, not exceptions.

import asyncio
import time
import httpx
import logging
from typing import Optional

logger = logging.getLogger(__name__)

OPENROUTER_BASE = "https://openrouter.ai/api/v1/chat/completions"

async def call_model_with_backoff(
    client: httpx.AsyncClient,
    task: "ExpertTask",  # ExpertTask from Phase 1
    api_key: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    timeout_seconds: float = 30.0,
) -> Optional[dict]:
    """
    Executes a single model call with exponential backoff.
    Returns None on exhaustion so the Chairman can handle degraded consensus.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://your-app.internal",  # Required by OpenRouter
    }
    payload = {
        "model": task.model_id,
        "messages": [{"role": "user", "content": task.prompt}],
        "temperature": 0.2,  # Low temp: reduce stochastic divergence per agent
    }

    for attempt in range(max_retries):
        try:
            response = await asyncio.wait_for(
                client.post(OPENROUTER_BASE, headers=headers, json=payload),
                timeout=timeout_seconds,
            )
            if response.status_code == 200:
                return response.json()

            if response.status_code == 429:
                # Respect Retry-After header if present; fall back to backoff
                retry_after = float(
                    response.headers.get("Retry-After", base_delay * (2 ** attempt))
                )
                logger.warning(
                    f"429 on {task.model_id}, attempt {attempt+1}. "
                    f"Sleeping {retry_after:.1f}s."
                )
                await asyncio.sleep(retry_after)
                continue

            # Non-retryable HTTP error (4xx except 429)
            if 400 <= response.status_code < 500:
                logger.error(f"Non-retryable {response.status_code} for {task.model_id}")
                return None

            # 5xx: server error, retry with backoff
            await asyncio.sleep(base_delay * (2 ** attempt))

        except asyncio.TimeoutError:
            logger.warning(f"Timeout on {task.model_id}, attempt {attempt+1}")
            await asyncio.sleep(base_delay * (2 ** attempt))
        except httpx.RequestError as exc:
            logger.error(f"Network error for {task.model_id}: {exc}")
            await asyncio.sleep(base_delay * (2 ** attempt))

    logger.error(f"Exhausted retries for {task.model_id}. Returning None.")
    return None


async def execute_parallel_generation(
    tasks: list["ExpertTask"], api_key: str
) -> dict[str, Optional[dict]]:
    """
    Dispatches all expert tasks concurrently; collects results keyed by model_id.
    """
    async with httpx.AsyncClient() as client:
        coroutines = {
            task.model_id: call_model_with_backoff(client, task, api_key)
            for task in tasks
        }
        results = await asyncio.gather(*coroutines.values(), return_exceptions=False)
    return dict(zip(coroutines.keys(), results))

Pro-Tip: Set temperature at or below 0.2 for all expert agents. Higher temperature increases response diversity but also increases the probability of divergent hallucinations that are harder for the Chairman to reconcile. Save stochastic variation for creative generation pipelines, not factual synthesis.

Mitigating Latency in Multi-Agent Inference

The P95 latency of a Council Mode call is bounded by the slowest model in the pool, not the average. Engineers must model this explicitly when setting SLA targets.

Model	Median Latency (TTFT)	P95 Latency	Relative Accuracy (HaluEval subset)
GPT-4o (single)	1.1s	3.4s	Baseline
Claude Opus 4 (single)	1.8s	4.9s	+4.2%
Gemini 2.5 Pro (single)	1.4s	3.9s	+3.1%
Council (3-model)	2.1s (parallel)	6.8s	+35.9% relative

Latency figures are representative estimates based on OpenRouter-mediated calls under moderate load; your network topology and token lengths will shift these values.

The parallel ceiling sits at ~2.1s median because all three models execute concurrently. The P95 spike to 6.8s reflects tail latency from occasional model-side cold starts. Two practical mitigations: (1) implement a speculative fast-path that returns the first model's response immediately if the query confidence score from Triage exceeds a threshold, and (2) configure the Chairman to synthesize from N-1 responses if one expert exceeds a hard timeout, rather than blocking the entire pipeline.

Phase 3: The Chairman Synthesis Protocol

The Chairman model is the most critical component and the most commonly misconfigured. Its prompt must accomplish three things simultaneously: parse potentially conflicting factual claims, assign credibility weights, and emit a structured output that downstream systems can parse deterministically. Free-form Chairman prompts produce free-form outputs; that is not consensus, that is paraphrasing.

import json
from typing import Optional

CHAIRMAN_SYSTEM_PROMPT = """You are a synthesis arbiter. You receive structured responses from multiple expert AI models.
Your task is to produce a single, verified consensus response.

Rules:
1. If experts agree on a claim, mark it HIGH_CONFIDENCE.
2. If experts partially agree (majority), mark it MEDIUM_CONFIDENCE with a note on the dissent.
3. If experts fundamentally conflict on a claim, mark it LOW_CONFIDENCE and present both positions.
4. Never fabricate information not present in at least one expert response.
5. Output ONLY valid JSON matching the schema below. No prose outside the JSON block.

Output Schema:
{
  "consensus_answer": "<primary synthesized answer>",
  "confidence_level": "<HIGH | MEDIUM | LOW>",
  "supporting_claims": [
    {"claim": "<string>", "confidence": "<HIGH | MEDIUM | LOW>", "source_models": ["<model_id>"]}
  ],
  "conflicts_detected": [
    {"topic": "<string>", "positions": {"<model_id>": "<position>"}}
  ],
  "abstentions": ["<model_id>"]
}"""

def build_chairman_prompt(
    original_query: str,
    expert_responses: dict[str, Optional[dict]],
) -> str:
    """
    Formats expert responses into a structured input block for the Chairman.
    Handles None responses (timeout/failure) by marking them as abstentions.
    """
    expert_blocks = []
    for model_id, response in expert_responses.items():
        if response is None:
            expert_blocks.append(f"[EXPERT: {model_id}]\nSTATUS: ABSTAINED (timeout or error)\n")
        else:
            content = response["choices"][0]["message"]["content"]
            expert_blocks.append(f"[EXPERT: {model_id}]\n{content}\n")

    experts_formatted = "\n---\n".join(expert_blocks)

    return (
        f"ORIGINAL QUERY:\n{original_query}\n\n"
        f"EXPERT RESPONSES:\n{experts_formatted}\n\n"
        f"Synthesize a consensus response following your system instructions."
    )

Technical Warning: If the Chairman model is from the same model family as a majority of the expert pool, its synthesis will be biased toward that family's training distribution. Use a model from a different provider than the plurality of your expert pool as Chairman.

Defining the Consensus Mathematical Weights

Rather than treating all expert responses equally, production deployments should weight each model's contribution by its empirical benchmark performance on query-type-specific evaluations. Define the weighted consensus score as:

$$ S_{consensus} = \frac{\sum_{i=1}^{N} w_i \cdot c_i \cdot r_i}{\sum_{i=1}^{N} w_i \cdot c_i} $$

Where: - $w_i$ = task-type weight for model $i$ (derived from domain-specific benchmark performance) - $c_i$ = confidence score emitted by model $i$ (normalized 0–1) - $r_i$ = response vector for model $i$ (binary 1/0 per factual claim in structured output) - $N$ = number of non-abstaining experts

This function is computed per factual claim in the Chairman's supporting_claims array, not across the entire response. A claim that achieves $S_{consensus} \geq 0.7$ with $N \geq 2$ non-abstaining experts passes the HIGH_CONFIDENCE threshold. Below 0.5 with conflicting positions triggers LOW_CONFIDENCE and surfaces to the end consumer.

Task-type weights $w_i$ require calibration: run your candidate model pool against a domain-representative evaluation set (50–100 labeled examples) and compute per-model F1 per task type before deploying to production.

Production Engineering: State Persistence and .council/ Directory

Implementing a robust Agentic Architecture requires tracking mid-execution state, as failures in multi-agent pipelines are otherwise unrecoverable. The .council/ directory provides a file-system-backed checkpoint store for every in-flight session, enabling both crash recovery and mid-execution re-planning when the Chairman detects irreconcilable conflicts.

import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional

COUNCIL_DIR = Path(".council")

def init_session(query: str) -> str:
    """Creates a new session directory and writes the initial state file."""
    COUNCIL_DIR.mkdir(exist_ok=True)
    session_id = str(uuid.uuid4())[:8]
    session_path = COUNCIL_DIR / session_id
    session_path.mkdir()

    state = {
        "session_id": session_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "phase": "triage",
        "expert_responses": {},
        "chairman_output": None,
        "status": "in_progress",
    }
    _write_state(session_path, state)
    return session_id


def update_phase(session_id: str, phase: str, **kwargs: Any) -> None:
    """Atomically updates the session state at phase transition."""
    state = load_state(session_id)
    state["phase"] = phase
    state["updated_at"] = datetime.now(timezone.utc).isoformat()
    state.update(kwargs)  # Merge any additional fields (e.g., expert_responses)
    _write_state(COUNCIL_DIR / session_id, state)


def load_state(session_id: str) -> dict:
    state_file = COUNCIL_DIR / session_id / "state.json"
    if not state_file.exists():
        raise FileNotFoundError(f"No session found: {session_id}")
    return json.loads(state_file.read_text())


def _write_state(session_path: Path, state: dict) -> None:
    """Writes state atomically via a temp file to prevent corruption on crash."""
    target = session_path / "state.json"
    tmp = session_path / "state.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(target)  # Atomic on POSIX; near-atomic on Windows NTFS

For high-throughput deployments (>50 concurrent sessions), replace the file-based store with Redis:

import redis
import json

_redis = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL = 3600  # 1 hour expiry for orphaned sessions

def update_phase_redis(session_id: str, phase: str, **kwargs: Any) -> None:
    key = f"council:session:{session_id}"
    state = json.loads(_redis.get(key) or "{}")
    state["phase"] = phase
    state.update(kwargs)
    _redis.setex(key, SESSION_TTL, json.dumps(state))

Resilience in Multi-Agent Systems: Graceful Degradation

A Council Mode instance where two of three experts have timed out cannot produce a valid consensus—it produces a single-model response with consensus theater. The degradation handler must make this explicit and decide whether to re-queue, reduce quorum, or fail open.

import asyncio
from typing import Optional

async def execute_with_degradation(
    tasks: list,
    api_key: str,
    min_quorum: int = 2,
    requeue_timeout: float = 15.0,
) -> tuple[dict, list[str]]:
    """
    Runs parallel generation with dynamic re-queuing for failed experts.
    Returns (responses_dict, list_of_abstaining_model_ids).
    """
    results = await execute_parallel_generation(tasks, api_key)

    failed_tasks = [t for t in tasks if results.get(t.model_id) is None]
    succeeded_count = len(tasks) - len(failed_tasks)

    # Attempt one requeue pass for failed models before quorum evaluation
    if failed_tasks and succeeded_count < min_quorum:
        requeue_results = await asyncio.wait_for(
            execute_parallel_generation(failed_tasks, api_key),
            timeout=requeue_timeout,
        )
        results.update(requeue_results)

    # Re-evaluate quorum after requeue
    abstentions = [model_id for model_id, r in results.items() if r is None]
    active_count = len(tasks) - len(abstentions)

    if active_count < min_quorum:
        raise RuntimeError(
            f"Quorum failure: {active_count}/{len(tasks)} experts responded. "
            f"Minimum quorum is {min_quorum}. Aborting consensus."
        )

    return results, abstentions

Pro-Tip: Log quorum failures to a separate telemetry stream. A rising quorum failure rate on a specific model indicates upstream rate limit quota exhaustion, not transient errors. Adjust w_i weights downward for that model until quota recovers.

Evaluating HaluEval Gains vs. Operational Cost

The benchmark numbers warrant unpacking, with updated findings from the latest research paper ArXiv:2604.02923v2. The paper reports: "Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark." Combined with the 7.8-point TruthfulQA improvement, these are not marginal wins—they represent the difference between a system that requires human review on every output and one that can operate with targeted spot-checking.

The architectural mechanism behind these gains: HaluEval tasks are specifically designed around factual claims where a single model will confidently confabulate. When three architecturally distinct models independently generate responses, they draw on different pretraining corpora, RLHF pipelines, and instruction-tuning datasets. A hallucination that is consistent across all three models would require the same false association to be encoded in all three training distributions—statistically improbable for obscure factual claims.

Configuration	HaluEval Score	TruthfulQA	Cost per 1K calls (est.)	Latency P95
Single model (GPT-4o)	Baseline	Baseline	~$15	3.4s
Single model (Claude Opus 4)	+4.2%	+1.9pt	~$22	4.9s
Council Mode (3-model)	+35.9% relative	+7.8pt	~$58–$72	6.8s
Council Mode (2-model, degraded)	+18–22% relative	+3–4pt	~$35–$45	4.2s

Cost estimates assume OpenRouter pricing tiers as of Q1 2026 and average 800-token request/response pairs. Production token lengths will shift these values linearly.

The cost multiplier is 3–5x over a single-model call. Whether this is justified depends entirely on the downstream cost of a hallucination. For legal document analysis, medical triage support, or financial data extraction, a 35.9% reduction in hallucination rate at a 4x cost increase is unambiguously ROI-positive. For a customer-facing FAQ bot, it is almost certainly over-engineered.

Future-Proofing Your Agentic Deployment

The scalability and reliability of Multi-Agent Systems depend on an architecture that is more durable than any specific model it currently uses. The Triage-Generation-Synthesis pattern is model-agnostic: as frontier models evolve, you replace pool members without modifying the consensus logic.

Key scaling considerations:

Pool rotation cadence: Evaluate your expert pool against domain benchmarks quarterly. Replace underperforming models; the 35.9% gain depends on architectural diversity, which erodes if the entire industry converges on similar base architectures.
MoE-native integration: As Mixture-of-Experts models become the dominant serving architecture, the internal expert routing within a single MoE model partially overlaps with Council Mode's external routing. The distinction matters: internal MoE routing is not independently verifiable; external multi-model consensus is. They are complementary, not substitutable.
Adaptive quorum by query type: Triage should route low-risk queries (general knowledge) to a 2-model council and high-stakes queries (financial, medical) to a 4+ model council. Static quorum sizes are a deployment convenience, not an architectural requirement.
Chairman model versioning: Pin the Chairman model version explicitly in your .council/ session state. A Chairman model update mid-deployment can silently shift synthesis behavior; your evaluation metrics should reflect the Chairman version, not just the pool.
Feedback loop integration: Log all LOW_CONFIDENCE outputs and resolved conflicts back into a domain-specific evaluation set. Over time, this builds a ground-truth corpus for recalibrating $w_i$ weights.

Finalizing Your Implementation Blueprint

#!/usr/bin/env bash
# Council Mode environment bootstrap
# Requires Python 3.10+, tested on Ubuntu 22.04 and macOS 14+

set -euo pipefail

# --- 1. Python environment ---
python3 --version | grep -E "3\.(1[0-9])" || {
    echo "Python 3.10+ required"; exit 1;
}

python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install \
    httpx==0.27.2 \
    redis==5.0.8 \
    asyncio-throttle==1.0.2 \
    python-dotenv==1.0.1

# --- 2. Working directory initialization ---
mkdir -p .council
touch .council/.gitkeep

# Exclude session state from version control
grep -qxF '.council/' .gitignore 2>/dev/null || echo '.council/' >> .gitignore

# --- 3. Environment variables ---
cat > .env << 'EOF'
# OpenRouter API key — https://openrouter.ai/keys
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Optional: Redis for high-concurrency deployments
REDIS_HOST=localhost
REDIS_PORT=6379

# Council configuration
COUNCIL_MIN_QUORUM=2
COUNCIL_MAX_RETRIES=5
COUNCIL_TIMEOUT_SECONDS=30
COUNCIL_SESSION_DIR=.council
EOF

echo "Council Mode environment initialized."
echo "Set OPENROUTER_API_KEY in .env before running."

# --- 4. Validate connectivity (dry run) ---
source .env
python3 - << 'PYEOF'
import os, httpx, asyncio
from dotenv import load_dotenv
load_dotenv()

async def ping():
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    async with httpx.AsyncClient() as c:
        r = await c.get("https://openrouter.ai/api/v1/models", headers=headers)
        print(f"OpenRouter connectivity: HTTP {r.status_code}")
        assert r.status_code == 200, "API key invalid or network unreachable"

asyncio.run(ping())
PYEOF

Technical Warning: Never commit .env to version control. The .gitignore addition in step 2 covers .council/ session state but not .env—add it explicitly: echo '.env' >> .gitignore.

The full execution flow ties together as: init_session → triage_query → execute_with_degradation → update_phase → build_chairman_prompt → Chairman API call → parse JSON output → update_phase(status="complete"). Each phase writes to .council/<session_id>/state.json, giving you a full audit trail of every council decision with associated model responses and confidence scores.

Keywords: Multi-Agent Consensus, Agentic Architecture, LLM Hallucination Mitigation, Mixture-of-Experts (MoE), Heterogeneous Model Inference, Triage-Generation-Synthesis Pattern, OpenRouter API, State Persistence, Prompt Chaining, Asynchronous Latency Management

Was this guide helpful?

Share: X · LinkedIn · Reddit