RAG benchmark comparison: LangChain, LlamaIndex, and Haystack under identical query settings

When frameworks are tested under identical models, embeddings, retrievers, and query budgets, the real differences show up less in answer accuracy and more in orchestration overhead and token efficiency, with benchmarked gaps on the order of milliseconds and hundreds of tokens per query.

Background: what this benchmark is trying to isolate

Most published RAG framework comparisons are not comparable. They mix different models, different embedding strategies, different retrievers, and different prompt budgets — then present the results as if the framework were the independent variable. The AImultiple benchmark took a different approach: fix every component of the stack that isn't the framework itself, then measure what's left.

Under that protocol, LangChain, LlamaIndex, Haystack, LangGraph, and DSPy all ran the same agentic RAG workflow with identical models, embeddings, retriever, web search tool, and temperature settings. The benchmark covered 100 queries repeated 100 times per framework — 10,000 total query executions each — and the result was unambiguous on one dimension and nuanced on another.

Bottom Line: When the model, embeddings, retriever, and tools are held constant, accuracy differences between LangChain, LlamaIndex, and Haystack collapse to zero — all five frameworks tested achieved 100% accuracy on the benchmark's test set, according to AImultiple's controlled comparison. The real differentiation surfaces in orchestration overhead (roughly 3–14 ms per query) and token consumption per query, not in answer correctness. Choosing a framework on accuracy claims alone, without controlling the comparison, is a methodological error.


How the comparison was run under identical query settings

The foundational problem with framework benchmarks is uncontrolled variables, and the AImultiple team's answer was to standardize every component except the framework. As they describe it: "We benchmarked 5 RAG frameworks: LangChain, LangGraph, LlamaIndex, Haystack, and DSPy, by building the same agentic RAG workflow with standardized components... This isolates each framework's true overhead and token efficiency."

To make that isolation real, every component that could introduce noise — model choice, embedding model, vector store, web search tool, and generation temperature — was fixed across all five frameworks. LangChain, LlamaIndex, and Haystack ran identical pipelines; only the orchestration and framework-native abstractions differed.

| Variable | Controlled setting |
| --- | --- |
| LLM | GPT-4.1-mini via OpenRouter |
| Embedding model | BGE-small |
| Vector store / retriever | Qdrant |
| Web search tool | Tavily |
| Generation temperature | 0.0 (deterministic) |
| Query set | 100 queries × 100 repetitions |

Watch Out: Any competitor roundup that doesn't publish this kind of controlled-variable table cannot be treated as an apples-to-apples comparison. Changing the model family, swapping the retriever, or using different prompt budgets can shift latency by orders of magnitude larger than the framework overhead measured here — making LangChain look fast or LlamaIndex look slow depending solely on what else was changed. Non-reproducible benchmarks should not drive production decisions.

Shared model, embedding, retriever, and tool stack

The controlled stack was built around OpenAI-compatible hosted APIs, which carries an important implication: measured latency includes both framework orchestration logic and external API round-trip behavior. Framework overhead was isolated by holding the API calls constant and comparing the delta.

| Component | Choice | Purpose |
| --- | --- | --- |
| LLM | GPT-4.1-mini (OpenRouter, temperature 0.0) | Router, generator, grader nodes |
| Embedding | BGE-small | Document and query encoding |
| Vector store | Qdrant | Retrieval against indexed document chunks |
| Web tool | Tavily | Live web search for agentic queries |
| API protocol | OpenAI-compatible | Consistent across all five frameworks |

The AImultiple benchmark specifies: "Model: All nodes (router, generator, grader) used the openai/gpt-4.1-mini model via the OpenRouter API. Determinism: temperature was set to 0.0 for all LLM calls to ensure maximum consistency in routing, generation, and grading."

Setting temperature to 0.0 is critical for a comparative benchmark: stochastic outputs would require significantly larger sample sizes to average out generation variance, making framework overhead harder to isolate.
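One way to keep these settings from drifting between framework runs is to pin them in a single immutable object that every pipeline reads from. The sketch below is illustrative only: the `ControlledStack` class and its field names are hypothetical and not part of the AImultiple setup, while the values are the ones reported for the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlledStack:
    """Settings shared by every framework under test (names are illustrative)."""

    llm: str = "openai/gpt-4.1-mini"   # served through the OpenRouter API
    embedding_model: str = "BGE-small"
    vector_store: str = "Qdrant"
    web_search_tool: str = "Tavily"
    temperature: float = 0.0           # same value for router, generator, grader
    num_queries: int = 100
    repetitions: int = 100

    @property
    def total_executions(self) -> int:
        # 100 queries x 100 repetitions = 10,000 executions per framework
        return self.num_queries * self.repetitions


# A single shared instance; frozen=True makes accidental overrides raise an error.
STACK = ControlledStack()
```

Because the dataclass is frozen, a framework-specific adapter that tries to change `STACK.temperature` fails immediately instead of silently breaking the comparison.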

Test set size, repetition count, and measurement approach

The benchmark used a 100-query test set and ran each framework through it 100 times, producing 10,000 query executions per framework. Reported metrics are per-query averages across those 10,000 executions — a design that suppresses API jitter and surfaces stable orchestration differentials.

| Metric | Value |
| --- | --- |
| Queries in test set | 100 |
| Repetitions per framework | 100 |
| Total executions per framework | 10,000 |
| Primary latency metric | Average per-query framework overhead (ms) |
| Accuracy metric | Pass/fail on test set ground truth |
| Token metric | Average tokens per query execution |

The tradeoff of this design is that averages over 10,000 runs are excellent for comparative ranking but do not reveal tail latency. P95 and P99 behavior — the numbers that govern real SLA breaches — require a different measurement strategy. The benchmark is optimized for stable comparative measurement, not stress testing.
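To see why an average alone can hide tail behavior, it helps to report percentiles next to the mean when you rerun the protocol yourself. The following is a minimal sketch, not the benchmark's own tooling: `percentile` and `summarize` are hypothetical helper names, and the nearest-rank definition of a percentile is one common choice among several.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> dict[str, float]:
    """Report the mean alongside tail percentiles for one framework's runs."""
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
    }
```

As an illustration with made-up numbers, 98 runs at 5 ms plus one run at 50 ms and one at 200 ms average 7.4 ms, while the nearest-rank P99 is 50 ms. The mean looks healthy even though the tail does not.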


At-a-glance comparison of the frameworks

Under identical stack conditions, all five frameworks delivered equivalent answer quality. The meaningful differences are architectural and operational, not correctness-based. Haystack is the production-pipeline option in the set, with a declarative workflow model that fits enterprise RAG and governance-heavy deployments.

| Dimension | LangChain | LlamaIndex | Haystack |
| --- | --- | --- | --- |
| Primary orientation | Broad-purpose LLM orchestration | Data-first, retrieval-centered | Production pipeline control |
| Benchmark overhead | ~10 ms/query | ~6 ms/query | ~5.9 ms/query |
| Benchmark accuracy | 100% | 100% | 100% |
| Integration breadth | Widest (agents, tools, chains) | Strong on document pipelines | Strong on enterprise deployment |
| Token profile | Higher relative to peers | Lower (retrieval-focused paths) | Lower (structured pipelines) |
| Primary selection signal | Orchestration complexity | Document volume and retrieval depth | Operational control and governance |

The benchmark's central finding is that framework choice cannot be justified by accuracy differences when the stack is held constant. Selection criteria must come from orchestration requirements, retrieval architecture, and operational discipline — not from claims about one framework producing "better" answers.


LangChain under the controlled benchmark

LangChain posted approximately 10 ms of framework overhead per query in the AImultiple controlled setup — the highest of the three main comparison targets, while still achieving 100% accuracy alongside every other framework tested.

| Metric | LangChain |
| --- | --- |
| Framework overhead | ~10 ms/query |
| Benchmark accuracy | 100% |
| Relative token usage | Higher than LlamaIndex and Haystack |

The per-query overhead figure represents the framework's orchestration contribution — chain resolution, tool routing, and abstraction layers — after the model and retriever costs are held constant across all frameworks.
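One way to approximate that orchestration contribution in your own tests is to time the whole query and subtract the time spent inside external calls. This is a sketch under stated assumptions, not the instrumentation AImultiple used (the report's internals are not described here): `OverheadTimer` is a hypothetical helper, and it assumes you can wrap each LLM, retriever, and web search call yourself.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class OverheadTimer:
    """Separates time spent in external calls from total query time."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.external_s = 0.0

    @contextmanager
    def external_call(self) -> Iterator[None]:
        """Wrap each LLM, retriever, or web search call in this context manager."""
        start = self._clock()
        try:
            yield
        finally:
            self.external_s += self._clock() - start

    def measure(self, run_query: Callable[[], object]) -> float:
        """Return the estimated framework overhead for one query, in milliseconds."""
        self.external_s = 0.0
        start = self._clock()
        run_query()
        total_s = self._clock() - start
        # Whatever is left after removing external time is attributed to the framework.
        return (total_s - self.external_s) * 1000.0
```

The clock is injectable so the arithmetic can be checked with a fake clock. This subtraction only works cleanly for sequential calls; concurrent external calls would need per-call accounting.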

Where LangChain absorbs extra orchestration cost

LangChain's extra overhead is effectively the cost of its flexibility. The framework is designed to support heterogeneous agent graphs, pluggable tool ecosystems, memory layers, and multi-step chains. That abstraction breadth means more coordination logic executes on every query, even in a simple agentic RAG workflow where much of that capability sits unused.

The benchmark does not attribute the overhead to a specific feature — the AImultiple report measures the net effect rather than profiling individual components. The causal direction should be stated cautiously: more abstraction layers correlate with more overhead in this test; causality at the component level requires framework-internal profiling.

Pro Tip: LangChain's ~10 ms overhead is not a reason to avoid it; it is the price you pay for orchestration breadth. If your production pipeline requires multi-step agents, diverse tool calls, or frequent workflow reconfiguration, that overhead is likely dwarfed by the developer velocity gained. Treat the overhead figure as a cost signal, not a verdict. Where a simpler pipeline suffices, LangChain's abstraction layers become an unnecessary tax.


LlamaIndex under the controlled benchmark

LlamaIndex posted approximately 6 ms of framework overhead per query in the AImultiple controlled setup, finishing ahead of LangChain and just behind Haystack. Secondary reporting on the same benchmark places LlamaIndex's average token consumption at roughly 1,600 tokens per query, reflecting the framework's retrieval-first architecture.

| Metric | LlamaIndex |
| --- | --- |
| Framework overhead | ~6 ms/query |
| Benchmark accuracy | 100% |
| Reported token usage | ~1,600 tokens/query (secondary source) |

The overhead profile aligns with LlamaIndex's design priorities: the framework organizes around data connectors, index structures, and query engines rather than general-purpose agent orchestration. Fewer abstraction layers for non-retrieval tasks translate into lower framework tax when the workflow is retrieval-dominated.

Where LlamaIndex narrows the token bill

LlamaIndex's token efficiency comes from its architectural focus: the framework is built around retrieval pipelines, and its query engine abstractions are designed to pass well-bounded context windows to the LLM. Where LangChain might carry broader context through chain steps, LlamaIndex tends to scope what reaches the generator.

In the controlled benchmark, that efficiency held without any accuracy penalty — the 100% accuracy result was common across all frameworks. However, the benchmark's fixed context behavior and simple agentic workflow are favorable conditions for a retrieval-focused architecture.

Watch Out: Lower token usage in a controlled benchmark does not automatically translate to lower cost in production. If your workflow introduces retries, multi-hop queries, longer documents, or tool chains that expand context, LlamaIndex's token profile will shift. Validate token efficiency on your own document distribution and query mix before committing to it as a cost reduction lever.


Haystack under the controlled benchmark

Haystack posted approximately 5.9 ms of framework overhead per query in the AImultiple controlled setup — the lowest of the three main comparison targets — while matching the 100% accuracy result that all frameworks achieved under the controlled stack.

| Metric | Haystack |
| --- | --- |
| Framework overhead | ~5.9 ms/query |
| Benchmark accuracy | 100% |
| Relative token usage | Lower (similar tier to LlamaIndex) |

The 0.1 ms gap between Haystack and LlamaIndex is not practically meaningful for most production workloads. What the numbers communicate together is that both frameworks sit in a distinctly lower-overhead tier than LangChain, and that DSPy — a compiled, optimization-driven framework — occupies a separate efficiency class at ~3.53 ms.

Where Haystack fits enterprise RAG pipelines

Haystack is commonly selected not for raw benchmark performance but for pipeline construction discipline. Its declarative pipeline model enforces explicit component contracts, which aids observability, testability, and change management — properties that matter in regulated or high-compliance environments where an LLM pipeline is treated as production software.

Production Note: A benchmark tie on accuracy does not resolve questions about observability, maintainability, or compliance requirements. Haystack's declarative component model gives operations teams explicit control over what runs, in what order, with what fallbacks — and that structural auditability can outweigh a few milliseconds of overhead difference when the system is handling regulated data or requires change-management governance. Enterprise teams evaluating Haystack should weigh these operational properties alongside the benchmark metrics.


Benchmark results: accuracy, overhead, and token usage

The full overhead ordering across all five frameworks, as reported by AImultiple:

| Framework | Overhead (ms/query) | Benchmark accuracy | Relative token usage |
| --- | --- | --- | --- |
| DSPy | ~3.53 | 100% | Not stated |
| Haystack | ~5.9 | 100% | Lower |
| LlamaIndex | ~6 | 100% | Lower (~1,600 tokens/query) |
| LangChain | ~10 | 100% | Higher |
| LangGraph | ~14 | 100% | Not stated |

The table surfaces two distinct findings. First, accuracy is not a differentiator when the stack is fixed. Second, the overhead spread from the most efficient framework (DSPy at ~3.53 ms) to the least efficient (LangGraph at ~14 ms) is roughly 4×. Between the three primary targets — LangChain, LlamaIndex, and Haystack — the spread is approximately 4 ms per query.

As AImultiple states: "Framework overhead is measurable but small: We observed ~3–14 ms per query from orchestration logic."
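The "roughly 4×" and "approximately 4 ms" figures can be checked with a few lines of arithmetic. The overhead values below are the rounded figures quoted in this article; the `spread` helper is a hypothetical name used only for illustration.

```python
from typing import Iterable

# Rounded per-query overheads (ms) as quoted from the benchmark.
OVERHEAD_MS = {
    "DSPy": 3.53,
    "Haystack": 5.9,
    "LlamaIndex": 6.0,
    "LangChain": 10.0,
    "LangGraph": 14.0,
}


def spread(frameworks: Iterable[str]) -> tuple[float, float]:
    """Return (absolute gap in ms, max/min ratio) for the chosen frameworks."""
    values = [OVERHEAD_MS[name] for name in frameworks]
    return max(values) - min(values), max(values) / min(values)


# All five frameworks: DSPy versus LangGraph.
all_gap_ms, all_ratio = spread(OVERHEAD_MS)

# The three primary targets: Haystack versus LangChain.
primary_gap_ms, primary_ratio = spread(["LangChain", "LlamaIndex", "Haystack"])
```

With these inputs the five-framework ratio comes out just under 4 and the gap among the three primary targets is about 4.1 ms, which matches the rounded claims in the text.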

Why accuracy converged to 100% across frameworks

Accuracy converged because every variable that determines answer quality — the model generating the response, the embeddings encoding the query and documents, the retriever selecting context, and the tools providing external information — was held identical across all five frameworks. The framework layer is responsible for orchestrating those components, not for the quality of their outputs. When orchestration is the only variable, correctness differences disappear, and overhead differences emerge.

Pro Tip: This is the core insight for practitioners evaluating RAG frameworks: identical models and retrievers compress accuracy differences to zero and expose orchestration effects. If your evaluation shows accuracy divergence between frameworks, the most likely explanation is that something else in the stack differs — not that the framework intrinsically improves model quality. Accurate benchmark design requires this kind of constraint, and any benchmark that doesn't enforce it is measuring framework + stack together, not framework alone.
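If an evaluation does show accuracy divergence, a quick first step is to diff the stack settings each framework actually ran with. The sketch below assumes you record those settings as plain dictionaries; `uncontrolled_variables` and the setting keys are hypothetical names, not part of any framework's API.

```python
from typing import Any, Mapping


def uncontrolled_variables(
    stacks: Mapping[str, Mapping[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Return every setting whose value is not identical across all frameworks."""
    all_keys = set()
    for settings in stacks.values():
        all_keys.update(settings)

    differing = {}
    for key in sorted(all_keys):
        # A setting missing from one framework counts as a difference (shown as None).
        values = {name: settings.get(key) for name, settings in stacks.items()}
        # Compare by repr so unhashable values such as lists can be compared too.
        if len({repr(value) for value in values.values()}) > 1:
            differing[key] = values
    return differing
```

An empty result means the recorded settings match. Any key that comes back is a candidate explanation for an accuracy gap that should be ruled out before attributing the gap to the framework.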

What the millisecond gaps mean in real production

Framework overhead is a per-query tax. At low query volumes, the aggregate is negligible. At production scale, the numbers accumulate in ways that affect cost and SLA headroom.

$\text{Daily overhead cost (ms)} = \text{overhead per query (ms)} \times \text{queries per day}$

Applying the benchmark's observed range:

| Overhead | 10K queries/day | 100K queries/day |
| --- | --- | --- |
| 3.53 ms (DSPy) | 35,300 ms = 35.3 s aggregate | 353,000 ms = 353 s aggregate |
| 5.9 ms (Haystack) | 59,000 ms = 59 s aggregate | 590,000 ms = 590 s aggregate |
| 6 ms (LlamaIndex) | 60,000 ms = 60 s aggregate | 600,000 ms = 600 s aggregate |
| 10 ms (LangChain) | 100,000 ms = 100 s aggregate | 1,000,000 ms = 1,000 s aggregate |
| 14 ms (LangGraph) | 140,000 ms = 140 s aggregate | 1,400,000 ms = 1,400 s aggregate |

Aggregate latency is not user-perceived wall-clock time: these numbers represent the sum of orchestration overhead across all queries, not any single user's experience. The practical implication is for capacity planning. At 100K queries/day, choosing LangChain over Haystack adds roughly 410,000 ms, or 410 seconds, of additional orchestration work per day, which is about 4.1 ms on the critical path of each query. At high concurrency, that overhead consumes CPU cycles and can degrade tail latency even when average latency looks acceptable.
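The aggregate figures follow directly from the formula above. A minimal sketch of the arithmetic, using the rounded benchmark overheads for LangChain (~10 ms) and Haystack (~5.9 ms); the function names are hypothetical:

```python
def aggregate_overhead_s(overhead_ms: float, queries_per_day: int) -> float:
    """Total orchestration time per day in seconds, summed across all queries."""
    return overhead_ms * queries_per_day / 1000.0


def daily_delta_s(slower_ms: float, faster_ms: float, queries_per_day: int) -> float:
    """Extra aggregate orchestration seconds per day from picking the slower framework."""
    return aggregate_overhead_s(slower_ms, queries_per_day) - aggregate_overhead_s(
        faster_ms, queries_per_day
    )


# LangChain (~10 ms) versus Haystack (~5.9 ms).
delta_10k = daily_delta_s(10.0, 5.9, 10_000)    # about 41 seconds per day
delta_100k = daily_delta_s(10.0, 5.9, 100_000)  # about 410 seconds per day
```

The delta scales linearly with volume, so a tenfold increase in daily queries produces a tenfold increase in aggregate overhead, while the per-query difference stays at about 4.1 ms.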

Token overhead matters separately. If LangChain's higher-token paths mean each query costs more against an API with per-token pricing, the token delta compounds daily cost in a way that doesn't appear in the latency numbers at all.
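The same kind of back-of-the-envelope estimate works for tokens. In the sketch below, only the ~1,600 tokens/query figure comes from the secondary reporting cited earlier; the 2,000-token comparison value and the price per million tokens are placeholders chosen for illustration, not benchmark results or real pricing. The function name is hypothetical.

```python
def daily_token_cost_usd(
    tokens_per_query: float,
    queries_per_day: int,
    usd_per_million_tokens: float,
) -> float:
    """Daily spend at a single blended per-token price (input/output split ignored)."""
    return tokens_per_query * queries_per_day / 1_000_000 * usd_per_million_tokens


# PLACEHOLDER price: substitute your provider's actual rate.
HYPOTHETICAL_USD_PER_MILLION = 0.50

# ~1,600 tokens/query is the reported LlamaIndex figure (secondary source).
baseline = daily_token_cost_usd(1_600, 100_000, HYPOTHETICAL_USD_PER_MILLION)

# 2,000 tokens/query is an ASSUMED value for a heavier pipeline, not a measured one.
heavier = daily_token_cost_usd(2_000, 100_000, HYPOTHETICAL_USD_PER_MILLION)

extra_per_day = heavier - baseline
```

Plug in token counts measured on your own query mix before drawing conclusions; the structure of the calculation matters here, not the placeholder numbers.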


Decision matrix: when to choose each framework

No framework won on accuracy. The decision between LangChain, LlamaIndex, and Haystack reduces to matching framework architecture to production requirements.

| Scenario | Recommended framework | Reason |
| --- | --- | --- |
| Multi-step agent workflows, many integrations | LangChain | Broadest orchestration surface; overhead justified by tool breadth |
| Document-heavy RAG, token cost matters | LlamaIndex | Retrieval-first design, lower token footprint |
| Enterprise deployment, governance matters | Haystack | Declarative pipelines, operational control |
| Lowest possible framework overhead | DSPy | Compiled optimization; different programming model |
| LLM-orchestrated graph workflows | LangGraph | Native graph execution; highest overhead of the five |

Choose LangChain when orchestration breadth matters more than token thrift

Choose LangChain when your production system requires heterogeneous agent graphs, diverse tool integrations, multi-step reasoning chains, or frequent workflow iteration. LangChain's ~10 ms overhead and higher token consumption per query are the measurable cost of maintaining that flexibility. Teams building systems where the workflow itself is a product — where the orchestration logic changes frequently and integrates with many external APIs — typically find LangChain's abstraction layer justified.

The benchmark makes the trade-off explicit: no accuracy gain, ~4 ms more overhead than Haystack, and higher token counts. If those costs fit within your SLA and budget, LangChain's ecosystem depth and developer familiarity often accelerate delivery.

Choose LlamaIndex when retrieval-centric design matters most

Choose LlamaIndex when the system is document-heavy, retrieval is the performance bottleneck, and token efficiency directly affects cost. At ~6 ms framework overhead in the controlled test, and roughly 1,600 tokens per query according to secondary reporting on it, LlamaIndex occupies the same low-overhead tier as Haystack while offering native primitives for indexing, chunking, and multi-document retrieval.

Teams building on large, heterogeneous document corpora — legal, financial, technical documentation — where the retrieval pipeline needs first-class treatment rather than bolt-on integration will find LlamaIndex's architecture aligns with their problem naturally.

Choose Haystack when production pipeline control matters most

Choose Haystack when the operational discipline of the pipeline is as important as its performance characteristics. At ~5.9 ms overhead, Haystack matches LlamaIndex on the benchmark's efficiency tier while providing a declarative component model that treats each pipeline stage as an explicit, inspectable, testable unit.

Regulated industries, teams with strict change-management requirements, or organizations where the RAG pipeline must satisfy audit requirements will find Haystack's structural approach valuable — even if the raw benchmark numbers are nearly identical to LlamaIndex.


Limits of this benchmark and how to reproduce it responsibly

The AImultiple benchmark is well-designed for its stated purpose: isolating framework overhead under controlled conditions. It is not a proxy for all production RAG performance.

| Controlled variable | Benchmark value | Reproduction requirement |
| --- | --- | --- |
| LLM | GPT-4.1-mini via OpenRouter | Match model family and API provider |
| Embedding | BGE-small | Use same or equivalent model |
| Vector store | Qdrant | Match or document substitution |
| Web tool | Tavily | Match or document substitution |
| Temperature | 0.0 | Required for determinism |
| Query set size | 100 queries | Minimum viable; expand for domain coverage |
| Repetitions | 100 per framework | Required for stable averages |

Watch Out: The benchmark used a narrow, controlled test set of 100 queries under a hosted API stack. It does not represent: (1) tail latency under high concurrency; (2) performance on domain-specific document distributions with different chunk sizes or retrieval depths; (3) overhead when agent graphs grow in complexity; (4) self-hosted model deployments where API round-trip behavior is absent. Overgeneralizing from one hosted setup to a production SLA is a methodological error — run the same protocol on your own data, your own infrastructure, and your own query distribution before making a final framework selection.

Reproducing the benchmark requires matching the shared stack as closely as possible: GPT-4.1-mini (or the same model family via OpenRouter), BGE-small embeddings, Qdrant as the vector store, Tavily for web search, temperature 0.0, and a test set with ground-truth labels that supports pass/fail accuracy evaluation. Running fewer than 50 repetitions produces averages that are too noisy to distinguish framework overhead from API jitter.
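Whether a given repetition count is enough can be checked empirically instead of assumed. One rough approach is to compare the gap between two frameworks' mean overheads against the standard error of those means. The helper names are hypothetical, and the two-standard-error threshold is an assumed rule of thumb, not something the benchmark specifies; it also treats samples as independent, which timing data only approximates.

```python
import math
import statistics
from typing import Sequence


def standard_error_ms(samples_ms: Sequence[float]) -> float:
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    return statistics.stdev(samples_ms) / math.sqrt(len(samples_ms))


def means_are_separated(
    a_ms: Sequence[float],
    b_ms: Sequence[float],
    k: float = 2.0,
) -> bool:
    """Rough check: is the gap between means larger than k combined standard errors?"""
    gap = abs(statistics.fmean(a_ms) - statistics.fmean(b_ms))
    combined = math.sqrt(standard_error_ms(a_ms) ** 2 + standard_error_ms(b_ms) ** 2)
    return gap > k * combined
```

If the check fails for two frameworks, adding repetitions shrinks the standard error (it falls with the square root of the sample count) until the gap is either resolved or shown to be within noise.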


FAQ

How do you benchmark RAG frameworks fairly?

Fix everything except the framework. The minimum controlled set is: LLM (same model, same provider, same temperature), embedding model, vector store, retrieval tool, web search tool, query budget, and context caps. Run the same query set at least 50–100 times per framework to average out API jitter. Measure framework overhead separately from model latency and retrieval latency. The AImultiple benchmark applied this approach with 100 queries × 100 repetitions, which produced the stable per-framework overhead averages reported here.

Is LlamaIndex better than LangChain for RAG?

Under identical conditions, neither produced better answers — both achieved 100% accuracy on the benchmark's test set. LlamaIndex posted lower framework overhead (~6 ms vs. ~10 ms) and lower token consumption, making it more cost-efficient for retrieval-dominated workloads. LangChain's advantage is orchestration breadth, not correctness. "Better" depends on whether your bottleneck is token cost and retrieval efficiency (LlamaIndex) or workflow flexibility and integration coverage (LangChain).

Why is Haystack used?

Haystack is selected primarily for production pipeline discipline rather than raw benchmark performance. Its declarative pipeline model makes each processing stage explicit and testable, which matters for observability, compliance, and change management in enterprise environments. In this benchmark, it also posted the lowest overhead of the three primary targets (~5.9 ms/query), but that margin over LlamaIndex (~6 ms) is not the primary selection reason — operational control is.

What is the difference between LangChain, LlamaIndex, and Haystack?

LangChain is a general-purpose LLM orchestration framework with the widest tool and agent integration surface. LlamaIndex is retrieval-centered, built around document indexing and query engines with first-class support for data ingestion pipelines. Haystack is pipeline-first, with a declarative component model suited to production deployments where auditability and operational control matter. All three can build functional RAG systems; the differences are in where each framework applies its deepest abstractions and what it optimizes for.

Do the 3–14 ms overhead differences matter in production?

At low query volumes they don't. At 100K queries/day, the gap between DSPy (~3.53 ms) and LangGraph (~14 ms) accumulates to over 1,000 seconds of aggregate orchestration work per day. Between LangChain (~10 ms) and Haystack (~5.9 ms), the delta is about 410 seconds of additional orchestration work per day at 100K queries. Whether that matters depends on your concurrency model, SLA headroom, and whether the overhead sits on the critical path of every user request.

