Background: what this benchmark is trying to isolate
Most published RAG framework comparisons are not comparable. They mix different models, different embedding strategies, different retrievers, and different prompt budgets — then present the results as if the framework were the independent variable. The AImultiple benchmark took a different approach: fix every component of the stack that isn't the framework itself, then measure what's left.
Under that protocol, LangChain, LlamaIndex, Haystack, LangGraph, and DSPy all ran the same agentic RAG workflow with identical models, embeddings, retriever, web search tool, and temperature settings. The benchmark covered 100 queries repeated 100 times per framework — 10,000 total query executions each — and the result was unambiguous on one dimension and nuanced on another.
Bottom Line: When the model, embeddings, retriever, and tools are held constant, accuracy differences between LangChain, LlamaIndex, and Haystack collapse to zero — all five frameworks tested achieved 100% accuracy on the benchmark's test set, according to AImultiple's controlled comparison. The real differentiation surfaces in orchestration overhead (roughly 3–14 ms per query) and token consumption per query, not in answer correctness. Choosing a framework on accuracy claims alone, without controlling the comparison, is a methodological error.
How the comparison was run under identical query settings
The foundational problem with framework benchmarks is uncontrolled variables, and the AImultiple team's answer was "standardized component" isolation. As they describe it: "We benchmarked 5 RAG frameworks: LangChain, LangGraph, LlamaIndex, Haystack, and DSPy, by building the same agentic RAG workflow with standardized components... This isolates each framework's true overhead and token efficiency."
To make that isolation real, every component that could introduce noise — model choice, embedding model, vector store, web search tool, and generation temperature — was fixed across all five frameworks. LangChain, LlamaIndex, and Haystack ran identical pipelines; only the orchestration and framework-native abstractions differed.
| Variable | Controlled Setting |
|---|---|
| LLM | GPT-4.1-mini via OpenRouter |
| Embedding model | BGE-small |
| Vector store / retriever | Qdrant |
| Web search tool | Tavily |
| Generation temperature | 0.0 (deterministic) |
| Query set | 100 queries × 100 repetitions |
Watch Out: Any competitor roundup that doesn't publish this kind of controlled-variable table cannot be treated as an apples-to-apples comparison. Changing the model family, swapping the retriever, or using different prompt budgets can shift latency by amounts that are orders of magnitude larger than the framework overhead measured here, which can make LangChain look fast or LlamaIndex look slow depending solely on what else was changed. Non-reproducible benchmarks should not drive production decisions.
Shared model, embedding, retriever, and tool stack
The controlled stack was built around OpenAI-compatible hosted APIs, which carries an important implication: measured latency includes both framework orchestration logic and external API round-trip behavior. Framework overhead was isolated by holding the API calls constant and comparing the delta.
| Component | Choice | Purpose |
|---|---|---|
| LLM | GPT-4.1-mini (OpenRouter, temperature 0.0) | Router, generator, grader nodes |
| Embedding | BGE-small | Document and query encoding |
| Vector store | Qdrant | Retrieval against indexed document chunks |
| Web tool | Tavily | Live web search for agentic queries |
| API protocol | OpenAI-compatible | Consistent across all five frameworks |
The AImultiple benchmark specifies: "Model: All nodes (router, generator, grader) used the openai/gpt-4.1-mini model via the OpenRouter API. Determinism: temperature was set to 0.0 for all LLM calls to ensure maximum consistency in routing, generation, and grading."
Setting temperature to 0.0 is critical for a comparative benchmark: stochastic outputs would require significantly larger sample sizes to average out generation variance, making framework overhead harder to isolate.
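One way to keep a reproduction honest is to write the shared stack down as a single immutable configuration and diff it between runs. The sketch below is illustrative only: the `ControlledStack` class and `stack_differences` helper are hypothetical names, not part of the AImultiple benchmark code, and the field values are copied from the tables above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ControlledStack:
    """Settings held constant across frameworks (values from the tables above)."""
    llm: str = "openai/gpt-4.1-mini"
    llm_provider: str = "OpenRouter"
    embedding_model: str = "BGE-small"
    vector_store: str = "Qdrant"
    web_search_tool: str = "Tavily"
    temperature: float = 0.0
    queries: int = 100
    repetitions: int = 100

    @property
    def total_executions(self) -> int:
        # 100 queries x 100 repetitions per framework
        return self.queries * self.repetitions


def stack_differences(a: ControlledStack, b: ControlledStack) -> dict:
    """Return {field: (a_value, b_value)} for every setting that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


baseline = ControlledStack()
# Hypothetical misconfigured run: same stack, but a non-zero temperature.
drifted = ControlledStack(temperature=0.7)

print(baseline.total_executions)              # 10000
print(stack_differences(baseline, baseline))  # {}
print(stack_differences(baseline, drifted))   # {'temperature': (0.0, 0.7)}
```

An empty diff between two runs is a cheap precondition to check before comparing their overhead numbers.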
Test set size, repetition count, and measurement approach
The benchmark used a 100-query test set and ran each framework through it 100 times, producing 10,000 query executions per framework. Reported metrics are per-query averages across those 10,000 executions — a design that suppresses API jitter and surfaces stable orchestration differentials.
| Metric | Value |
|---|---|
| Queries in test set | 100 |
| Repetitions per framework | 100 |
| Total executions per framework | 10,000 |
| Primary latency metric | Average per-query framework overhead (ms) |
| Accuracy metric | Pass/fail on test set ground truth |
| Token metric | Average tokens per query execution |
The tradeoff of this design is that averages over 10,000 runs are excellent for comparative ranking but do not reveal tail latency. P95 and P99 behavior — the numbers that govern real SLA breaches — require a different measurement strategy. The benchmark is optimized for stable comparative measurement, not stress testing.
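The gap between averages and tail behavior is easy to see on a toy sample. The following sketch uses synthetic, made-up overhead values (not benchmark data) and Python's standard `statistics` module to show how a mean can look healthy while P99 does not; the `summarize` helper is a hypothetical name.

```python
import statistics


def summarize(samples_ms):
    """Return mean, P95 and P99 of per-query overhead samples (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic, illustrative data only: 980 fast executions and 20 slow ones.
samples = [5.0] * 980 + [60.0] * 20
stats = summarize(samples)

print(f"mean={stats['mean']:.1f} ms  p95={stats['p95']:.1f} ms  p99={stats['p99']:.1f} ms")
# mean=6.1 ms  p95=5.0 ms  p99=60.0 ms
```

In this toy sample the mean moves by about 1 ms while the slowest 2% of executions are twelve times slower than the rest, which is the kind of behavior a mean-only report cannot surface.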
At-a-glance comparison of the frameworks
Under identical stack conditions, all five frameworks delivered equivalent answer quality. The meaningful differences are architectural and operational, not correctness-based. Haystack is the production-pipeline option in the set, with a declarative workflow model that fits enterprise RAG and governance-heavy deployments.
| Dimension | LangChain | LlamaIndex | Haystack |
|---|---|---|---|
| Primary orientation | Broad-purpose LLM orchestration | Data-first, retrieval-centered | Production pipeline control |
| Benchmark overhead | ~10 ms/query | ~6 ms/query | ~5.9 ms/query |
| Benchmark accuracy | 100% | 100% | 100% |
| Integration breadth | Widest (agents, tools, chains) | Strong on document pipelines | Strong on enterprise deployment |
| Token profile | Higher relative to peers | Lower (retrieval-focused paths) | Lower (structured pipelines) |
| Primary selection signal | Orchestration complexity | Document volume and retrieval depth | Operational control and governance |
The benchmark's central finding is that framework choice cannot be justified by accuracy differences when the stack is held constant. Selection criteria must come from orchestration requirements, retrieval architecture, and operational discipline — not from claims about one framework producing "better" answers.
LangChain under the controlled benchmark
LangChain posted approximately 10 ms of framework overhead per query in the AImultiple controlled setup — the highest of the three main comparison targets, while still achieving 100% accuracy alongside every other framework tested.
| Metric | LangChain |
|---|---|
| Framework overhead | ~10 ms/query |
| Benchmark accuracy | 100% |
| Relative token usage | Higher than LlamaIndex and Haystack |
The per-query overhead figure represents the framework's orchestration contribution — chain resolution, tool routing, and abstraction layers — after the model and retriever costs are held constant across all frameworks.
Where LangChain absorbs extra orchestration cost
LangChain's extra overhead is effectively the cost of its flexibility. The framework is designed to support heterogeneous agent graphs, pluggable tool ecosystems, memory layers, and multi-step chains. That abstraction breadth means more coordination logic executes on every query, even in a simple agentic RAG workflow where much of that capability sits unused.
The benchmark does not attribute the overhead to a specific feature — the AImultiple report measures the net effect rather than profiling individual components. The causal direction should be stated cautiously: more abstraction layers correlate with more overhead in this test; causality at the component level requires framework-internal profiling.
Pro Tip: LangChain's ~10 ms overhead is not a reason to avoid it — it is a price you pay for orchestration breadth. If your production pipeline requires multi-step agents, diverse tool calls, or frequent workflow reconfiguration, that overhead is likely dwarfed by the developer velocity gained. Treat the overhead figure as a cost signal, not a verdict. Where a simpler pipeline suffices, LangChain's abstraction layers become unnecessary tax.
LlamaIndex under the controlled benchmark
LlamaIndex posted approximately 6 ms of framework overhead per query in the AImultiple controlled setup, finishing ahead of LangChain and just behind Haystack. Secondary reporting on the same benchmark places LlamaIndex's average token consumption at roughly 1,600 tokens per query, reflecting the framework's retrieval-first architecture.
| Metric | LlamaIndex |
|---|---|
| Framework overhead | ~6 ms/query |
| Benchmark accuracy | 100% |
| Reported token usage | ~1,600 tokens/query (secondary source) |
The overhead profile aligns with LlamaIndex's design priorities: the framework organizes around data connectors, index structures, and query engines rather than general-purpose agent orchestration. Fewer abstraction layers for non-retrieval tasks translate into lower framework tax when the workflow is retrieval-dominated.
Where LlamaIndex narrows the token bill
LlamaIndex's token efficiency comes from its architectural focus: the framework is built around retrieval pipelines, and its query engine abstractions are designed to pass well-bounded context windows to the LLM. Where LangChain might carry broader context through chain steps, LlamaIndex tends to scope what reaches the generator.
In the controlled benchmark, that efficiency held without any accuracy penalty — the 100% accuracy result was common across all frameworks. However, the benchmark's fixed context behavior and simple agentic workflow are favorable conditions for a retrieval-focused architecture.
Watch Out: Lower token usage in a controlled benchmark does not automatically translate to lower cost in production. If your workflow introduces retries, multi-hop queries, longer documents, or tool chains that expand context, LlamaIndex's token profile will shift. Validate token efficiency on your own document distribution and query mix before committing to it as a cost reduction lever.
Haystack under the controlled benchmark
Haystack posted approximately 5.9 ms of framework overhead per query in the AImultiple controlled setup — the lowest of the three main comparison targets — while matching the 100% accuracy result that all frameworks achieved under the controlled stack.
| Metric | Haystack |
|---|---|
| Framework overhead | ~5.9 ms/query |
| Benchmark accuracy | 100% |
| Relative token usage | Lower (similar tier to LlamaIndex) |
The 0.1 ms gap between Haystack and LlamaIndex is not practically meaningful for most production workloads. What the numbers communicate together is that both frameworks sit in a distinctly lower-overhead tier than LangChain, and that DSPy — a compiled, optimization-driven framework — occupies a separate efficiency class at ~3.53 ms.
Where Haystack fits enterprise RAG pipelines
Haystack is commonly selected not for raw benchmark performance but for pipeline construction discipline. Its declarative pipeline model enforces explicit component contracts, which aids observability, testability, and change management — properties that matter in regulated or high-compliance environments where an LLM pipeline is treated as production software.
Production Note: A benchmark tie on accuracy does not resolve questions about observability, maintainability, or compliance requirements. Haystack's declarative component model gives operations teams explicit control over what runs, in what order, with what fallbacks — and that structural auditability can outweigh a few milliseconds of overhead difference when the system is handling regulated data or requires change-management governance. Enterprise teams evaluating Haystack should weigh these operational properties alongside the benchmark metrics.
Benchmark results: accuracy, overhead, and token usage
The full overhead ordering across all five frameworks, as reported by AImultiple:
| Framework | Overhead (ms/query) | Benchmark Accuracy | Relative Token Usage |
|---|---|---|---|
| DSPy | ~3.53 ms | 100% | — |
| Haystack | ~5.9 ms | 100% | Lower |
| LlamaIndex | ~6 ms | 100% | Lower (~1,600 tokens/query) |
| LangChain | ~10 ms | 100% | Higher |
| LangGraph | ~14 ms | 100% | — |
The table surfaces two distinct findings. First, accuracy is not a differentiator when the stack is fixed. Second, the overhead spread from the most efficient framework (DSPy at ~3.53 ms) to the least efficient (LangGraph at ~14 ms) is roughly 4×. Between the three primary targets — LangChain, LlamaIndex, and Haystack — the spread is approximately 4 ms per query.
As AImultiple states: "Framework overhead is measurable but small: We observed ~3–14 ms per query from orchestration logic."
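The two spread figures can be checked directly from the reported overheads. This is a small arithmetic sketch using the values from the table above; the variable names are illustrative.

```python
# Per-query overhead in ms, as reported in the results table.
overhead_ms = {
    "DSPy": 3.53,
    "Haystack": 5.9,
    "LlamaIndex": 6.0,
    "LangChain": 10.0,
    "LangGraph": 14.0,
}

fastest = min(overhead_ms, key=overhead_ms.get)
slowest = max(overhead_ms, key=overhead_ms.get)
ratio = overhead_ms[slowest] / overhead_ms[fastest]

# Spread among the three primary comparison targets.
primary = ["LangChain", "LlamaIndex", "Haystack"]
primary_values = [overhead_ms[name] for name in primary]
primary_spread = max(primary_values) - min(primary_values)

print(f"{slowest} vs {fastest}: {ratio:.2f}x")        # LangGraph vs DSPy: 3.97x
print(f"primary spread: {primary_spread:.1f} ms")     # primary spread: 4.1 ms
```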
Why accuracy converged to 100% across frameworks
Accuracy converged because every variable that determines answer quality — the model generating the response, the embeddings encoding the query and documents, the retriever selecting context, and the tools providing external information — was held identical across all five frameworks. The framework layer is responsible for orchestrating those components, not for the quality of their outputs. When orchestration is the only variable, correctness differences disappear, and overhead differences emerge.
Pro Tip: This is the core insight for practitioners evaluating RAG frameworks: identical models and retrievers compress accuracy differences to zero and expose orchestration effects. If your evaluation shows accuracy divergence between frameworks, the most likely explanation is that something else in the stack differs — not that the framework intrinsically improves model quality. Accurate benchmark design requires this kind of constraint, and any benchmark that doesn't enforce it is measuring framework + stack together, not framework alone.
What the millisecond gaps mean in real production
Framework overhead is a per-query tax. At low query volumes, the aggregate is negligible. At production scale, the numbers accumulate in ways that affect cost and SLA headroom.
$$\text{Daily overhead cost (ms)} = \text{overhead per query (ms)} \times \text{queries per day}$$
Applying the benchmark's observed range:
| Overhead | 10K queries/day | 100K queries/day |
|---|---|---|
| 3.53 ms (DSPy) | 35,300 ms = 35.3 s aggregate | 353,000 ms = 353 s aggregate |
| 5.9 ms (Haystack) | 59,000 ms = 59 s aggregate | 590,000 ms = 590 s aggregate |
| 6 ms (LlamaIndex) | 60,000 ms = 60 s aggregate | 600,000 ms = 600 s aggregate |
| 10 ms (LangChain) | 100,000 ms = 100 s aggregate | 1,000,000 ms = 1,000 s aggregate |
| 14 ms (LangGraph) | 140,000 ms = 140 s aggregate | 1,400,000 ms = 1,400 s aggregate |
Aggregate latency is not user-perceived wall-clock time: these numbers are the sum of orchestration overhead across all queries, not any single user's experience. The practical implication is for capacity planning. At 100K queries/day, choosing LangChain over Haystack adds roughly 410,000 ms, or 410 seconds (about 7 minutes), of orchestration work per day, spread across the critical path of every query at about 4.1 ms each. At high concurrency, that overhead consumes CPU cycles and can degrade tail latency even when average latency looks acceptable.
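The aggregate figures follow from one multiplication, so they are easy to recompute for your own traffic. The sketch below applies the daily-overhead formula to the per-query figures from the results table; `daily_overhead_seconds` and `daily_delta_seconds` are hypothetical helper names.

```python
# Per-query overhead in ms, taken from the results table.
OVERHEAD_MS = {
    "DSPy": 3.53,
    "Haystack": 5.9,
    "LlamaIndex": 6.0,
    "LangChain": 10.0,
    "LangGraph": 14.0,
}


def daily_overhead_seconds(overhead_ms: float, queries_per_day: int) -> float:
    """Overhead per query (ms) times queries per day, converted to seconds."""
    return overhead_ms * queries_per_day / 1000.0


def daily_delta_seconds(a: str, b: str, queries_per_day: int) -> float:
    """Extra aggregate orchestration time per day when choosing a over b."""
    return daily_overhead_seconds(OVERHEAD_MS[a] - OVERHEAD_MS[b], queries_per_day)


print(f"{'framework':<12}{'10K/day':>12}{'100K/day':>12}")
for name, ms in OVERHEAD_MS.items():
    low = daily_overhead_seconds(ms, 10_000)
    high = daily_overhead_seconds(ms, 100_000)
    print(f"{name:<12}{low:>10.1f} s{high:>10.1f} s")
```

Calling `daily_delta_seconds("LangChain", "Haystack", queries_per_day)` with your own volume gives the pairwise difference for any two frameworks in the table.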
Token overhead matters separately. If LangChain's higher-token paths mean each query costs more against an API with per-token pricing, the token delta compounds daily cost in a way that doesn't appear in the latency numbers at all.
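A rough way to reason about the token side is to price it per day. In the sketch below, only the ~1,600 tokens/query figure comes from the secondary reporting cited above; the heavier token count and the per-token price are hypothetical placeholders, so substitute your measured token counts and your provider's actual pricing.

```python
def daily_token_cost_usd(tokens_per_query: float,
                         queries_per_day: int,
                         usd_per_million_tokens: float) -> float:
    """Daily spend for a flat per-token price (input/output split ignored)."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000


REPORTED_TOKENS_PER_QUERY = 1_600        # LlamaIndex, secondary source
HYPOTHETICAL_HEAVIER_TOKENS = 2_000      # placeholder, NOT a benchmark figure
HYPOTHETICAL_USD_PER_MILLION = 0.40      # placeholder price, NOT a real quote
QUERIES_PER_DAY = 100_000

lean = daily_token_cost_usd(REPORTED_TOKENS_PER_QUERY, QUERIES_PER_DAY,
                            HYPOTHETICAL_USD_PER_MILLION)
heavy = daily_token_cost_usd(HYPOTHETICAL_HEAVIER_TOKENS, QUERIES_PER_DAY,
                             HYPOTHETICAL_USD_PER_MILLION)

print(f"lean path:  ${lean:.2f}/day")
print(f"heavy path: ${heavy:.2f}/day")
print(f"delta:      ${heavy - lean:.2f}/day")
```

The point of the exercise is the shape of the calculation: the token delta scales linearly with query volume and is independent of the millisecond figures.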
Decision matrix: when to choose each framework
No framework won on accuracy. The decision between LangChain, LlamaIndex, and Haystack reduces to matching framework architecture to production requirements.
| Scenario | Recommended Framework | Reason |
|---|---|---|
| Multi-step agent workflows, many integrations | LangChain | Broadest orchestration surface; overhead justified by tool breadth |
| Document-heavy RAG, token cost matters | LlamaIndex | Retrieval-first design, lower token footprint |
| Enterprise deployment, governance matters | Haystack | Declarative pipelines, operational control |
| Lowest possible framework overhead | DSPy | Compiled optimization; different programming model |
| LLM-orchestrated graph workflows | LangGraph | Native graph execution; highest overhead of the five |
Choose LangChain when orchestration breadth matters more than token thrift
Choose LangChain when your production system requires heterogeneous agent graphs, diverse tool integrations, multi-step reasoning chains, or frequent workflow iteration. LangChain's ~10 ms overhead and higher token consumption per query are the measurable cost of maintaining that flexibility. Teams building systems where the workflow itself is a product — where the orchestration logic changes frequently and integrates with many external APIs — typically find LangChain's abstraction layer justified.
The benchmark makes the trade-off explicit: no accuracy gain, ~4 ms more overhead than Haystack, and higher token counts. If those costs fit within your SLA and budget, LangChain's ecosystem depth and developer familiarity often accelerate delivery.
Choose LlamaIndex when retrieval-centric design matters most
Choose LlamaIndex when the system is document-heavy, retrieval is the performance bottleneck, and token efficiency directly affects cost. At ~6 ms framework overhead and approximately 1,600 tokens per query in the controlled test, LlamaIndex occupies the same low-overhead tier as Haystack while offering native primitives for indexing, chunking, and multi-document retrieval.
Teams building on large, heterogeneous document corpora — legal, financial, technical documentation — where the retrieval pipeline needs first-class treatment rather than bolt-on integration will find LlamaIndex's architecture aligns with their problem naturally.
Choose Haystack when production pipeline control matters most
Choose Haystack when the operational discipline of the pipeline is as important as its performance characteristics. At ~5.9 ms overhead, Haystack matches LlamaIndex on the benchmark's efficiency tier while providing a declarative component model that treats each pipeline stage as an explicit, inspectable, testable unit.
Regulated industries, teams with strict change-management requirements, or organizations where the RAG pipeline must satisfy audit requirements will find Haystack's structural approach valuable — even if the raw benchmark numbers are nearly identical to LlamaIndex.
Limits of this benchmark and how to reproduce it responsibly
The AImultiple benchmark is well-designed for its stated purpose: isolating framework overhead under controlled conditions. It is not a proxy for all production RAG performance.
| Controlled variable | Benchmark value | Reproduction requirement |
|---|---|---|
| LLM | GPT-4.1-mini via OpenRouter | Match model family and API provider |
| Embedding | BGE-small | Use same or equivalent model |
| Vector store | Qdrant | Match or document substitution |
| Web tool | Tavily | Match or document substitution |
| Temperature | 0.0 | Required for determinism |
| Query set size | 100 queries | Minimum viable; expand for domain coverage |
| Repetitions | 100 per framework | Required for stable averages |
Watch Out: The benchmark used a narrow, controlled test set of 100 queries under a hosted API stack. It does not represent: (1) tail latency under high concurrency; (2) performance on domain-specific document distributions with different chunk sizes or retrieval depths; (3) overhead when agent graphs grow in complexity; (4) self-hosted model deployments where API round-trip behavior is absent. Overgeneralizing from one hosted setup to a production SLA is a methodological error — run the same protocol on your own data, your own infrastructure, and your own query distribution before making a final framework selection.
Reproducing the benchmark requires matching the shared stack as closely as possible: GPT-4.1-mini (or the same model family via OpenRouter), BGE-small embeddings, Qdrant as the vector store, Tavily for web search, temperature 0.0, and a test set with ground-truth labels that supports pass/fail accuracy evaluation. As a rule of thumb, running fewer than about 50 repetitions is likely to produce averages that are too noisy to distinguish framework overhead from API jitter.
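The link between repetition count and noise can be sketched with the standard error of the mean, which shrinks with the square root of the sample count. The jitter value below is a hypothetical placeholder (the benchmark does not report one), and the calculation assumes independent samples, which real API traffic may not satisfy.

```python
import math


def standard_error_ms(jitter_std_ms: float, queries: int, repetitions: int) -> float:
    """Standard error of the mean over queries * repetitions independent samples."""
    return jitter_std_ms / math.sqrt(queries * repetitions)


# Placeholder: assumed per-execution standard deviation, NOT a measured value.
HYPOTHETICAL_JITTER_STD_MS = 20.0

for reps in (10, 50, 100):
    se = standard_error_ms(HYPOTHETICAL_JITTER_STD_MS, queries=100, repetitions=reps)
    print(f"{reps:>3} repetitions -> standard error ~{se:.3f} ms")
```

Comparing the resulting standard error against the gap you need to resolve (for example ~4 ms between LangChain and Haystack, or 0.1 ms between Haystack and LlamaIndex) indicates whether a given repetition count is adequate for your own jitter level.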
FAQ
How do you benchmark RAG frameworks fairly?
Fix everything except the framework. The minimum controlled set is: LLM (same model, same provider, same temperature), embedding model, vector store, retrieval tool, web search tool, query budget, and context caps. Run the same query set at least 50–100 times per framework to average out API jitter. Measure framework overhead separately from model latency and retrieval latency. The AImultiple benchmark applied this approach with 100 queries × 100 repetitions, which produced the stable per-framework overhead averages reported here.
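Measuring framework overhead separately from component latency can be approximated by timing every external call and subtracting that from the total. The sketch below is one possible approach, not the method AImultiple documents: `OverheadTimer`, `fake_llm`, and `toy_pipeline` are hypothetical names, and the pipeline is a stand-in rather than real framework code.

```python
import time
from typing import Callable


class OverheadTimer:
    """Splits one query's wall time into component time and everything else."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.component_seconds = 0.0

    def call_component(self, fn, *args, **kwargs):
        """Run an LLM, retriever, or tool call and book its duration as component time."""
        start = self.clock()
        try:
            return fn(*args, **kwargs)
        finally:
            self.component_seconds += self.clock() - start

    def overhead_seconds(self, pipeline, query):
        """Wall time of pipeline(query, timer) minus time spent inside components."""
        self.component_seconds = 0.0
        start = self.clock()
        pipeline(query, self)
        return (self.clock() - start) - self.component_seconds


# Hypothetical stand-ins for a real LLM call and a real framework pipeline.
def fake_llm(prompt):
    time.sleep(0.05)  # pretend network round trip
    return f"answer to: {prompt}"


def toy_pipeline(query, timer):
    prompt = query.strip().lower()  # stands in for orchestration work
    return timer.call_component(fake_llm, prompt)


timer = OverheadTimer()
overhead = timer.overhead_seconds(toy_pipeline, "  What is RAG?  ")
print(f"component time: {timer.component_seconds * 1000:.1f} ms")
print(f"framework overhead: {overhead * 1000:.3f} ms")
```

The clock is injectable so the subtraction logic can be unit-tested with a fake clock; in a real harness the wrapped calls would be the framework's actual LLM, retriever, and web search invocations.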
Is LlamaIndex better than LangChain for RAG?
Under identical conditions, neither produced better answers — both achieved 100% accuracy on the benchmark's test set. LlamaIndex posted lower framework overhead (~6 ms vs. ~10 ms) and lower token consumption, making it more cost-efficient for retrieval-dominated workloads. LangChain's advantage is orchestration breadth, not correctness. "Better" depends on whether your bottleneck is token cost and retrieval efficiency (LlamaIndex) or workflow flexibility and integration coverage (LangChain).
Why is Haystack used?
Haystack is selected primarily for production pipeline discipline rather than raw benchmark performance. Its declarative pipeline model makes each processing stage explicit and testable, which matters for observability, compliance, and change management in enterprise environments. In this benchmark, it also posted the lowest overhead of the three primary targets (~5.9 ms/query), but that margin over LlamaIndex (~6 ms) is not the primary selection reason — operational control is.
What is the difference between LangChain, LlamaIndex, and Haystack?
LangChain is a general-purpose LLM orchestration framework with the widest tool and agent integration surface. LlamaIndex is retrieval-centered, built around document indexing and query engines with first-class support for data ingestion pipelines. Haystack is pipeline-first, with a declarative component model suited to production deployments where auditability and operational control matter. All three can build functional RAG systems; the differences are in where each framework applies its deepest abstractions and what it optimizes for.
Do the 3–14 ms overhead differences matter in production?
At low query volumes they don't. At 100K queries/day, the gap between DSPy (~3.53 ms) and LangGraph (~14 ms) accumulates to over 1,000 seconds of aggregate orchestration work per day. Between LangChain (~10 ms) and Haystack (~5.9 ms), the delta is about 410 seconds of additional orchestration work per day at 100K queries. Whether that matters depends on your concurrency model, SLA headroom, and whether the overhead sits on the critical path of every user request.
Sources & References
- AImultiple — RAG Frameworks: LangChain vs LangGraph vs LlamaIndex vs Haystack vs DSPy benchmark — Primary source for all benchmark overhead figures, accuracy results, shared stack specification, and test design details cited throughout this article.
- LangChain official site — Framework documentation and integration reference.
- LlamaIndex official site — Framework documentation and retrieval pipeline reference.
- Haystack by deepset official site — Framework documentation and enterprise deployment reference.