
MultiHop-RAG benchmarks explained: what the dataset reveals about iterative retrieval

MultiHop-RAG shows that existing RAG methods struggle when evidence is spread across 2 to 4 documents. The benchmark's 2,556-query setup exposes the weakness of single-pass retrieval and motivates iterative retrieval. Because the paper demonstrates this on a news-article knowledge base, the result is strong evidence for multi-hop failure modes rather than proof of a universal fix.


Why MultiHop-RAG matters for multi-hop RAG

Single-pass RAG systems retrieve a top-k set of chunks and hand them to a language model. That architecture works when a query's complete answer lives in one or two semantically proximate passages. It fails systematically when the evidence is fragmented — when answering requires combining facts from documents that share no obvious lexical or embedding-space overlap with the query alone.

MultiHop-RAG (Tang and Yang, 2024) is a benchmark designed to make that failure visible and measurable. It provides 2,556 queries whose supporting evidence is explicitly distributed across 2 to 4 documents, a ground-truth answer for each query, and a benchmark structure intended to separate retrieval quality from downstream reasoning quality. The authors' core finding is direct: "existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence."

Bottom Line: MultiHop-RAG provides a dedicated benchmark showing that retrieval failure, not just reasoning failure, is the primary bottleneck for multi-hop queries in RAG pipelines. Its 2,556 queries, each with evidence spread across 2 to 4 documents of a news-article knowledge base, give engineers a reproducible diagnostic for the specific failure mode that agentic and iterative retrieval architectures are meant to address. The benchmark does not prove any specific iterative architecture works in production; it proves that single-pass retrieval breaks here, which is the prerequisite question.

How the benchmark was built

The MultiHop-RAG construction pipeline produces three artifacts for each query: the query itself, its ground-truth answer, and the set of supporting documents whose evidence must be combined to reach that answer. The knowledge base is a curated English news-article corpus. The evaluation is split into two independent phases — retrieval quality and response quality — so that a retrieval failure does not contaminate the reasoning measurement and vice versa. The paper labels this process in its figure caption as "Figure 2: MultiHop-RAG Construction Pipeline."

| Component | Detail |
| --- | --- |
| Dataset name | MultiHop-RAG |
| Query count | 2,556 |
| Evidence span per query | 2 to 4 documents |
| Knowledge base type | English news-article corpus |
| Evaluation split | Retrieval evaluation + Response evaluation (independent) |
| Primary source | arXiv 2401.15391 |

The separation of retrieval and response evaluation is architecturally significant: most prior multi-hop benchmarks conflate the two, making it impossible to determine whether a model failed because it retrieved the wrong passages or because it could not reason over correct ones. MultiHop-RAG's two-stage design resolves that ambiguity.
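
For orientation, the sketch below shows the shape of one such record as a Python dict: the query, its ground-truth answer, and the supporting evidence set. The field names and values are illustrative assumptions for this article, not the repository's exact JSON schema.

# Hypothetical shape of one MultiHop-RAG record (field names and values are
# illustrative, not the repository's exact schema): the query, its ground-truth
# answer, and the 2-4 supporting documents whose evidence must be combined.
example_record = {
    "query": "Which company acquired the startup founded by the executive who left OpenAI in 2023?",
    "answer": "Example Corp",  # placeholder ground-truth answer
    "evidence": [              # 2-4 supporting documents per query
        {"doc_id": "news_0132", "title": "Executive departs OpenAI", "excerpt": "..."},
        {"doc_id": "news_0847", "title": "Startup raises seed round", "excerpt": "..."},
        {"doc_id": "news_1190", "title": "Acquisition announced", "excerpt": "..."},
    ],
}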

The news-article knowledge base and why it matters

The benchmark uses an English news-article corpus as its knowledge base rather than Wikipedia, a general web crawl, or a synthetic domain. News articles exhibit several properties that sharpen retrieval difficulty: they are temporally ordered, they cover overlapping entities from different angles across multiple documents, and they contain partial information about any given event that can only be resolved by cross-document integration. This makes the corpus a high-fidelity stress test for cross-document evidence chaining.

Pro Tip: The news-article corpus choice is a deliberate design decision that increases retrieval difficulty by distributing topically related but stylistically distinct passages across many documents. This is a sharper diagnostic than a Wikipedia-based benchmark because Wikipedia articles are internally structured to be self-contained. The implication: results from MultiHop-RAG are strong evidence for cross-document retrieval failure modes in editorial-style corpora, but they should not be treated as proof that identical failure rates will appear in enterprise knowledge bases (technical documentation, legal corpora, financial filings) with different chunk topology.

How the 2,556 queries were generated and labeled

Each query in MultiHop-RAG requires integrating information from 2 to 4 documents. The construction pipeline generates multi-hop queries from the news corpus, assigns a ground-truth answer, and annotates the set of supporting documents required to reach that answer. The response evaluation phase then uses the ground-truth answer directly: after a retriever returns candidate passages, a language model synthesizes an answer from those passages, and that answer is compared against the ground truth.

The paper describes the response evaluation rationale explicitly: "Response Evaluation: Since the multi-hop query requires reasoning over multiple pieces of retrieved chunks, we can also evaluate the reasoning capability of the LLM by comparing the LLM response with the ground truth answer of the query."

| Stage | What is annotated | Purpose |
| --- | --- | --- |
| Query generation | Multi-hop question from news articles | Defines the retrieval target |
| Ground-truth answer | Final answer requiring 2–4 sources | Anchors response evaluation |
| Supporting evidence set | 2–4 specific documents per query | Anchors retrieval evaluation |
| Retrieval eval | Did the retriever surface all required documents? | Isolates retrieval quality |
| Response eval | Does the LLM answer match ground truth? | Isolates reasoning quality |

The pipeline's explicit annotation of the supporting evidence set is what makes the retrieval evaluation meaningful: you can measure whether a given retriever recovered all necessary documents for a query, independent of whether the language model then used those documents correctly.
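
A minimal sketch of that per-query retrieval check, assuming the annotated evidence set and the retriever output are both available as lists of document identifiers (the identifier naming is an assumption for illustration):

def evidence_recall(retrieved_doc_ids, gold_doc_ids):
    """Fraction of the annotated evidence set that the retriever recovered."""
    gold = set(gold_doc_ids)
    return len(gold & set(retrieved_doc_ids)) / len(gold) if gold else 0.0

# Example: a 3-document evidence chain where the retriever found 2 of 3 documents.
print(evidence_recall(
    retrieved_doc_ids=["news_0132", "news_0847", "news_0412"],
    gold_doc_ids=["news_0132", "news_0847", "news_1190"],
))  # -> 0.666...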

Why 2-to-4-document evidence is the key stress test

A naive top-k retriever optimizes for a single relevance signal: the similarity between the query embedding and each document chunk's embedding. When the answer to a query requires combining evidence across multiple documents, a single retrieval pass routinely misses part of the chain whenever k is practically bounded.

MultiHop-RAG's 2-to-4-document evidence constraint operationalizes exactly this failure mode. The minimum case captures short cross-document dependency chains, while the maximum case creates longer evidence chains where each retrieval step is conditioned on prior context that the retriever never received.

Pro Tip: Standard top-k retrieval with k=5 or k=10 can accidentally retrieve all required documents for a simple bridge query — which inflates recall metrics and masks the structural failure. MultiHop-RAG's evidence chains requiring 3 to 4 documents are the benchmark's diagnostic core: they expose the cases where a single-pass retriever cannot succeed by luck. When evaluating a RAG system with this benchmark, filter results by evidence-chain length before reporting aggregate recall numbers.
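
A sketch of that per-chain-length reporting, building on the recall idea above; the (gold, retrieved) pair layout is an assumption for illustration, not the repository's output format:

from collections import defaultdict

def coverage_by_chain_length(results):
    """Full-evidence coverage rate, grouped by evidence-chain length (2, 3, or 4)."""
    buckets = defaultdict(list)
    for gold_ids, retrieved_ids in results:
        fully_covered = set(gold_ids) <= set(retrieved_ids)  # every required doc retrieved?
        buckets[len(gold_ids)].append(fully_covered)
    return {length: sum(flags) / len(flags) for length, flags in sorted(buckets.items())}

# Report the 2-, 3-, and 4-document coverage rates separately instead of one aggregate number.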

HotpotQA addressed multi-hop QA on Wikipedia with supporting-fact supervision, and it remains the reference point for explainable multi-hop QA research. MultiHop-RAG differs in corpus (news articles vs. Wikipedia), in explicit RAG pipeline framing (retrieval + generation as separate eval targets), and in the absence of distractor-paragraph supervision. The two benchmarks are complementary, not interchangeable.

What the retrieval experiments show

MultiHop-RAG runs retrieval evaluation independently from response evaluation — a design choice that lets practitioners directly measure how well any given retriever surfaces the complete evidence set for each query, before asking a language model to reason over it.

The paper's Table 5 comparison across embedding models produces a clear qualitative finding: "there is still a significant gap in retrieving relevant evidence for the multi-hop queries." This result holds across the embedding models tested, meaning the failure is not an artifact of a suboptimal embedding choice.

| Evaluation target | What is measured | Key finding |
| --- | --- | --- |
| Retrieval (Table 5) | Embedding model vs. evidence recall | Significant gap persists across all tested embeddings |
| Response (separate) | LLM answer vs. ground truth | Evaluated after retrieval; isolates reasoning failure |
| Evidence coverage | All 2–4 required docs retrieved? | Primary retrieval success criterion |
| Embedding models tested | Multiple (see paper Table 5 for full numeric results) | No tested model closes the gap |
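
Reproducing this comparison on your own corpus means rebuilding the index with a different embedding model and rerunning the same evidence-recall evaluation. Below is a minimal sketch against the pinned llama-index==0.9.40 API; the specific HuggingFace model name is an example, not one drawn from the paper.

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# Rebuild the index with an alternative embedding model, then rerun the same
# evidence-recall evaluation to compare retrievers on equal footing.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # example model
service_context = ServiceContext.from_defaults(embed_model=embed_model)

documents = SimpleDirectoryReader("data/news_articles").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
retriever = index.as_retriever(similarity_top_k=10)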

The practical implication of the Table 5 finding is that retrieval failure on multi-hop queries is not primarily a vector-space representation problem. The embeddings are not simply bad — they are structurally insufficient for queries whose full intent cannot be expressed in a single query vector.

Why embedding choice still leaves a gap

Switching embedding models does not fix multi-hop retrieval failure because the failure is architectural, not representational. A better embedding model can improve single-hop recall and reduce false positives on ambiguous passages, but it cannot make a single query vector simultaneously express both the initial retrieval intent and the conditioned retrieval intent that only becomes apparent after reading the first retrieved document.

Watch Out: Teams that benchmark their LlamaIndex retrieval pipeline only on single-hop queries and then upgrade to a higher-quality embedding model will see genuine gains on those queries — but MultiHop-RAG's results demonstrate that those gains do not extrapolate to multi-hop scenarios. Before attributing retrieval failures to embedding quality, classify your query set by evidence-chain complexity.

The benchmark's finding that "there is still a significant gap in retrieving relevant evidence for the multi-hop queries" across all tested embeddings suggests that closing the gap requires a retrieval architecture change — iterative retrieval, query decomposition, or a re-ranking step that conditions on intermediate retrieved context — rather than a drop-in embedding upgrade.
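
To make the architectural alternative concrete, here is a minimal two-hop sketch in which an LLM rewrites the follow-up query conditioned on what the first pass retrieved. It assumes the llama-index 0.9.x API pinned by the repository and an OpenAI API key; it is one common iterative pattern, not the paper's method or a production design.

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

documents = SimpleDirectoryReader("data/news_articles").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)
llm = OpenAI(model="gpt-4")

query = "Which company acquired the startup founded by the executive who left OpenAI in 2023?"

# Hop 1: retrieve on the original query.
first_pass = retriever.retrieve(query)
context = first_pass[0].node.get_content() if first_pass else ""

# Hop 2: rewrite the query conditioned on the hop-1 evidence, then retrieve again.
# This conditioning step is exactly what a single query vector cannot express.
followup = llm.complete(
    f"Given this passage:\n{context}\n\n"
    f"Write one focused search query for the missing fact needed to answer: {query}"
).text.strip()
second_pass = retriever.retrieve(followup)

evidence = first_pass + second_pass  # merged evidence set handed to the generator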

What the reasoning experiments show once evidence is retrieved

Once the retrieval phase is complete — whether it succeeds or fails — MultiHop-RAG's response evaluation asks a separate question: given the retrieved passages, can the language model synthesize a correct answer? This two-phase design means the benchmark measures reasoning capability independently of retrieval quality, provided the evaluation is run with the ground-truth evidence set supplied directly to the model.

The paper evaluates proprietary models including GPT-4, ChatGPT, PaLM, and Claude-2.1, alongside open-weight models including Llama-2-70b-chat-hf and Mixtral-8x7B-Instruct, on this task. The framing is explicit: the goal is to determine whether models can reason over multiple retrieved chunks to match a ground-truth answer.

| Model evaluated | Model class | Evaluation condition |
| --- | --- | --- |
| GPT-4 | Proprietary LLM | Response vs. ground truth given retrieved chunks |
| ChatGPT | Proprietary LLM | Response vs. ground truth given retrieved chunks |
| PaLM (Google) | Proprietary LLM | Response vs. ground truth given retrieved chunks |
| Llama-2-70b-chat-hf | Open-weight 70B | Response vs. ground truth given retrieved chunks |
| Mixtral-8x7B-Instruct | Open-weight MoE | Response vs. ground truth given retrieved chunks |
| Claude-2.1 | Proprietary LLM | Response vs. ground truth given retrieved chunks |

The paper does not surface final per-model accuracy figures in the abstract or public snippets, so this article does not assert specific numbers. For full numeric results, consult the results tables in the arXiv PDF (2401.15391).

How GPT-4, PaLM, and Llama2-70B behaved on provided evidence

The comparison across GPT-4, PaLM, and Llama2-70B-class models evaluates a specific question: after a retriever has surfaced passages (whether the correct ones or not), do larger and more capable models produce better answers on multi-hop queries? The answer the paper implies is nuanced — model capability helps, but the fundamental difficulty of combining evidence from multiple passages means that even the strongest models evaluated show unsatisfactory performance when retrieval is incomplete.

Pro Tip: The response evaluation on MultiHop-RAG is most informative when run with oracle retrieval — that is, with the ground-truth supporting documents provided directly. If a model still fails to produce the correct answer with perfect evidence supplied, the failure mode is purely in reasoning (cross-document synthesis, coreference, temporal ordering). If a model succeeds with oracle evidence but fails in the end-to-end pipeline, the failure is retrieval. MultiHop-RAG's two-stage design makes this diagnostic split possible; most production RAG evaluations do not.
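
A minimal sketch of that oracle-retrieval diagnostic, assuming the ground-truth passages and answer are available per query and using the same pinned llama-index 0.9.x LLM wrapper; the comparison against ground truth is left to whatever matching metric you already use:

from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")

def oracle_answer(query, gold_passages):
    """Answer the query from the ground-truth evidence only.

    If the model still misses the ground-truth answer here, the failure is
    reasoning (cross-document synthesis); if it succeeds here but fails
    end-to-end, the failure is retrieval.
    """
    context = "\n\n".join(gold_passages)
    prompt = (
        "Answer the question using only the passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm.complete(prompt).text.strip()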

The breadth of the model comparison — spanning proprietary frontier models and open-weight models at 70B scale — strengthens the paper's claim that the problem is not specific to any one model class. Multi-hop evidence synthesis is hard across the board at the scale of models available in early 2024.

How MultiHop-RAG compares with HotpotQA

HotpotQA (Yang et al., 2018) introduced multi-hop QA over Wikipedia with explicit supporting-fact annotations and a joint answer-and-support evaluation format. It demonstrated that "HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions." MultiHop-RAG shares the multi-hop framing but differs in every structural dimension relevant to retrieval-augmented generation.

| Dimension | HotpotQA | MultiHop-RAG |
| --- | --- | --- |
| Knowledge base | Wikipedia articles that are relatively self-contained | English news articles that force cross-document retrieval and integration |
| Primary task framing | Multi-hop QA with supporting-fact supervision | RAG retrieval benchmark with generation as a separate stage |
| Evaluation targets | Answer accuracy + supporting-fact F1 | Retrieval recall over evidence chains + response accuracy |
| Distractor design | Distractor paragraphs included in context | Full-corpus retrieval without pre-filtered distractors |
| Retrieval treatment | Evidence often pre-provided or retrieved from a fixed candidate set | Retrieval is the primary experimental variable and the main failure point |
| Why it matters for RAG | Useful for explainable QA | Sharper stress test for retrieval-centric RAG evaluation |

The structural difference that matters most for RAG practitioners: HotpotQA was designed as a QA reading-comprehension benchmark with retrieval as a secondary concern. MultiHop-RAG was designed with retrieval as the primary experimental variable. If the question is "how well does my retriever handle multi-hop evidence chains?", MultiHop-RAG is the more direct diagnostic.

Where HotpotQA still matters for multi-hop evaluation

HotpotQA remains the reference benchmark for research on explainable multi-hop reasoning — work that requires supporting-fact-level supervision to train or fine-tune models to produce chain-of-thought or highlight-style justifications. The supporting-fact annotations enable training objectives that MultiHop-RAG's design does not support.

Watch Out: HotpotQA and MultiHop-RAG both use static knowledge bases, neither measures latency or cost under realistic query distributions, and neither includes tool-calling orchestration failures. Using either benchmark's results as the primary justification for an architectural shift to agentic RAG is a category error — these benchmarks diagnose retrieval failure modes, they do not validate production system designs. Run them as part of a broader evaluation suite that includes latency profiling and cost-per-correct-answer measurement on your actual data.

What the benchmark does not prove about production agentic RAG

MultiHop-RAG is a benchmark construction and evaluation paper, not a production deployment study. The public repository provides retrieval and answer evaluation scripts pinned to llama-index==0.9.40 for reproducibility. No production-scale latency data, cost-per-query measurements, or orchestration reliability numbers are published in the paper or repository.

The benchmark proves that single-pass retrieval fails on multi-hop queries at a rate high enough to matter. It does not prove that any specific iterative or agentic retrieval architecture is production-ready, cost-effective, or reliably better in enterprise workloads with heterogeneous query distributions.

| What MultiHop-RAG demonstrates | What MultiHop-RAG does not demonstrate |
| --- | --- |
| Single-pass retrieval fails on 2–4-document evidence chains | That iterative retrieval reliably solves this in production |
| Embedding quality alone does not fix the problem | Latency or cost overhead of iterative loops |
| Reasoning failure persists even after retrieval succeeds | Orchestration reliability in agentic loops |
| The failure mode is consistent across model families | Generalization to non-news enterprise corpora |
| Retrieval and reasoning are separable failure modes | Which specific RAG architecture closes the gap |

The repository description frames this scope accurately: "MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines." It is a diagnostic tool, not an architecture recommendation.

Latency, cost, and orchestration trade-offs for iterative retrieval

Iterative retrieval — issuing multiple retrieval calls conditioned on intermediate results — increases the per-query cost and latency of a RAG pipeline. The MultiHop-RAG paper and repository do not publish latency or cost-per-query benchmarks for iterative loops, so the following is a decision framework based on the structural trade-offs, not the paper's numbers:

| Option | Choose this when | Avoid this when |
| --- | --- | --- |
| Single-pass RAG | Queries are predominantly single-hop or close-domain, latency SLAs are tight, and related chunks already cluster in embedding space | The evidence chain regularly spans multiple documents and retrieval misses are common |
| Reranking with LlamaIndex | Initial retrieval finds the right documents but rank order is weak, especially on 2-document chains | The second hop is never retrieved at all |
| Query decomposition | One query can be decomposed into smaller sub-questions without losing intent | The answer depends on intermediate context that the model must carry across hops |
| Iterative retrieval or an agent loop | Evidence chains reliably span 3 or more documents and later queries depend on earlier retrieved context | The query mix is mostly single-hop and latency or reliability budgets are strict |

The public sources do not give a numeric latency threshold or a multiplier for iterative loops, so the decision should be made from your own evidence-chain distribution, not a generalized overhead figure.
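
One way to frame that decision is simple expected-latency arithmetic over your own measurements. Every number below is a placeholder assumption, not a figure from the paper or repository:

# Placeholder arithmetic for routing multi-hop queries to an iterative loop.
# Replace every constant with measurements from your own pipeline and corpus.
single_hop_share = 0.80      # fraction of production queries that are single-hop
retrieval_latency_s = 0.15   # one retrieval call
llm_latency_s = 2.0          # one generation or query-rewrite call
hops = 3                     # retrieval + rewrite rounds in the iterative loop

single_pass = retrieval_latency_s + llm_latency_s
iterative = hops * (retrieval_latency_s + llm_latency_s)

# Expected latency if only multi-hop queries are routed to the iterative loop.
expected = single_hop_share * single_pass + (1 - single_hop_share) * iterative
print(f"single-pass: {single_pass:.2f}s, routed mix: {expected:.2f}s")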

How to reproduce the benchmark with public tooling

The MultiHop-RAG GitHub repository provides runnable Python scripts for retrieval and QA workflows, separate evaluation scripts for retrieval and response quality, and a pinned dependency on llama-index==0.9.40. Reproduction targets standard research hardware — no large-scale training is involved; the compute requirement is inference-only over the news corpus and the 2,556-query set.

This reproduction snippet mirrors the repository's retrieval workflow and is included here only as a compact reference:

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.retrievers import VectorIndexRetriever

# Load the news-article corpus from a local directory.
documents = SimpleDirectoryReader("data/news_articles").load_data()

# Build a vector index and a plain top-k retriever (the single-pass baseline).
index = VectorStoreIndex.from_documents(documents)
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)

# Retrieve the top-k chunks for a multi-hop style query and inspect the scores.
query = "Which company acquired the startup founded by the executive who left OpenAI in 2023?"
nodes = retriever.retrieve(query)
for node in nodes:
    print(node.score, node.node.get_content()[:120])

The command sequence below follows the public repository workflow directly:

$ git clone https://github.com/yixuantt/MultiHop-RAG.git
$ cd MultiHop-RAG
$ pip install -r requirement.txt   # pins llama-index==0.9.40 and dependencies
$ python qa_llama.py               # runs the QA workflow; saves retrieval results
$ python eval_retrieval.py         # scores retrieval against ground-truth evidence sets
$ python eval_answer.py            # scores model responses against ground-truth answers

The separation of eval_retrieval.py and eval_answer.py is the benchmark's key reproducibility feature: you can evaluate your retriever's evidence coverage independently of any language model, then feed oracle evidence into eval_answer.py to isolate reasoning quality.

Minimal hardware and evaluation setup

Running the retrieval and evaluation scripts does not require GPU training infrastructure. The retrieval evaluation (eval_retrieval.py) is CPU-bound over the index and query set. The response evaluation (eval_answer.py) requires API access to the models under test (GPT-4, PaLM, or a locally hosted Llama2-70B-class model) but not local GPU inference for the proprietary models.

# Configuration block for MultiHop-RAG reproduction
corpus_dir: "data/news_articles"        # path to downloaded news article corpus
queries_file: "data/multihop_queries.json"  # 2,556 labeled queries with ground truth
retrieval_results_dir: "toy_data/"      # output dir; repo uses voyage-02_rerank_retrieval_test.json format
embedding_model: "text-embedding-ada-002"  # swap to compare embedding models per Table 5
top_k: 10                               # increase to 15-20 for multi-hop recall improvement
llm_model: "gpt-4"                      # model under test for response evaluation

Standard research hardware — a workstation or cloud instance with 16 GB RAM — is sufficient for indexing and retrieval. Local inference of Llama-2-70b-chat-hf requires at minimum an 80 GB GPU (or two 40 GB A100s with tensor parallelism); using the OpenAI or PaLM APIs eliminates that requirement.

Practical takeaways for engineers designing multi-hop RAG

MultiHop-RAG surfaces two separable failure modes that a basic RAG system, once in production, will hit on a predictable fraction of queries: retrieval failure (the evidence set is never fully recovered) and reasoning failure (the evidence is present but the model cannot synthesize it correctly). The benchmark demonstrates that both failure modes persist across multiple embedding models and multiple LLM families.

Bottom Line: If your production RAG system shows degraded quality on questions that require combining facts from multiple documents, MultiHop-RAG tells you to measure evidence recall before investing in LLM upgrades. If recall is below 70% on 3-document chains, treat the problem as retrieval architecture, not model capability. Use LlamaIndex's sub-question query engine or a retrieval-conditioned reranker as the lowest-overhead first step; escalate to a full agent loop only when evidence chains reliably span 4 or more documents and the answer-correctness lift justifies the added orchestration complexity.

When to add an agent loop, reranker, or query decomposition

The decision to add complexity to a retrieval pipeline should be driven by evidence-chain length in your specific corpus and query distribution, not by benchmark results alone. MultiHop-RAG's 2-to-4-document evidence structure maps directly to four architectural interventions:

| Option | Choose this when | Avoid this when |
| --- | --- | --- |
| Single-pass top-k retrieval | Fewer than 15% of queries require more than one document for a complete answer, and recall@10 is already strong on sampled multi-hop queries | Your query logs show repeated misses on multi-document questions |
| Cross-encoder reranker (LlamaIndex CohereRerank or SentenceTransformerRerank) | The right documents are retrieved but buried below rank 3, and evidence chains are mostly 2 documents | The second hop is not retrieved at all |
| Query decomposition (SubQuestionQueryEngine; see the sketch after this table) | Queries are compositional and can be split into smaller sub-questions without losing intent | The answer depends on intermediate context that must be carried across hops |
| Agent loop / iterative retrieval | Evidence chains span 3 to 4 documents, intermediate retrieval results must inform subsequent queries, and you have validated the lift on your own data | Query latency SLAs are under 1 second or orchestration reliability risk is high |
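
The query-decomposition row maps to LlamaIndex's SubQuestionQueryEngine. Below is a minimal sketch against the 0.9.x API pinned by the repository; the tool name and description are illustrative choices, not values from the MultiHop-RAG scripts.

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

documents = SimpleDirectoryReader("data/news_articles").load_data()
index = VectorStoreIndex.from_documents(documents)

# Wrap the index as a tool; the engine splits a compositional query into
# sub-questions, answers each one against the tool, then synthesizes a final answer.
tools = [
    QueryEngineTool(
        query_engine=index.as_query_engine(similarity_top_k=10),
        metadata=ToolMetadata(
            name="news_corpus",
            description="English news articles used as the knowledge base",
        ),
    )
]

engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query(
    "Which company acquired the startup founded by the executive who left OpenAI in 2023?"
)
print(response)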

Production Note: Use this decision matrix against your own query logs and evidence-chain labels before changing the retrieval stack. The benchmark is most useful as a calibration tool for recall thresholds, reranking, and query decomposition, not as proof that any single orchestration pattern is always superior.

FAQ

What is MultiHop-RAG?

MultiHop-RAG is a benchmark dataset for evaluating retrieval-augmented generation on queries that require combining evidence from multiple documents. It contains 2,556 queries, each with a ground-truth answer and supporting evidence distributed across 2 to 4 documents, drawn from an English news-article knowledge base. The benchmark evaluates retrieval quality and response quality as independent measurements.

How is MultiHop-RAG different from HotpotQA?

HotpotQA is a multi-hop QA benchmark over Wikipedia with supporting-fact supervision, designed primarily for reading comprehension and explainable prediction research. MultiHop-RAG is a retrieval-augmented generation benchmark over a news-article corpus, designed to measure how well a full RAG pipeline — retriever plus language model — handles cross-document evidence chains. HotpotQA often pre-supplies or pre-filters context; MultiHop-RAG makes open-corpus retrieval the primary experimental variable.

Does MultiHop-RAG prove iterative retrieval is better than single-pass RAG?

No. MultiHop-RAG proves that single-pass retrieval fails on multi-hop queries and that better embedding models do not eliminate the gap. It does not evaluate any specific iterative retrieval architecture, so it is evidence that single-pass is insufficient — not proof that any alternative is sufficient or production-ready.

What does MultiHop-RAG evaluate: retrieval or reasoning?

Both, independently. The benchmark splits evaluation into a retrieval phase (did the retriever recover all required documents?) and a response phase (did the model synthesize a correct answer from the retrieved chunks?). This two-stage design is what makes the benchmark useful for diagnosing which part of a RAG pipeline is failing.

Can I reproduce MultiHop-RAG with LlamaIndex?

Yes. The public repository provides retrieval and answer evaluation scripts pinned to llama-index==0.9.40. Clone the repo, install dependencies from requirement.txt, and run qa_llama.py, eval_retrieval.py, and eval_answer.py in sequence on standard research hardware.

Pro Tip: For the clearest interpretation of retrieval results, consult the benchmark table in the paper's Table 5 directly rather than relying on aggregate accuracy numbers from the combined eval. The retrieval and response scores tell different stories; a system can achieve high response accuracy on the fraction of queries where retrieval accidentally succeeded while hiding a structural recall failure on 3-to-4-document evidence chains.

