
Self-RAG vs CRAG in LangGraph: which corrective retrieval pattern fits production RAG?

CRAG is better when retrieval ambiguity is the problem because it adds a lightweight evaluator plus web-search fallback, while Self-RAG is better when you want the model itself to self-reflect through retrieval and support checks — but Self-RAG’s richer control logic usually costs more LLM calls, so the best choice depends on latency budget and how much correction you need.


How we compared Self-RAG and CRAG in LangGraph

Both Self-RAG and CRAG are implemented as LangGraph state machines in the same LangChain engineering post published February 7, 2024. The post frames them explicitly as "flow engineering" examples for self-reflective RAG: "Here, we show that LangGraph can be easily used for 'flow engineering' of self-reflective RAG." That framing matters — both patterns sit above basic RAG on the complexity ladder, and the post treats them as alternative orchestration strategies rather than complementary layers.

At a Glance: Self-RAG is the better fit when answer grounding and multi-hop correction matter more than latency; CRAG is the better fit when retrieval quality, freshness, and a web-search fallback are the primary concerns.

The comparison axes that matter for a production decision are retrieval quality lift, per-query LLM call count, end-to-end latency, implementation complexity, observability, and whether a web-search fallback is structurally required. The official LangChain examples demonstrate both flows but deliberately omit a decision rubric across those dimensions. This article supplies it.

One hard constraint from the source material: no authoritative shared benchmark publishes side-by-side latency or call-count numbers for Self-RAG versus CRAG under a matched workload. Where the table below shows relative ordering rather than absolute numbers, that framing is intentional and honest.

| Criterion | Self-RAG | CRAG |
| --- | --- | --- |
| Retrieval quality lift | High — graded at retrieve, support, and useful stages | Moderate-to-high — graded at retrieval-quality stage only |
| LLM call count per query | Higher — multiple reflection passes per answer | Lower — one evaluator call gates the path |
| End-to-end latency | Higher — reflection loops add inference steps | Lower (no fallback) / higher (with web search) |
| Implementation complexity | Higher — multi-stage control graph with looping | Moderate — linear evaluator → rewrite/fallback → generate |
| Observability surface | Larger — retrieve, grade, support, useful nodes all traceable | Smaller — evaluator score and fallback trigger are the key signals |
| Web-search fallback | Not structurally required | Architecturally central |

Self-RAG vs CRAG at a glance

Self-RAG ("Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection") controls retrieval and generation through reflection tokens that gate four decisions per query: whether to retrieve, whether retrieved passages are relevant, whether the generation is supported by the passage, and whether the final answer is useful. CRAG ("Corrective Retrieval Augmented Generation") takes a narrower approach: a lightweight retrieval evaluator scores the initial passage set and routes the query along one of three paths — proceed, rewrite, or fall back to web search.

The operational distinction is where correction happens. Self-RAG corrects throughout the generation process; CRAG corrects before generation begins. LangGraph encodes both as directed graphs with conditional edges, but the graphs differ significantly in node count and loop structure.

| Dimension | Self-RAG | CRAG |
| --- | --- | --- |
| Correction point | Pre-generation, mid-generation, post-generation | Pre-generation only |
| Primary failure mode addressed | Hallucination and unsupported answers | Weak or stale retrieval |
| Fallback mechanism | Internal re-retrieval loop | External web search |
| LangGraph graph topology | Cyclical with multiple grading nodes | Mostly linear with one decision fork |
| Best fit | Multi-hop QA, strict grounding requirements | Ambiguous queries, fresh-knowledge requirements |

Self-RAG in LangGraph: when reflection loops help

Self-RAG's defining characteristic in a LangGraph implementation is that LLM inference fires multiple times per user query — once to decide whether to retrieve, once to grade passage relevance, once to assess whether the generation is supported by the retrieved passage, and once more to assess whether the answer is useful. LangChain's own evaluation framework confirms the core failure mode this addresses: "Faithfulness evaluation specifically checks whether agent responses can be logically inferred from the provided context." Self-RAG builds that faithfulness check directly into the generation loop rather than bolting it on post-hoc.

Production Note: Each reflection pass is a separate LLM call. For a query that triggers retrieve → grade → generate → support-check → regenerate, you can realistically see 4–6 inference calls where a naive RAG pipeline makes 1. The exact multiplier is workload-dependent and should be measured against your latency budget before committing to Self-RAG. The official LangChain examples are notebook-scale demonstrations; they do not publish normalized cost-per-query figures for Self-RAG versus a baseline.

Pro Tip: Self-RAG's multi-call overhead is most justified when retrieval is not the primary problem — that is, when your vector index (Pinecone, Milvus, or an HNSW-backed store) retrieves plausible passages but the LLM still generates unsupported or hallucinated answers. If your RAGAS faithfulness score is low while recall is adequate, Self-RAG targets the actual failure mode. If recall is the problem, CRAG is a better fit.

How reflection tokens control retrieve, support, and useful decisions

The Self-RAG paper introduces reflection tokens as special control signals that a fine-tuned model learns to emit at inference time. These tokens implement a multi-stage decision process over four questions that the LangGraph flow exposes as conditional edges:

| Decision type | Reflection token type | LangGraph action on failure |
| --- | --- | --- |
| retrieve | Retrieve token | Branch to retrieval node |
| relevance / grade | ISREL token | Loop back and re-retrieve |
| support | ISSUP token | Regenerate with the same or new passage |
| useful | ISUSE token | Loop back to retrieve or halt |

In the LangGraph notebook simplification, these four token types collapse into grading functions that evaluate retrieval relevance and generation faithfulness sequentially. The exact thresholds are implementation-specific and not standardized — the paper establishes the control scheme, but production deployments on Llama 3.1 70B or GPT-4 class models must calibrate graders on their own data.
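To make that control flow concrete, here is a minimal sketch of a Self-RAG-style graph in LangGraph. The node names, the GraphState fields, and the stubbed grader functions are illustrative assumptions — each stub stands in for a separate LLM call — not the official notebook's implementation.

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class GraphState(TypedDict):
    question: str
    documents: List[str]
    generation: str


def retrieve(state: GraphState) -> dict:
    # Vector-store lookup; stubbed here for brevity.
    return {"documents": ["<retrieved passage>"]}


def grade_documents(state: GraphState) -> dict:
    # One LLM call per passage (ISREL): keep only passages graded relevant.
    return {"documents": state["documents"]}


def generate(state: GraphState) -> dict:
    # One LLM call that drafts an answer from the surviving passages.
    return {"generation": "<draft answer>"}


def decide_to_generate(state: GraphState) -> str:
    # Relevance gate: if nothing survived grading, loop back and re-retrieve.
    return "generate" if state["documents"] else "retrieve"


def grade_generation(state: GraphState) -> str:
    # Support (ISSUP) and usefulness (ISUSE) gates, each a separate LLM call
    # in a real implementation; hard-coded here so the sketch runs.
    supported, useful = True, True
    if not supported:
        return "generate"                    # regenerate against the same passages
    return END if useful else "retrieve"     # re-retrieve if the answer is not useful


workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents", decide_to_generate,
    {"generate": "generate", "retrieve": "retrieve"},
)
workflow.add_conditional_edges(
    "generate", grade_generation,
    {"generate": "generate", "retrieve": "retrieve", END: END},
)

app = workflow.compile()
print(app.invoke({"question": "example", "documents": [], "generation": ""}))
```

Each conditional edge corresponds to one of the four decisions above, which is also why the per-query call count grows with every loop iteration that fires.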

Where Self-RAG fits: multi-hop questions and strict grounding

Self-RAG's orchestration overhead pays off when a single retrieval pass cannot synthesize a correct answer — specifically, multi-hop questions where the answer requires chaining evidence across multiple retrieved passages, and compliance or legal contexts where unsupported answers carry real cost. The LangChain agentic-RAG docs characterize this as "an LLM-powered agent decides when and how to retrieve during reasoning," which is the right framing: retrieval is on-demand per reasoning step, not a single pre-generation call.

Pro Tip: Self-RAG is strongest when answer quality matters more than latency and when strict grounding — measured as RAGAS faithfulness — is a hard product requirement. If a 500ms increase in P95 latency is acceptable in exchange for a 15-point faithfulness lift on your evaluation set, Self-RAG's control logic is worth the overhead. If your SLA is sub-300ms, profile the actual call count on your workload first.

CRAG in LangGraph: when retrieval correction is the real problem

CRAG addresses a different failure mode: the initial retrieved documents are weak, ambiguous, or stale. The CRAG paper states directly that "a lightweight retrieval evaluator is to estimate and trigger three knowledge retrieval actions discriminately" and that "large-scale web searches are utilized as an extension for augmenting the retrieval results." The LangGraph implementation encodes this as a linear flow — retrieve, evaluate, branch — before generation starts.

| Evaluator score | CRAG action | LangGraph edge |
| --- | --- | --- |
| High relevance | Proceed to generation | Direct edge to generate node |
| Ambiguous / partial relevance | Rewrite query, re-retrieve | Loop edge to query-rewrite node |
| Low relevance / irrelevant | Fall back to web search | Edge to web-search tool node |

CRAG is the right choice when retrieval ambiguity is the root cause of answer degradation. It adds one evaluator LLM call and, on fallback paths, one external search call — substantially fewer inference passes than Self-RAG's reflection loops.
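The corresponding CRAG topology needs only one fork. The sketch below again uses stubbed node functions; the verdict values and node labels are illustrative assumptions rather than the notebook's exact code, and the single conditional edge after the evaluator is the entire control surface.

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class CragState(TypedDict):
    question: str
    documents: List[str]
    verdict: str        # "relevant" | "ambiguous" | "irrelevant"
    generation: str


# Each stub stands in for one LLM or tool call.
def retrieve(state: CragState) -> dict:
    return {"documents": ["<passage from vector store>"]}

def evaluate_retrieval(state: CragState) -> dict:
    return {"verdict": "relevant"}          # the lightweight evaluator call

def rewrite_query(state: CragState) -> dict:
    return {"question": state["question"] + " (rewritten)"}

def web_search(state: CragState) -> dict:
    return {"documents": ["<web search snippet>"]}

def generate(state: CragState) -> dict:
    return {"generation": "<answer>"}


def route_after_evaluation(state: CragState) -> str:
    # The one decision fork that defines CRAG: pass, rewrite, or fall back.
    if state["verdict"] == "relevant":
        return "generate"
    if state["verdict"] == "ambiguous":
        return "rewrite_query"
    return "web_search"


workflow = StateGraph(CragState)
for name, fn in [
    ("retrieve", retrieve),
    ("evaluate_retrieval", evaluate_retrieval),
    ("rewrite_query", rewrite_query),
    ("web_search", web_search),
    ("generate", generate),
]:
    workflow.add_node(name, fn)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "evaluate_retrieval")
workflow.add_conditional_edges(
    "evaluate_retrieval", route_after_evaluation,
    {"generate": "generate", "rewrite_query": "rewrite_query", "web_search": "web_search"},
)
workflow.add_edge("rewrite_query", "retrieve")   # ambiguous: re-retrieve with the rewrite
workflow.add_edge("web_search", "generate")      # irrelevant: answer from web results
workflow.add_edge("generate", END)

app = workflow.compile()
print(app.invoke({"question": "example", "documents": [], "verdict": "", "generation": ""}))
```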

How the retrieval evaluator gates weak or irrelevant passages

CRAG uses a lightweight retrieval evaluator before generation, and that evaluator routes each query through three actions: pass, rewrite, or fallback. If the evaluator scores the initial passage set as high-quality, no additional retrieval happens and generation proceeds immediately — this is the happy path. If scoring is ambiguous, CRAG rewrites the query and re-retrieves from the same vector store; this path adds one rewrite LLM call and one additional retrieval. If scoring is definitively poor, CRAG discards the vector store results and issues a web search.

| Path | Retrieval quality signal | CRAG action |
| --- | --- | --- |
| Pass | High relevance | Proceed to generation |
| Rewrite | Ambiguous / partial relevance | Rewrite query and re-retrieve |
| Fallback | Low relevance / irrelevant | Fall back to web search |

The score thresholds that separate these three paths are implementation details not standardized in the official LangGraph notebook. In practice, you must calibrate the evaluator against your corpus using your own retrieval hit-rate data before relying on the defaults.
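One way to make the evaluator calibratable is to have it emit a numeric score through structured output and keep the thresholds as parameters. The sketch below assumes an OpenAI chat model via langchain-openai; the prompt wording, the 0.7/0.3 cut-offs, and the model name are placeholder assumptions to tune against your own hit-rate data.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class RetrievalGrade(BaseModel):
    """Evaluator verdict for a single retrieved passage."""
    relevance: float = Field(description="Relevance of the passage to the question, 0.0 to 1.0")


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
grader = ChatPromptTemplate.from_messages([
    ("system", "Score how relevant the passage is to the question, from 0.0 to 1.0."),
    ("human", "Question: {question}\n\nPassage: {passage}"),
]) | llm.with_structured_output(RetrievalGrade)


def route(question: str, passages: list[str], upper: float = 0.7, lower: float = 0.3) -> str:
    # Score each passage, then map the best score onto CRAG's three actions.
    scores = [grader.invoke({"question": question, "passage": p}).relevance for p in passages]
    best = max(scores, default=0.0)
    if best >= upper:
        return "generate"        # pass
    if best >= lower:
        return "rewrite_query"   # ambiguous
    return "web_search"          # fallback
```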

Why web-search fallback matters for ambiguous or stale queries

Web-search fallback is CRAG's core architectural differentiator. For queries about recent events, rapidly changing technical specifications, or topics where your internal knowledge base is sparse, the fallback converts a retrieval failure into a viable answer path. The CRAG paper's abstract frames this as solving the problem of "static and limited corpora" that "yield sub-optimal documents."
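As one concrete wiring, the fallback node can wrap a hosted search tool such as Tavily from langchain-community. The `max_results` argument and the result schema (a list of dicts with a `content` key) reflect current releases but should be verified against your installed version, and any other search API slots in the same way.

```python
from langchain_community.tools.tavily_search import TavilySearchResults

web_search_tool = TavilySearchResults(max_results=3)


def web_search(state: dict) -> dict:
    # Replace the low-relevance vector-store passages with fresh web snippets.
    results = web_search_tool.invoke({"query": state["question"]})
    return {"documents": [r["content"] for r in results]}
```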

Pro Tip: Web-search fallback earns its cost when your query distribution includes a meaningful fraction of freshness-sensitive or out-of-distribution questions. If more than 10–15% of your production queries fall outside your corpus's coverage window, CRAG's fallback can deliver faithfulness lift that reranking alone cannot provide.

Watch Out: Web-search fallback adds variable latency and introduces external vendor dependency. A 2026 reproduction analysis of the original CRAG implementation noted that it relies on the Google Search API, proprietary LLaMA-2 fine-tuned weights, and deprecated OpenAI API calls, signaling that production deployments must validate their search integration independently. If your queries are not freshness-sensitive, the fallback may introduce latency with no quality benefit over reranking the existing passage set.

Benchmarking the trade-off: retrieval quality, latency, and LLM calls

No authoritative source publishes a direct, matched-workload benchmark comparing Self-RAG and CRAG on the same dataset with the same base model — this is a genuine gap in the published record. The table below presents a benchmark template using the evaluation dimensions that LangChain's Ragas + LangSmith evaluation guide recommends: faithfulness, relevancy, and recall. Operators running either pattern should populate these columns against their own evaluation set using LangSmith; the relative ordering reflects the architectural trade-offs, not a single controlled experiment.

| Metric | Naive RAG (baseline) | CRAG (happy path) | CRAG (with fallback) | Self-RAG |
| --- | --- | --- | --- | --- |
| Answer faithfulness | baseline | template | template | template |
| Retrieval hit rate | baseline | template | template | template |
| LLM calls per query | 1 | 2 | 3 | 4–6 |
| Relative latency vs. naive RAG | 1× (reference) | template | template | template |

Production Note: These relative orderings hold under typical production conditions, but the absolute numbers depend on your base model, vector index performance, and whether reflection loops actually fire. LangChain explicitly notes that "Ragas provides you with a few insightful metrics, it does not help you in the process of continuously evaluation of your QA pipeline in production — but this is where LangSmith comes in." Treat any single benchmark number as a starting point, not a deployment guarantee.
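To populate the faithfulness, relevancy, and recall columns, a minimal Ragas run looks like the sketch below. It assumes the older column schema used in the LangChain Ragas guide (question, answer, contexts, ground_truth); newer Ragas releases rename these fields, so check your installed version before running it.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

# One row per evaluation example; in practice, export these from your pipeline runs.
rows = {
    "question": ["What does CRAG's retrieval evaluator do?"],
    "answer": ["It scores retrieved passages and routes to pass, rewrite, or web search."],
    "contexts": [["CRAG uses a lightweight retrieval evaluator before generation ..."]],
    "ground_truth": ["The evaluator gates weak retrievals and triggers corrective actions."],
}

scores = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(scores)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., 'context_recall': ...}
```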

What to measure in production RAG evaluations

LangChain's evaluation guidance centers three metrics: "Measure faithfulness, relevancy, and recall to build reliable QA systems." For corrective retrieval patterns specifically, a fourth metric matters operationally: correction trigger rate — the fraction of queries that activate the evaluator's rewrite or fallback path (CRAG) or any reflection loop (Self-RAG). A high trigger rate signals corpus coverage problems that neither pattern alone can fix.

| Metric | What it measures | Unit / note | Which pattern it stresses |
| --- | --- | --- | --- |
| Answer faithfulness | Whether the response is grounded in retrieved context | score / % | Self-RAG (support-check loop) |
| Retrieval relevance | Whether retrieved passages match the query | score / % | CRAG (evaluator gate) |
| Recall | Whether the correct answer appears in retrieved passages | score / % | Both equally |
| Correction trigger rate | Fraction of queries activating correction logic | % of queries | Diagnostic for both |
| TTFT / P95 latency | End-to-end response time at production load | ms | Self-RAG (more sensitive due to call count) |

LangSmith's continuous evaluation capability exists precisely to prevent overfitting to a single benchmark. Run both patterns on a sample of your actual production traffic — not a benchmark proxy — before committing to one. Pay particular attention to fallback hit rate for CRAG and loop iteration count for Self-RAG; these are the leading indicators of operational cost. A 2026 reproduction analysis of the original CRAG paper also noted that implementation discrepancies (paid search APIs, deprecated model weights) can make published accuracy numbers non-reproducible, reinforcing the case for evaluating on your own stack.

Decision matrix: which corrective retrieval pattern should you ship?

Neither Self-RAG nor CRAG is universally better. The CRAG paper describes Self-RAG as a method to "self-correct the results of retriever," and Self-RAG's design uses "reflection tokens" to control retrieval and generation quality. These are complementary solutions to different problems. The decision turns on four variables: latency budget, whether retrieval ambiguity or answer synthesis is the primary failure mode, freshness dependence, and how much orchestration complexity your team can instrument and maintain.

| Condition | Recommended pattern | Rationale |
| --- | --- | --- |
| Latency budget > 3× naive RAG AND faithfulness is the KPI | Self-RAG | Reflection loops target answer grounding directly |
| Latency budget 1.5–2.5× naive RAG AND retrieval relevance is the failure mode | CRAG (no fallback path) | One evaluator call is enough |
| Queries are freshness-sensitive or out-of-corpus | CRAG (with web fallback) | Web search handles coverage gaps |
| Multi-hop reasoning across multiple sources required | Self-RAG | On-demand retrieval per reasoning step |
| Team has limited LangSmith instrumentation | CRAG | Fewer trace nodes to monitor |
| Latency budget ≤ naive RAG AND corpus is high-quality | Neither — use reranking | Corrective overhead not justified |

Choose Self-RAG when control logic and support checks matter most

Self-RAG is the right choice when the primary production complaint is that answers sound plausible but are not grounded — the hallucination problem that LangChain's evaluation framework describes as models generating "plausible-sounding text regardless of factual grounding." The support-check loop gates every answer against the retrieved evidence before the response reaches the user.

| Workload signal | Self-RAG fit |
| --- | --- |
| RAGAS faithfulness below target despite adequate recall | Strong fit |
| Multi-hop questions requiring chained evidence | Strong fit |
| Compliance, legal, or medical contexts where unsupported answers carry cost | Strong fit |
| High-quality, multi-hop workloads with higher orchestration tolerance | Strong fit |
| Latency SLA under 500ms P95 | Poor fit without aggressive caching |
| Primary failure mode is stale or missing knowledge | Poor fit — use CRAG |

Choose CRAG when ambiguity and freshness dominate

CRAG addresses the case where the vector store itself is the problem — the CRAG paper explicitly targets "sub-optimal documents" from "static and limited corpora." If your production query logs show a high fraction of low-relevance retrievals or freshness failures (queries about events after your index cutoff), CRAG's evaluator-plus-fallback pattern corrects the right failure mode with fewer inference calls than Self-RAG.

| Workload signal | CRAG fit |
| --- | --- |
| High rate of low-relevance retrievals from vector store | Strong fit |
| Queries about recent events or rapidly changing topics | Strong fit (with web fallback) |
| Freshness-sensitive or low-relevance retrieval workloads | Strong fit (with web fallback) |
| Latency budget tighter than Self-RAG allows | Moderate fit (happy path is fast) |
| Corpus is current and high-quality | Weak fit — fallback will rarely fire |
| Primary failure mode is answer hallucination, not retrieval | Weak fit — use Self-RAG |

Where neither pattern is worth the overhead

If a reranker applied to an existing HNSW-backed Pinecone or Milvus index already meets your faithfulness and recall targets, neither CRAG nor Self-RAG justifies the overhead. LangChain's retrieval documentation describes agentic RAG as one point on a spectrum; basic retrieval remains appropriate when the model does not need to decide dynamically. The first step before adopting either corrective pattern is to measure RAGAS faithfulness, relevancy, and recall on your baseline — "Measure faithfulness, relevancy, and recall to build reliable QA systems" — and confirm that the baseline actually fails on your production distribution.
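Before reaching for either corrective pattern, it is worth measuring a plain reranking baseline. The sketch below uses a cross-encoder from sentence-transformers; the checkpoint name is a common public model chosen for illustration, not a recommendation from the LangChain examples.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, passages: list[str], top_k: int = 4) -> list[str]:
    # Score every (query, passage) pair and keep the highest-scoring passages.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```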

Watch Out: Corrective retrieval patterns are not free upgrades. Self-RAG can realistically multiply your LLM costs 4–6× per query depending on how often reflection loops fire. CRAG with web fallback adds external API dependency and variable latency. If your baseline RAG plus a cross-encoder reranker passes your quality bar, shipping either corrective pattern adds cost and operational surface without a return.

What the official LangGraph examples leave out

The LangChain agentic RAG blog post is the canonical reference for both patterns and covers their graph structure clearly. What it omits is a production decision framework: there is no guidance on which pattern to choose given a latency budget, no LLM call count comparison under matched workloads, no observability recommendation for distinguishing retrieval failures from synthesis failures, and no cost model for web-search fallback versus reflection loops.

| Coverage dimension | Official LangChain blog | Most derivative guides |
| --- | --- | --- |
| Graph structure for each pattern | ✅ Full | ✅ Partial |
| Decision criteria for choosing a pattern | ❌ Absent | ❌ Absent |
| LLM call count comparison | ❌ Absent | ❌ Absent |
| Operational cost model | ❌ Absent | ❌ Absent |
| Web-search fallback trade-offs | ⚠️ Mentioned, not quantified | ❌ Absent |
| LangSmith observability guidance | ⚠️ Referenced | ❌ Absent |

The post's own framing — "LangGraph can be easily used for 'flow engineering' of self-reflective RAG" — positions it as a flow-engineering tutorial, not a production deployment guide. The gap is operational, not conceptual.

How competitor guides frame only one side of the trade-off

Most derivative explainers for Self-RAG and CRAG cover one pattern per article, naming the architecture and walking through graph nodes without comparing operational cost. A CRAG-only explainer will present the evaluator-plus-fallback flow as the correct approach to corrective RAG; a Self-RAG-only explainer will present reflection tokens as the correct approach. Neither establishes the conditions under which the other pattern is preferable.

| Coverage type | Self-RAG coverage | CRAG coverage | Decision framework |
| --- | --- | --- | --- |
| Official LangChain blog | ✅ Full | ✅ Full | ❌ Absent |
| Typical derivative tutorials | Partial | Partial | ❌ Absent |
| arXiv papers (per-pattern) | ✅ (arXiv:2310.11511) | ✅ (arXiv:2401.15884) | ❌ Absent |
| This article | ✅ | ✅ | ✅ |

The actionable content gap is not explanation of how each pattern works — that material exists — but the comparative decision framework across retrieval quality, latency, LLM call count, and freshness handling.

Why LangSmith observability changes the production decision

LangSmith is not optional instrumentation for corrective retrieval patterns — it is the mechanism that makes the production decision legible. The LangChain evaluation blog is direct: "Ragas provides you with a few insightful metrics, it does not help you in the process of continuously evaluation of your QA pipeline in production — but this is where LangSmith comes in."

Production Note: For Self-RAG, the signals to trace are: how often does the retrieve decision fire, how often does the support check fail and trigger a regeneration loop, and what is the P95 loop iteration count per query. For CRAG, the signals are: evaluator score distribution, rewrite trigger rate, and fallback hit rate. Trace retries, evaluator scores, fallback hits, and loop counts in LangSmith so you can distinguish a retrieval coverage problem from an answer synthesis problem. LangSmith's per-run tracing exposes all of these as inspectable nodes, making it possible to tune evaluator thresholds and reflection graders against real production traffic rather than benchmark proxies.
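A hedged sketch of pulling those leading indicators out of LangSmith traces is below. It assumes the langsmith SDK's Client.list_runs API, a project named "rag-comparison", and a tool node literally named "web_search" — all assumptions to adapt to your own tracing setup.

```python
from collections import Counter, defaultdict

from langsmith import Client

client = Client()

# Pull the runs for the experiment project once, then group them by trace locally.
llm_calls_per_trace = Counter()
tool_names_per_trace = defaultdict(set)

for run in client.list_runs(project_name="rag-comparison"):
    if run.run_type == "llm":
        llm_calls_per_trace[run.trace_id] += 1          # per-query call count driver
    elif run.run_type == "tool":
        tool_names_per_trace[run.trace_id].add(run.name)  # fallback / search nodes

traces = set(llm_calls_per_trace) | set(tool_names_per_trace)
if traces:
    avg_calls = sum(llm_calls_per_trace.values()) / len(traces)
    fallback_rate = sum("web_search" in tool_names_per_trace[t] for t in traces) / len(traces)
    print(f"avg LLM calls per query: {avg_calls:.2f}")
    print(f"fallback hit rate: {fallback_rate:.2%}")
```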

FAQ on Self-RAG vs CRAG in LangGraph

What is the difference between Self-RAG and CRAG?

Self-RAG uses reflection tokens to make four decisions per query (whether to retrieve, whether passages are relevant, whether the generation is supported, whether the answer is useful) and applies correction throughout the generation process. CRAG uses a lightweight retrieval evaluator to score the initial retrieved passages and routes the query to one of three paths — proceed, rewrite, or fall back to web search — before generation starts. Self-RAG corrects answer synthesis; CRAG corrects retrieval quality.

Is Self-RAG better than CRAG?

Neither pattern dominates. Self-RAG is better when answer faithfulness and hallucination prevention are the primary quality requirements. CRAG is better when retrieval ambiguity, sparse coverage, or stale knowledge is the primary failure mode. The choice depends on where your baseline RAG pipeline actually fails.

Does Self-RAG require more LLM calls?

Yes. A single Self-RAG query can require 4–6 LLM calls depending on how many reflection loops fire. CRAG's happy path requires 2 calls (evaluator + generator); its fallback path requires 3 (evaluator + rewriter + generator after web search). The source material does not publish a universal multiplier because the exact count depends on how often correction actually triggers on a given workload.

When should you use CRAG?

Use CRAG when your vector store returns low-relevance passages on a meaningful fraction of production queries, when your queries are freshness-sensitive (events after your index cutoff), or when your latency budget cannot absorb Self-RAG's reflection loops. CRAG's web-search fallback targets coverage and freshness problems that reranking cannot solve.

How does LangGraph implement agentic RAG?

LangGraph encodes both Self-RAG and CRAG as directed state machines where nodes are LLM calls or tool invocations and edges are conditional on evaluator or grader outputs. Self-RAG's graph is cyclical with multiple grading nodes; CRAG's graph is mostly linear with one conditional branch. LangChain positions LangSmith as the continuous evaluation layer for both.

Pro Tip: If you are unsure which pattern fits your workload, run both on a 100-query sample from your production logs using LangSmith to capture per-trace call counts, evaluator scores, and faithfulness ratings. The comparison that matters is against your data, not against a benchmark proxy. Consult the decision matrix above to map what you find to the right production choice.
