At a glance: what you need to reproduce MAD-M^2
At a Glance: Time: 30–90 minutes depending on model download speed · Prereqs: Linux environment, Python 3.10+, CUDA-capable GPU, Hugging Face account · Hardware: Not specified in the paper or repo · Dataset access: processed_data/ is already prepared, and you can also download datasets with python download_datasets.py --dataset_name=$DATASET_NAME --dataset_dir=./data · Cost: Model weights are free on Hugging Face; compute costs depend on your infrastructure
Reproducing MAD-M^2 requires four concrete things: a pinned Python environment built around vLLM 0.6.3 for LLM inference, one of the four paper checkpoints downloaded from Hugging Face, a model_path entry in configs.yaml, and either the repo's pre-prepared processed_data/ directory or a fresh run of download_datasets.py. The multi-agent orchestration logic lives in multi_agent_debate.py, which constructs a MultiAgentDebate instance from num_agents, max_round, and prune_strategy arguments. Nothing in the stack is exotic, but version discipline matters: mixing newer vllm or transformers releases will silently break compatibility.
The MAD-MM GitHub repository, which accompanies the paper accepted at ICLR 2026, ships with pre-processed benchmark data and a clear configuration contract. The core claim is that masking erroneous memories at the start of each debate round, rather than carrying forward the full prior trace, makes multi-agent reasoning more robust across mathematical and logical benchmarks.
What MAD-M^2 changes in multi-agent debate
Standard multi-agent debate (MAD) lets several LLM agents read each other's previous answers and iteratively converge on a solution. The problem is that previous rounds can contain wrong reasoning steps, and every subsequent agent sees that noise as if it were signal. MAD-M^2 adds a single targeted intervention: before an agent processes round $r$, the system filters the memories from round $r-1$, retaining only those assessed as informative and discarding those assessed as erroneous.
As stated in the arXiv abstract: "MAD-M^2 can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories." The method is not a full state reset — agents still receive prior-round context, just a curated subset of it. That distinction matters for multi-agent orchestration: the system remains iterative and context-aware, not amnesiac.
Pro Tip: Memory masking is selective, not ablative. An agent that sees only good prior reasoning can build on it coherently. An agent that sees a full trace polluted by incorrect steps tends to be anchored to those steps — even when it would have reasoned correctly from scratch. The masking step breaks that anchoring without losing the useful debate signal that accumulates across rounds.
The paper reports experiments across mainstream mathematical and logical reasoning benchmarks and concludes that MAD-M^2 consistently outperforms baseline MAD. The improvement is attributed directly to the reduction in erroneous context propagation.
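The filtering idea described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the Memory record, the mask_memories function, and the has_answer criterion are all hypothetical names, and the toy criterion (an empty answer counts as erroneous) stands in for whatever the selected pruning strategy actually checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Memory:
    """One agent's response from the previous round (hypothetical record type)."""
    agent_id: int
    text: str
    answer: str


def mask_memories(memories: List[Memory],
                  is_informative: Callable[[Memory], bool]) -> List[Memory]:
    """Keep only the memories the criterion accepts; drop the rest."""
    return [m for m in memories if is_informative(m)]


def has_answer(memory: Memory) -> bool:
    """Toy criterion (assumption): a memory without a final answer is erroneous."""
    return memory.answer.strip() != ""


previous_round = [
    Memory(0, "2 + 2 = 4, so the answer is 4.", "4"),
    Memory(1, "I could not finish the derivation.", ""),
    Memory(2, "Adding the two terms gives 4.", "4"),
]
kept = mask_memories(previous_round, has_answer)
print([m.agent_id for m in kept])  # prints [0, 2]
```

The agents in the next round would receive only the two retained memories, which is the "curated subset" behavior the abstract describes.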
Why erroneous memories hurt debate quality
When baseline MAD passes the full prior debate trace to each agent, wrong intermediate steps get treated as authoritative context. LLMs are prone to anchoring on in-context examples, so an agent receiving a flawed chain-of-thought from a previous round is more likely to reproduce or extend that error than to correct it — even when the correct reasoning path was available. The paper frames this directly: "Erroneous memories remain and can degrade performance."
The repo's three pruning strategies — naive, subjective, and objective — represent different theories of which memories qualify as erroneous. This design implies that memory quality is not a solved problem; it is managed per-run via strategy selection.
Watch Out: If you pass an unfiltered debate trace to the next round, you are not running MAD-M^2; you are running baseline MAD. The entire performance claim depends on the masking step executing correctly at the start of each round. Confirm your prune_strategy is set to a non-naive value if you intend to reproduce the paper's masked condition, and verify that the strategy is actually filtering context rather than passing through silently.
Where masking happens in the round lifecycle
Masking executes at the beginning of each debate round, before agents receive their input context for that round. The MultiAgentDebate constructor encodes this lifecycle through max_round: the system iterates over rounds, and at each round boundary the selected pruning strategy filters the prior trace. The repo confirms this placement explicitly: "mask erroneous memories from the previous debate round at the beginning of each debate round." (MAD-MM README)
The entry point for a round-based run is multi_agent_debate.py. A minimal invocation looks like:
$ python multi_agent_debate.py \
--dataset MATH \
--num_agents 3 \
--max_round 2 \
--prune_strategy objective
The num_agents and max_round flags are the primary experiment controls for multi-agent orchestration. The prune_strategy flag is what distinguishes MAD-M^2 from baseline MAD.
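To make the round lifecycle concrete, the following sketch shows masking applied at the start of every round, before any agent generates. It is a simplified model under stated assumptions: run_debate, make_agent, and drop_flagged are hypothetical names, the agents are stubs rather than LLM calls, and the "WRONG" marker is a stand-in for a real pruning criterion.

```python
from typing import Callable, Dict, List

# An agent maps (question, visible memories) to a response string.
Agent = Callable[[str, List[str]], str]


def run_debate(question: str, agents: List[Agent], max_round: int,
               mask: Callable[[List[str]], List[str]]) -> List[Dict]:
    """Run max_round rounds, masking the previous round's responses first."""
    history: List[Dict] = []
    memories: List[str] = []
    for round_idx in range(max_round):
        # Masking happens at the round boundary, before any agent runs.
        visible = mask(memories)
        responses = [agent(question, visible) for agent in agents]
        history.append({"round": round_idx, "visible": visible,
                        "responses": responses})
        memories = responses
    return history


def make_agent(name: str, first_answer: str) -> Agent:
    """Stub agent: fixed first answer, then a note on how much context it saw."""
    def agent(question: str, visible: List[str]) -> str:
        if not visible:
            return first_answer
        return f"{name} revises after reading {len(visible)} memories"
    return agent


def drop_flagged(memories: List[str]) -> List[str]:
    """Toy mask (assumption): discard any memory containing the marker WRONG."""
    return [m for m in memories if "WRONG" not in m]


agents = [make_agent("a0", "answer 4"),
          make_agent("a1", "WRONG answer 5"),
          make_agent("a2", "answer 4")]
history = run_debate("2 + 2 = ?", agents, max_round=2, mask=drop_flagged)
print(history[1]["visible"])  # prints ['answer 4', 'answer 4']
```

Swapping drop_flagged for an identity function reproduces the unfiltered behavior, which is a convenient way to see what the masking step changes in the trace.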
Repository map: files, configs, and dataset paths
The repo's reproduction surface is intentionally compact. The four locations you need to understand before touching any command are configs.yaml, processed_data/, multi_agent_debate.py, and download_datasets.py. Every other file is either a utility or test infrastructure.
The repository instructions are explicit: "All datasets have been prepared in the processed_data folder. Meanwhile, you can also download these datasets by running: python download_datasets.py --dataset_name=$DATASET_NAME --dataset_dir=./data." (MAD-MM repo) For LLM inference, the model loading path runs through configs.yaml, not a command-line flag: "Then, you can copy the path to modify the model_path variable in the config.yaml file so that the model can be loaded when running the code." (MAD-MM repo) The practical instruction is simple: copy the Hugging Face path into the config file so the model can be loaded when running the code.
Production Note: The repo's scraped snippets reference both config.yaml and configs.yaml in different locations. Before writing your model path, run ls *.yaml at the repo root to confirm the exact filename in the working tree. Writing to the wrong file produces a silent load failure rather than an explicit error.
The primary code path for multi-agent orchestration is:
multi_agent_debate.py ← experiment entry point
configs.yaml ← model_path and experiment config
processed_data/ ← pre-prepared benchmark inputs
download_datasets.py ← dataset fetch utility
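A short preflight can confirm that the four locations exist before any command is run. This is a sketch only: check_repo_layout is a hypothetical helper, and it checks for both config filenames mentioned above because the source does not settle which one the working tree uses.

```python
from pathlib import Path
from typing import Dict

# Paths named in this guide; "file" or "dir" is the kind each should be.
EXPECTED = {
    "multi_agent_debate.py": "file",
    "download_datasets.py": "file",
    "processed_data": "dir",
}
# Both spellings appear in the repo text, so look for either.
CONFIG_CANDIDATES = ("configs.yaml", "config.yaml")


def check_repo_layout(root: str) -> Dict[str, object]:
    """Report which expected paths exist under root and which config files are present."""
    base = Path(root)
    report: Dict[str, object] = {}
    for name, kind in EXPECTED.items():
        path = base / name
        report[name] = path.is_file() if kind == "file" else path.is_dir()
    report["config_files"] = [c for c in CONFIG_CANDIDATES if (base / c).is_file()]
    return report


# Run from the repo root; any False value or an empty config_files list needs attention.
print(check_repo_layout("."))
```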
What multi_agent_debate.py controls
multi_agent_debate.py is the single experiment controller. It parses CLI arguments and passes them directly into the MultiAgentDebate class. The authoritative snippet from the repository shows:
mad = MultiAgentDebate(
    agent,
    dataset_name=args.dataset,
    num_agents=args.num_agents,          # number of debating agents
    max_round=args.max_round,            # total debate rounds per problem
    prune_strategy=args.prune_strategy,  # naive | subjective | objective
    strict=args.strict,                  # strictness of answer extraction
    only_voting=args.only_voting,        # skip debate, use voting only
)
For multi-agent orchestration at paper scale, num_agents=3 and max_round=2 are the reference settings. strict controls answer extraction rigor and only_voting disables the full debate loop in favor of a voting-only baseline. Confirm any CLI flags beyond those shown in this snippet directly in the repo before building automation on top of them.
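If you wrap the script in your own tooling, an argument parser that mirrors the attributes read in the snippet (args.dataset, args.num_agents, and so on) is a reasonable starting point. The sketch below is hypothetical: the defaults, the choices list, and the decision to model strict and only_voting as boolean switches are assumptions, not the repo's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the debate CLI; confirm every flag against the repo."""
    parser = argparse.ArgumentParser(description="Sketch of the debate CLI")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--num_agents", type=int, default=3)  # assumed default
    parser.add_argument("--max_round", type=int, default=2)   # assumed default
    parser.add_argument("--prune_strategy", default="naive",
                        choices=["naive", "subjective", "objective"])
    # Assumed to be boolean switches; the repo may define them differently.
    parser.add_argument("--strict", action="store_true")
    parser.add_argument("--only_voting", action="store_true")
    return parser


args = build_parser().parse_args(["--dataset", "MATH",
                                  "--prune_strategy", "objective"])
print(vars(args))
```

Restricting --prune_strategy to the three known names makes a typo fail at parse time instead of surfacing later as an unexpected debate condition.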
How configs.yaml connects model_path to Hugging Face downloads
The model_path field in configs.yaml is the single connection between the repo and the LLM inference backend. Set it to a local checkpoint directory or a Hugging Face model identifier. When vLLM receives a path that does not resolve to a valid local directory, it falls back to downloading from the Hugging Face Hub automatically — a behavior documented in the Qwen vLLM deployment guide: "By default, if the model does not point to a valid local directory, it will download the model files from the Hugging Face Hub."
A minimal configs.yaml entry looks like:
model_path: "Qwen/Qwen2.5-7B-Instruct" # resolves via HF Hub
# or, for a locally cached checkpoint:
# model_path: "/data/models/Qwen2.5-7B-Instruct"
Pro Tip: For authenticated or gated model downloads, run huggingface-cli login before launching any experiment. Pointing at a remote identifier inside a long multi-agent run means a mid-run network failure can corrupt your results silently.
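Because a non-local value falls through to a Hub download, it helps to know in advance how a given model_path will be treated. The helper below is a hypothetical sketch that applies the rule quoted above (a valid local directory is loaded from disk, anything else is treated as a Hub identifier) and additionally flags values that look like filesystem paths but do not exist, which is usually a typo.

```python
import os


def describe_model_path(model_path: str) -> str:
    """Classify a model_path value as 'local', 'missing-local', or 'hub'."""
    if os.path.isdir(model_path):
        return "local"          # loaded directly from disk
    if os.path.isabs(model_path) or model_path.startswith((".", "~")):
        return "missing-local"  # looks like a path, but no such directory
    return "hub"                # would be resolved against the Hugging Face Hub


for candidate in ("Qwen/Qwen2.5-7B-Instruct", "/data/models/Qwen2.5-7B-Instruct"):
    print(candidate, "->", describe_model_path(candidate))
```

A "missing-local" result is the case to fix before launching a run, since the backend would otherwise attempt to interpret the string as a remote identifier.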
Set up the Python environment and dependencies
Use a dedicated virtual environment. Do not install the MAD-MM stack into a shared environment — the pinned versions of vllm, transformers, and datasets are specific enough that they will conflict with most existing ML setups.
$ python -m venv madm2_env
$ source madm2_env/bin/activate
$ pip install --upgrade pip
Install the pinned packages without version drift
The repository specifies exact versions for every critical dependency. Install them with explicit pins to prevent any resolver from silently upgrading to a breaking version:
$ pip install \
datasets==3.1.0 \
latex2sympy2==1.9.1 \
regex==2024.11.6 \
sympy==1.13.1 \
tqdm==4.67.0 \
vllm==0.6.3 \
huggingface_hub==0.27.0 \
transformers==4.46.2 \
PyYAML \
numpy \
torch
vllm==0.6.3 and transformers==4.46.2 are the two packages most likely to cause breakage if substituted with current releases. The latex2sympy2==1.9.1 and sympy==1.13.1 pins are required for mathematical expression parsing in the benchmark evaluation loop. Do not substitute newer vllm or transformers versions if you want to match the paper's exact run conditions.
Verify vLLM and Transformers imports before running debates
These import checks are a best-practice preflight rather than an official repository-provided script. Before launching any experiment, confirm the three core libraries import cleanly in the pinned environment. An import error here saves you from a confusing mid-run failure after model loading begins.
import vllm # should import without CUDA errors
import transformers # version should match 4.46.2
import datasets # version should match 3.1.0
print(f"vllm: {vllm.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"datasets: {datasets.__version__}")
A successful preflight produces three version strings matching the pinned requirements. If vllm raises a CUDA or shared-library error on import, the issue is almost always a CUDA toolkit mismatch with the installed torch version — resolve that before proceeding to model loading or LLM inference.
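The same check can be done without importing the heavy libraries by reading installed package metadata. This is an optional sketch, not a repository script: find_pin_mismatches and installed_version are hypothetical helpers, and the comparison is an exact string match, so a build with a local version suffix would be reported as a mismatch.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Pins taken from the install command in this guide.
PINS = {
    "vllm": "0.6.3",
    "transformers": "4.46.2",
    "datasets": "3.1.0",
    "huggingface_hub": "0.27.0",
    "sympy": "1.13.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched package to (wanted, found)."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_pin_mismatches(PINS).items():
    print(f"{package}: wanted {wanted}, found {found}")
```

An empty result means every listed pin matches; it does not cover the CUDA and torch compatibility issue described above, which still only shows up on import.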
Prepare datasets and benchmark inputs
The repo ships pre-processed benchmark data in processed_data/, so you do not need to build datasets from scratch for a standard reproduction run. The repository states plainly: "All datasets have been prepared in the processed_data folder." (MAD-MM) If you want to use a different dataset split or refresh the data, run the download utility:
$ python download_datasets.py \
--dataset_name=$DATASET_NAME \
--dataset_dir=./data
Replace $DATASET_NAME with the benchmark identifier matching the paper's evaluation tasks. The --dataset_dir flag sets the output directory; confirm the repo's data loader reads from this path or update the path reference in configs.yaml accordingly. Mismatches between the directory the downloader writes to and the directory the loader reads from produce file-not-found errors that can look like dataset corruption.
Use processed_data/ when the benchmark files are already present
If you cloned the repo and processed_data/ is populated, you can skip the download step entirely and point the experiment script at that directory. This is the fastest path to a working run for standard reproduction.
Production Note: Benchmark directory names inside processed_data/ must match exactly what the data loader expects. If you rename or reorganize subdirectories (for example, to store multiple experimental conditions), update the corresponding path references before running. The loader does not fail gracefully on mismatched names; it will either throw a FileNotFoundError or, worse, silently load an empty dataset.
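A guard like the following turns the silent empty-dataset case into an explicit error before the run starts. It is a sketch under an assumption the source does not confirm: that each benchmark lives in its own subdirectory named after the dataset. The dataset_files helper is a hypothetical name.

```python
from pathlib import Path
from typing import List


def dataset_files(base_dir: str, dataset_name: str) -> List[Path]:
    """Return the non-empty files under base_dir/dataset_name, or raise."""
    target = Path(base_dir) / dataset_name
    if not target.is_dir():
        raise FileNotFoundError(f"missing dataset directory: {target}")
    files = sorted(p for p in target.rglob("*")
                   if p.is_file() and p.stat().st_size > 0)
    if not files:
        # Catch the "loads an empty dataset" case explicitly.
        raise ValueError(f"dataset directory has no non-empty files: {target}")
    return files
```

Calling dataset_files("processed_data", "MATH") before launching would fail fast if the directory was renamed or emptied; adjust the layout assumption to match what you find in the repo.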
Map benchmark families to the paper's evaluation tasks
The paper evaluates MAD-M^2 on mainstream mathematical and logical reasoning benchmarks: "Extensive experiments and analyses on mainstream mathematical and logical reasoning..." (arXiv:2603.20215). The tracking table below leaves the Paper Result and Repo / Run Result columns blank so you can fill them in as you confirm each value locally:
| Benchmark Family | Task Type | Evaluation Format | Paper Result | Repo / Run Result |
|---|---|---|---|---|
| Mathematical reasoning | Numerical / algebraic problem solving | Exact answer match | | |
| Logical reasoning | Deductive / propositional reasoning | Multiple choice or exact match | | |
Verify the exact benchmark names from the paper PDF before finalizing any results table. The scraped repo snippets do not enumerate every benchmark identifier, and running against a misnamed dataset will produce results that do not correspond to any reported number in the paper.
Load the paper's Hugging Face models correctly
All four paper checkpoints require Hugging Face access. The repo's LLM inference backend (vLLM) handles model loading through the model_path value in configs.yaml. For large checkpoints or gated repositories, authenticate before running:
$ huggingface-cli login
# enter your token at the prompt
$ huggingface-cli download Qwen/QwQ-32B # pre-fetch for offline runs
Watch Out: Large checkpoints and rate-limited Hub endpoints are the most common cause of reproducibility failures that have nothing to do with the code. If your inference node has limited or unreliable outbound network access, pre-download all checkpoints to local storage and set model_path to the absolute local path before starting any timed or batch experiment.
Choose the right model family for each run
The paper uses four distinct checkpoints spanning general instruction-tuned and math-specialized model families:
| Model | Type | Parameters | Hugging Face ID |
|---|---|---|---|
| Qwen2.5-7B-Instruct | General instruction | 7B | Qwen/Qwen2.5-7B-Instruct |
| Qwen2.5-Math-7B-Instruct | Math-specialized | 7B | Qwen/Qwen2.5-Math-7B-Instruct |
| DeepSeek-Math-7B-Instruct | Math-specialized | 7B | deepseek-ai/deepseek-math-7b-instruct |
| QwQ-32B | Reasoning-focused | 32B | Qwen/QwQ-32B |
The 7B models are directly comparable to each other; QwQ-32B is a separate scale point with distinct inference cost. "Qwen2.5 is the latest series of Qwen large language models." (Qwen2.5-7B-Instruct model card) The DeepSeek-Math-7B-Instruct repository name should be verified against the official Hugging Face model card before publication, because hub identifiers are case-sensitive. Do not conflate parameter count with runtime memory requirements; the repo does not publish explicit VRAM specifications.
Set model_path for local checkpoints and remote snapshots
The model_path field accepts both local absolute paths and Hugging Face Hub identifiers. The choice determines whether vLLM loads from disk or triggers a network download:
# Keep exactly one model_path line active; duplicate keys are not valid in a single YAML file.
# Remote: vLLM fetches from Hugging Face Hub on first run
model_path: "Qwen/Qwen2.5-Math-7B-Instruct"
# Local: vLLM loads directly from the specified directory
# model_path: "/data/models/Qwen2.5-Math-7B-Instruct"
# Local cache path from huggingface-cli download
# model_path: "/root/.cache/huggingface/hub/models--Qwen--QwQ-32B/snapshots/REVISION_HASH"
When using remote snapshots, pin a specific revision hash to prevent the Hub from silently serving updated weights that change your results. A local path must point to the directory containing config.json and the model weight shards, not to a parent directory.
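The local-path requirement can be checked mechanically. The sketch below uses a hypothetical model_dir_problems helper and assumes weight shards use the .safetensors or .bin extensions that Hugging Face checkpoints commonly ship with; it does not validate the contents of any file.

```python
from pathlib import Path
from typing import List

# Common weight file extensions for Hugging Face checkpoints (assumption).
WEIGHT_PATTERNS = ("*.safetensors", "*.bin")


def model_dir_problems(model_dir: str) -> List[str]:
    """Return a list of problems with a local checkpoint directory (empty if none)."""
    base = Path(model_dir)
    if not base.is_dir():
        return [f"not a directory: {base}"]
    problems = []
    if not (base / "config.json").is_file():
        problems.append("config.json not found")
    if not any(any(base.glob(pattern)) for pattern in WEIGHT_PATTERNS):
        problems.append("no weight shards found")
    return problems
```

Pointing this at a parent directory (for example, the snapshots/ folder rather than the revision folder inside it) reports both problems, which is the mistake the paragraph above warns about.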
Run the reference experiment with three agents and two rounds
The paper's reference configuration uses three agents and two debate rounds. Set prune_strategy to objective or subjective to engage memory masking; naive is functionally equivalent to baseline MAD for context filtering purposes.
$ python multi_agent_debate.py \
--dataset MATH \
--num_agents 3 \
--max_round 2 \
--prune_strategy objective
This command sets four of the constructor arguments shown in the repo snippet (dataset_name, num_agents, max_round, and prune_strategy), leaves strict and only_voting at whatever defaults the script defines, and runs the full round-based debate loop across the benchmark dataset. Verify the exact --dataset flag value against the benchmark identifiers in processed_data/ before running; mismatched names fail silently in some loader configurations.
Compare naive, subjective, and objective pruning
The three pruning strategies correspond to different assumptions about how to identify erroneous memories:
| Strategy | Behavior | When to use |
|---|---|---|
| naive | Passes the prior trace with minimal or no filtering | Baseline MAD reproduction; establishes the unmasked upper bound on error propagation |
| subjective | Filters memories based on agent self-assessment of prior reasoning quality | When you want the agent itself to judge which context to carry forward |
| objective | Filters memories using an external or rule-based criterion independent of the generating agent | When you want a deterministic or externally validated masking condition |
The paper positions objective and subjective masking as the MAD-M^2 conditions and naive as the baseline. For multi-agent orchestration experiments where you want a clean ablation, run all three strategies on the same dataset split and model checkpoint before drawing conclusions.
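For the three-way ablation, generating the commands programmatically keeps the dataset, agent count, and round count identical across strategies. This sketch only builds the command lines from the four flags shown earlier in this guide; ablation_commands is a hypothetical helper and nothing is executed.

```python
import shlex
from typing import List

STRATEGIES = ("naive", "subjective", "objective")


def ablation_commands(dataset: str, num_agents: int = 3,
                      max_round: int = 2) -> List[List[str]]:
    """Build one argv list per pruning strategy with all other settings fixed."""
    commands = []
    for strategy in STRATEGIES:
        commands.append([
            "python", "multi_agent_debate.py",
            "--dataset", dataset,
            "--num_agents", str(num_agents),
            "--max_round", str(max_round),
            "--prune_strategy", strategy,
        ])
    return commands


for argv in ablation_commands("MATH"):
    print(shlex.join(argv))
```

Each argv list can be handed to subprocess.run once you have confirmed the flags against the repo, so the only thing that varies between runs is the strategy name.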
Watch Out: The scraped repo sources describe the strategy names but do not expose full implementation semantics for each mode. Inspect the MultiAgentDebate class implementation in the repo before concluding that subjective and objective behave as you expect on a new dataset or model. Strategy behavior may depend on the output format of the specific model family you are using.
Read the output trace and know what to expect
A healthy run produces per-round structured output showing each agent's response and the masking decision applied before the next round. Because the system iterates round by round, the trace should show all agents producing responses and the mask being applied between rounds.
Pro Tip: A stalled trace almost always means model loading failed silently or the network connection to Hugging Face dropped mid-run rather than indicating a reasoning failure. Check that your first round produces output from all three agents before waiting for round two. If the trace shows fewer responses than expected in round one, the model loading or inference path is broken — not the debate logic.
Verify results against the paper's reasoning benchmarks
The paper's central empirical claim is that MAD-M^2 outperforms baseline MAD across mathematical and logical reasoning benchmarks: "Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks..." (arXiv:2603.20215). To verify reproduction, run both the masked and unmasked conditions on the same dataset split and compare accuracy.
The benchmark families and expected evaluation structure are:
| Benchmark Family | Task Type | Metric | MAD-M^2 vs MAD |
|---|---|---|---|
| Mathematical reasoning | Algebraic / numerical | Exact match accuracy | Compare locally against the paper table |
| Logical reasoning | Deductive / propositional | Exact match accuracy | Compare locally against the paper table |
The scraped sources do not provide per-benchmark accuracy deltas, so the "MAD-M^2 vs MAD" column requires the paper PDF to populate correctly. Do not fill it with estimated values.
Check whether memory masking beats baseline MAD
The comparison that validates the method is a direct head-to-head between prune_strategy=naive (baseline MAD) and prune_strategy=objective or subjective (MAD-M^2) on identical inputs. The paper states: "MAD-M^2 ... outperforms MAD." (arXiv:2603.20215)
| Condition | Prune Strategy | Memory Filtering | Expected Outcome |
|---|---|---|---|
| Baseline MAD | naive | None / minimal | Lower accuracy on reasoning benchmarks |
| MAD-M^2 | subjective | Agent self-assessed | Higher accuracy per paper claim |
| MAD-M^2 | objective | External criterion | Higher accuracy per paper claim |
Fill the "Expected Outcome" column with actual accuracy values from your run and from the paper table. Any gap between your numbers and the paper's numbers is the reproduction delta — if it exceeds a few percentage points on the same model and dataset, suspect environment or data path configuration before suspecting the code.
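The comparison itself is simple arithmetic once both runs have produced answers for the same problems. The sketch below uses made-up toy predictions purely to show the calculation; they are not results from the paper or the repo, and the accuracy and delta_points helpers are hypothetical names using plain string equality rather than the repo's answer-matching logic.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent over aligned prediction/reference lists."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def delta_points(masked_acc: float, baseline_acc: float) -> float:
    """Percentage-point gain of the masked condition over the baseline."""
    return masked_acc - baseline_acc


# Toy data for illustration only.
gold = ["4", "9", "16", "25"]
baseline_preds = ["4", "9", "15", "24"]   # stands in for prune_strategy=naive
masked_preds = ["4", "9", "16", "24"]     # stands in for prune_strategy=objective

baseline_acc = accuracy(baseline_preds, gold)
masked_acc = accuracy(masked_preds, gold)
print(baseline_acc, masked_acc, delta_points(masked_acc, baseline_acc))
# prints 50.0 75.0 25.0
```

The same delta_points call applies to the reproduction delta: pass your measured accuracy and the paper's reported value for the same model and dataset.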
Fix the most common run failures
Watch Out: The three most common failure modes are model loading errors (wrong model_path), package version conflicts (wrong vllm or transformers), and network failures during Hugging Face downloads. Diagnose them in that order: model path first, then environment, then network. They present differently but are often conflated.
Quick diagnostics for each failure class:
# Check which vllm and transformers are actually installed
$ pip show vllm transformers huggingface_hub
# Verify the model path resolves correctly before launching a run
$ ls /your/model/path/config.json
# Confirm Hugging Face authentication is current
$ huggingface-cli whoami
When vLLM cannot load the checkpoint
A vLLM checkpoint loading failure produces an error in the LLM constructor before any inference occurs. The common causes are a path that does not resolve to a valid model directory and a model architecture or version mismatch with vllm==0.6.3.
from vllm import LLM
# Minimal check: if this raises, the checkpoint path or vLLM version is wrong
llm = LLM(model="/your/model/path") # replace with your actual model_path value
print(llm.llm_engine.model_config.model)
If the model path is not a valid local directory, vLLM falls back to downloading from Hugging Face, as documented in the Qwen vLLM deployment guide. A download that hangs at this stage means network access or authentication is blocking the fetch — not a vLLM bug.
When Transformers or tokenizer files mismatch
Watch Out: The failure pattern for a Transformers version mismatch is usually a ValueError or KeyError during tokenizer initialization, not during model weight loading. The error message often names a tokenizer class or config key that the installed transformers version does not recognize. The fix is always the same: confirm transformers==4.46.2 is installed and that the model's tokenizer_config.json was downloaded with the same Hugging Face snapshot as the weight files. Stale cache entries, where the tokenizer config was downloaded under a different transformers version, are the most common source of this failure. Clear the local cache directory for the affected model and re-download cleanly.
Pin a specific model revision in huggingface_hub to ensure the tokenizer config and weight shards come from the same snapshot. Mixing a tokenizer config from one commit with weights from another produces inconsistent padding behavior that may not produce an error at load time but will degrade generation quality silently.
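One way to pin a snapshot is to pass a full commit hash as the revision argument of huggingface_hub's snapshot_download, which returns the local snapshot directory. The wrapper below is a sketch: fetch_pinned and is_commit_hash are hypothetical helpers, and the 40-character check assumes Hub revisions are full git commit hashes rather than branch or tag names such as main.

```python
import re

# A full git commit hash: 40 lowercase hexadecimal characters.
COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")


def is_commit_hash(revision: str) -> bool:
    """True only for a full commit hash, not a branch or tag name."""
    return bool(COMMIT_HASH.match(revision))


def fetch_pinned(repo_id: str, revision: str) -> str:
    """Download one exact snapshot and return its local directory."""
    if not is_commit_hash(revision):
        raise ValueError("pin a full 40-character commit hash, not a branch or tag")
    # Imported lazily so the validation above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, revision=revision)
```

The returned directory holds the tokenizer config and weight shards from the same commit, so it can be used directly as the model_path value.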
What to adapt for your own five-plus-agent system
The MAD-M^2 architecture is parameterized for agent-count scaling via num_agents. Increasing that value beyond three adds more independent reasoning paths per round, which can improve consensus quality on hard problems but also increases inference cost linearly and extends round latency proportionally. The multi-agent orchestration pattern generalizes directly: the masking step is not tied to exactly three agents or two rounds; it is a round-boundary operation that applies at any scale.
The key insight for production adaptation is that memory masking improves robustness only when the masking criterion is reliable. At higher agent counts, the probability that at least one agent produces an erroneous chain-of-thought per round increases, making the masking step more valuable — but also more demanding of the criterion's precision. A naive strategy that was acceptable at three agents may become a bottleneck at seven or ten.
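Both scaling effects can be estimated with back-of-the-envelope arithmetic. The sketch below rests on simplifying assumptions that the source does not state: each agent generates exactly once per round, any extra calls made by the masking step are ignored, and agents err independently with the same probability. The function names are hypothetical.

```python
def generations_per_problem(num_agents: int, max_round: int) -> int:
    """Agent generations per problem, assuming one generation per agent per round."""
    return num_agents * max_round


def p_any_erroneous(p_error: float, num_agents: int) -> float:
    """Probability that at least one of num_agents independent agents errs in a round."""
    return 1.0 - (1.0 - p_error) ** num_agents


for n in (3, 5, 7, 10):
    print(n, generations_per_problem(n, 2), round(p_any_erroneous(0.2, n), 3))
```

With a per-agent error rate of 0.2, the chance that a round contains at least one erroneous trace is about 0.488 at three agents and rises with every agent added, while generation count grows linearly; real agents sharing a base model are unlikely to err independently, so treat these numbers as rough bounds.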
Production Note: Memory masking is not a substitute for trace quality. If the debate agents are using a weak base model or a domain outside their training distribution, the masking step will have fewer high-quality memories to preserve — and performance will plateau regardless of agent count or round depth. Masking helps most when at least some agents per round are producing correct reasoning that would otherwise be diluted by erroneous context from others.
Where this pattern fits research firms, SDR swarms, and data pipelines
The round-based debate structure with per-round context filtering maps directly onto any multi-agent orchestration system that requires iterative critique: a research firm's literature synthesis pipeline where multiple agents cross-check each other's citations, an autonomous SDR system where agents debate lead qualification criteria, or a data labeling pipeline where agents adjudicate disagreements. The repo's use of Hugging Face, vLLM, and a YAML configuration contract mirrors the infrastructure pattern most production LLM inference deployments already use.
Pro Tip: The only_voting flag in multi_agent_debate.py lets you disable the full debate loop and run a pure voting baseline. For production systems where latency matters more than per-query accuracy, use only_voting=True as a fast-path option and fall back to full debate only for low-confidence cases. This turns the debate system into an adaptive inference strategy rather than a fixed-cost orchestration layer.
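The fast-path idea can be sketched as a vote followed by a conditional fallback. Everything here is hypothetical and independent of the repo: majority_vote, adaptive_answer, and the min_agreement threshold are illustrative names, and agreement among first answers is used as a stand-in for confidence.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of agents that gave it."""
    if not answers:
        raise ValueError("need at least one answer to vote")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def adaptive_answer(answers: Sequence[str],
                    run_full_debate: Callable[[], str],
                    min_agreement: float = 1.0) -> Tuple[str, str]:
    """Accept the vote when agreement is high enough; otherwise run the debate."""
    winner, agreement = majority_vote(answers)
    if agreement >= min_agreement:
        return winner, "voting"
    # Low agreement: pay for the full debate loop on this query only.
    return run_full_debate(), "debate"


print(adaptive_answer(["4", "4", "4"], run_full_debate=lambda: "unused"))
# prints ('4', 'voting')
```

Lowering min_agreement trades accuracy for latency: at 1.0 only unanimous votes skip the debate, while a lower threshold lets simple majorities through.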
Questions readers ask before reproducing MAD-M^2
What is multi-agent debate in LLMs?
Multi-agent debate is a round-based inference framework where multiple LLM instances each generate an answer to a problem, then each agent receives the other agents' answers as context and revises its own answer. Rounds repeat until a convergence criterion or max_round is reached. The motivation is that disagreement between agents surfaces errors more reliably than any single agent's self-correction.
How does memory masking work in MAD-M^2?
At the start of each debate round, MAD-M^2 evaluates the previous round's debate trace and removes reasoning steps identified as erroneous based on the selected prune_strategy. The remaining context — assessed as informative — is passed to agents for the current round. This happens before agents generate new responses, not after.
How do you run the MAD-MM code on Hugging Face models?
Set model_path in configs.yaml to a Hugging Face model identifier or a local path, then run python multi_agent_debate.py with your chosen --dataset, --num_agents, --max_round, and --prune_strategy flags. vLLM handles model loading; if the path is not local, it downloads from the Hub automatically.
What models are used in multi-agent debate reasoning benchmarks?
The MAD-M^2 paper uses Qwen2.5-7B-Instruct, Qwen2.5-Math-7B-Instruct, DeepSeek-Math-7B-Instruct, and QwQ-32B. The 7B models cover general and math-specialized instruction-following; QwQ-32B is a larger reasoning-focused checkpoint.
Is vLLM required for multi-agent debate systems?
vLLM is required by this repo's implementation — it is the LLM inference backend that loads and runs the checkpoints. Multi-agent debate as a concept does not mandate vLLM; any inference backend that serves a generative model could implement the same round-based debate logic. If you want to reproduce the paper's specific experiments, use vLLM 0.6.3 as pinned.
Sources and references
Production Note: The canonical reproduction references are the arXiv paper (2603.20215) for method description and benchmark claims, and the MAD-MM GitHub repository for code, configuration, and dataset preparation. Use these two sources as ground truth for any discrepancy between this guide and your local run.
- Multi-Agent Debate with Memory Masking — arXiv:2603.20215 — Primary paper describing MAD-M^2 methodology and benchmark results
- MAD-MM GitHub Repository (ICLR 2026) — Implementation reference for all code paths, configuration, and dataset preparation
- vLLM on PyPI — Official package distribution for vLLM 0.6.3
- Qwen vLLM Deployment Documentation — Documents model_path resolution and Hugging Face fallback behavior
- Qwen2.5-7B-Instruct Model Card — Model family description and checkpoint details
- QwQ-32B on Hugging Face — 32B reasoning checkpoint used in the paper
- DeepSeek-Math-7B-Instruct on Hugging Face — Math-specialized 7B checkpoint used in the paper



