What SimPO changes in preference optimization
SimPO (Simple Preference Optimization with a Reference-Free Reward) removes the reference model entirely from the preference optimization loop — a change that reduces memory and compute overhead while, according to the arXiv paper (2405.14734), consistently outperforming DPO and its recent variants across AlpacaEval 2, MT-Bench, and Arena-Hard. The Princeton NLP team frames this as a fundamental redesign of the reward signal, not a superficial engineering shortcut: the reference-log-ratio term that anchors DPO's implicit reward is replaced by an average log-probability score computed directly from the policy being trained.
Bottom Line: SimPO eliminates the reference model from preference optimization by replacing DPO's reference-log-ratio reward with a length-normalized average log-probability reward, reporting gains of up to +6.4 points on AlpacaEval 2 and +7.5 points on Arena-Hard — but those numbers depend heavily on tuning learning_rate, beta, and gamma, and do not transfer automatically across hardware stacks.
The paper was presented at NeurIPS 2024, as reflected in the arXiv record for SimPO. The benchmark claims are empirically supported on the authors' tested configurations; the repo explicitly cautions that results may vary with hardware and CUDA versions, and that the method is not plug-and-play.
Why DPO's reference model matters in the first place
DPO derives a closed-form preference objective by treating the reference model's log-probability as a normalizing baseline. Given a winning response $y_w$ and a losing response $y_l$ for prompt $x$, the DPO implicit reward for response $y$ is:
$$r_{\text{DPO}}(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right]$$
The reference term $\log \pi_{\text{ref}}(y \mid x)$ keeps the policy from drifting too far from the supervised fine-tuned checkpoint — it acts as a KL-divergence anchor in reward space. Without it, nothing prevents the policy from assigning arbitrarily high log-probability to any response, regardless of quality. The reference model also provides the relative scoring baseline: a response is rewarded not for its absolute probability but for how much more probable the policy finds it than the frozen reference does.
This design has a practical cost: every DPO training step requires a forward pass through both the live policy and the reference model, roughly doubling peak GPU memory relative to standard SFT.
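To make the double forward pass concrete, here is a minimal sketch of DPO's implicit reward, assuming a Hugging Face-style causal LM whose labels mark prompt and padding positions with -100. The helper names (`sequence_logprob`, `dpo_implicit_reward`) are illustrative, not the repo's API, and attention masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of the response tokens (prompt/padding labels set to -100)."""
    logits = model(input_ids).logits[:, :-1, :]          # position t predicts token t+1
    targets = labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, 2, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != -100).float()                      # keep response tokens only
    return (token_logp * mask).sum(dim=-1)

def dpo_implicit_reward(policy, ref_model, input_ids, labels, beta=0.1):
    """r_DPO = beta * [log pi_theta(y|x) - log pi_ref(y|x)] -- requires TWO forward passes."""
    policy_lp = sequence_logprob(policy, input_ids, labels)
    with torch.no_grad():                                 # frozen reference: extra memory and compute
        ref_lp = sequence_logprob(ref_model, input_ids, labels)
    return beta * (policy_lp - ref_lp)
```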
Pro Tip: The reference-log-ratio term in DPO does two jobs simultaneously — KL regularization and relative scoring. SimPO must replace both functions without a reference model. Understanding this dual role clarifies why SimPO cannot simply drop the term and call it done; it needs a structural substitute.
Why a reference-free reward is the core design choice
SimPO is reference-free by construction. As the paper abstract states: "The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient."
The key insight is that average log-probability — sequence log-likelihood divided by sequence length — is already well-calibrated to what the model will actually generate at inference time. Greedy decoding and beam search both implicitly favor sequences with high average log-probability per token. DPO's reward, by contrast, scores relative to a frozen checkpoint that no longer participates in generation after SFT. Using average log-probability as the reward closes the gap between training objective and generation behavior without requiring a second model in memory.
For SimPO, the implicit reward for a response $y$ of length $|y|$ is:
$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
This eliminates the reference forward pass entirely.
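Under the same assumptions, SimPO's reward is the same policy log-probability, divided by the number of response tokens instead of compared against a frozen reference. The sketch below reuses the illustrative `sequence_logprob` helper from earlier; the -100 label mask is what makes the per-token average well-defined.

```python
def simpo_implicit_reward(policy, input_ids, labels, beta=2.0):
    """r_SimPO = (beta / |y|) * sum_t log pi_theta(y_t | x, y_<t) -- one forward pass, no reference."""
    targets = labels[:, 1:]
    mask = (targets != -100).float()
    response_len = mask.sum(dim=-1).clamp(min=1)          # |y|: number of response tokens per example
    summed_lp = sequence_logprob(policy, input_ids, labels)
    return beta * summed_lp / response_len
```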
The SimPO objective, written step by step
The full SimPO loss introduces one additional term beyond the reference-free reward: a target reward margin $\gamma > 0$ that pushes the winning response's reward to exceed the losing response's reward by at least $\gamma$ — the loss keeps supplying gradient until that margin is cleared. The objective is:
$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right]$$
where:
- $\frac{\beta}{|y|} \log \pi_\theta(y \mid x)$ is the length-normalized average log-probability reward for response $y$
- $\beta > 0$ scales the gap between winning and losing rewards (the "sharpness" of the sigmoid)
- $\gamma > 0$ is the target margin — the minimum reward gap the optimization pushes the model to achieve; the loss stays large, and the gradient stays strong, until the gap reaches it
- $\sigma(\cdot)$ is the logistic sigmoid function
- $\mathcal{D}$ is the preference dataset of (prompt, chosen, rejected) triples
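Putting the pieces together, a minimal loss sketch under the same assumptions as the reward snippets above; the tensor names (`chosen_ids`, `rejected_labels`, and so on) are hypothetical batch fields, and beta and gamma follow the paper's notation.

```python
def simpo_loss(policy, chosen_ids, chosen_labels, rejected_ids, rejected_labels,
               beta=2.0, gamma=1.0):
    """L_SimPO = -log sigmoid(r_w - r_l - gamma), averaged over the batch."""
    r_w = simpo_implicit_reward(policy, chosen_ids, chosen_labels, beta)      # winning reward
    r_l = simpo_implicit_reward(policy, rejected_ids, rejected_labels, beta)  # losing reward
    # Gradient keeps flowing until r_w - r_l clears the target margin gamma.
    return -F.logsigmoid(r_w - r_l - gamma).mean()
```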
The SimPO README identifies learning_rate, beta, and gamma as the three hyperparameters requiring the most attention, and recommends holding the total batch size fixed at 128 while sweeping those three.
| Benchmark | Concrete figure | Source context |
|---|---|---|
| AlpacaEval 2 | 805 questions from 5 datasets | Benchmark size used in the released evaluation setup |
| MT-Bench | 80 questions across 8 categories | Multi-turn judging setup used in the released evaluation setup |
| Arena-Hard | 500 technical queries | Pairwise hard-instruction benchmark used in the released evaluation setup |
From DPO's log-ratio to SimPO's average log-probability score
The structural difference between the two objectives becomes clear when you write them side by side. The DPO loss is:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma \left( \beta \underbrace{\left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right]}_{\text{reference-log-ratio term}} \right) \right]$$
SimPO removes the reference-log-ratio term and replaces it with length-normalized average log-probabilities plus the margin $\gamma$:

$$\mathcal{L}_{\text{SimPO}} = -\mathbb{E} \left[ \log \sigma \left( \underbrace{\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)}_{\text{reference-free, length-normalized reward}} - \gamma \right) \right]$$

The removed term $\log \pi_{\text{ref}}(y \mid x)$ is what required the second model forward pass. Once dropped, the optimizer sees only the policy's own log-probabilities — no frozen baseline in the gradient graph. The margin $\gamma$ partially compensates for the missing KL anchor: the loss keeps pushing until the reward gap exceeds a non-trivial threshold, rather than saturating as soon as the winning response is merely more probable than the losing one.
What the beta hyperparameter controls
In the SimPO objective, $\beta$ controls the reward scaling between the winning and losing responses. As the SimPO repo notes: "beta controls the reward scaling between winning and losing responses" and "SimPO requires a much larger beta than DPO."
Formally, the pre-sigmoid argument is:
$$z = \beta \left( \frac{\log \pi_\theta(y_w \mid x)}{|y_w|} - \frac{\log \pi_\theta(y_l \mid x)}{|y_l|} \right) - \gamma$$
When $\beta$ is small, even large differences in average log-probability produce a shallow sigmoid argument, and the gradient is weak regardless of how clearly the model distinguishes $y_w$ from $y_l$. When $\beta$ is large, the sigmoid saturates quickly — the model receives near-zero gradient once it has separated winning and losing rewards past the margin.
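A toy calculation (made-up numbers) illustrates the saturation argument: for a fixed average log-probability gap of 0.3 nats between chosen and rejected, the pre-sigmoid argument $z$ and the gradient scale $\sigma(-z)$ behave very differently across beta values.

```python
import math

def grad_scale(z):
    """Gradient magnitude is proportional to sigmoid(-z), and also to beta itself (see the gradient expression below)."""
    return 1.0 / (1.0 + math.exp(z))

gap, gamma = 0.3, 1.0              # assumed average log-prob gap and margin, for illustration only
for beta in (0.5, 2.0, 10.0, 50.0):
    z = beta * gap - gamma         # pre-sigmoid argument
    print(f"beta={beta:5.1f}  z={z:7.2f}  sigmoid(-z)={grad_scale(z):.6f}")
# beta=0.5 or 2.0: z stays negative and sigmoid(-z) stays high, but the update is still
# scaled down by the small beta; beta=50: z >> 0, sigmoid(-z) ~ 0 and the pair stops learning.
```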
Watch Out: Setting $\beta$ too large causes the SimPO loss to saturate early in training. Once the policy separates the average log-probabilities of winning and losing responses by more than $\gamma / \beta$ (equivalently, once the reward gap exceeds $\gamma$), gradients vanish and training stalls. This overconfidence failure mode is especially acute when the preference dataset contains easy pairs where $y_w$ and $y_l$ are clearly distinguishable. The repo warns not to tune $\beta$ in isolation — sweep it jointly with learning_rate and gamma.
Why sequence length normalization changes the gradient signal
Without length normalization, a response's log-probability is a sum over tokens:
$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
This sum scales with sequence length, so a response's unnormalized log-probability reflects how long it is as much as how probable each of its tokens is. In DPO, the reference-log-ratio partially cancels this length dependence because the reference model's log-probability accumulates over the same tokens. Once the reference is removed, a raw summed log-probability reward entangles length with quality: the reward gap between $y_w$ and $y_l$ is driven partly by their length difference rather than by per-token preference, which biases training toward exploiting length rather than improving responses.
SimPO normalizes by sequence length to correct this bias. The gradient of the SimPO loss with respect to a model parameter $\theta$ involves:

$$\frac{\partial \mathcal{L}_{\text{SimPO}}}{\partial \theta} \propto -\sigma(-z) \cdot \beta \left( \frac{1}{|y_w|} \frac{\partial \log \pi_\theta(y_w \mid x)}{\partial \theta} - \frac{1}{|y_l|} \frac{\partial \log \pi_\theta(y_l \mid x)}{\partial \theta} \right)$$
The $1/|y|$ factor means each token in a longer sequence contributes a proportionally smaller gradient magnitude. This keeps preference optimization length-neutral: a correct short response and a correct long response apply the same per-token update pressure, preventing the model from learning to pad.
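A toy comparison (synthetic numbers) shows why summed log-probabilities are not comparable across lengths while per-token averages are:

```python
short_resp = [-0.2] * 20      # 20 response tokens, all at the same per-token log-prob
long_resp  = [-0.2] * 120     # 120 response tokens, same per-token quality

for name, token_logps in (("short", short_resp), ("long", long_resp)):
    total = sum(token_logps)
    avg = total / len(token_logps)
    print(f"{name:5s}  summed log-prob = {total:7.1f}   average log-prob = {avg:5.2f}")
# summed:  short = -4.0 vs long = -24.0 -> identical per-token quality, very different scores
# average: both  = -0.20               -> the length-normalized reward treats them the same
```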
Reported benchmark results from the SimPO repo and paper
The arXiv paper (2405.14734) reports SimPO outperforming DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard. The top-performing model — Gemma-2-9B-it fine-tuned with SimPO — achieves a 72.4% length-controlled win rate on AlpacaEval 2 and a 59.1% win rate on Arena-Hard.
The evaluation suite used in the repo covers three benchmarks with different coverage:
| Benchmark | Format | Size | What it tests |
|---|---|---|---|
| AlpacaEval 2 | LLM-as-judge vs. GPT-4 | 805 questions from 5 datasets | Instruction following, LC win rate |
| MT-Bench | 8-category judge scoring | 80 questions | Multi-turn reasoning, writing, coding |
| Arena-Hard | Pairwise vs. GPT-4 Turbo | 500 technical queries | Hard problem-solving, STEM |
The paper's headline deltas against DPO baselines, reported across tested model families:
| Model base | Benchmark | SimPO | DPO | Delta |
|---|---|---|---|---|
| Gemma-2-9B-it | AlpacaEval 2 LC win rate | 72.4% | — | best reported model under 10B |
| Gemma-2-9B-it | Arena-Hard win rate | 59.1% | — | best reported model under 10B |
| Various tested configs | AlpacaEval 2 | — | baseline | up to +6.4 pts over DPO |
| Various tested configs | Arena-Hard | — | baseline | up to +7.5 pts over DPO |
Watch Out: The AlpacaEval 2 evaluation pipeline has undergone a major revision for vLLM decoding since version 0.6.3. The SimPO repo explicitly notes this creates a discrepancy relative to the authors' original experiments. If you run AlpacaEval 2 with a newer vLLM version, numbers will not match the paper directly.
Where SimPO looks strongest relative to DPO variants
As the SimPO README states: "SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings." The strongest reported deltas occur on Arena-Hard (+7.5 pts) and AlpacaEval 2 (+6.4 pts), which are the two benchmarks most sensitive to instruction-following quality and response calibration. MT-Bench shows improvements but smaller absolute margins given its smaller question pool and the noisier judge signal from 80 items.
| Setting | Metric | SimPO result |
|---|---|---|
| Best overall (Gemma-2-9B-it) | AlpacaEval 2 LC win rate | 72.4% (absolute) |
| Best overall (Gemma-2-9B-it) | Arena-Hard win rate | 59.1% (absolute) |
| Across all tested configs | AlpacaEval 2 delta vs. DPO | up to +6.4 pts |
| Across all tested configs | Arena-Hard delta vs. DPO | up to +7.5 pts |
SimPO's margin is largest when the preference dataset contains high-quality chosen/rejected pairs with non-trivial length variation — exactly the regime where length normalization makes the reward signal cleaner than DPO's unnormalized log-ratio.
Why benchmark wins do not automatically transfer across labs
The Princeton NLP team is explicit about this: "Hyperparameter tuning is crucial for SimPO (and other preference optimization algorithms in general)." The three primary knobs are learning_rate, beta, and gamma. Beyond hyperparameters, the repo states that "results may still vary due to differences in hardware configurations and CUDA versions, etc."
The sources of cross-lab variance are cumulative: different GPU generations (H100 vs. A100 vs. consumer hardware) change numerics slightly through floating-point accumulation order in FlashAttention 2; CUDA versions affect kernel selection in fused attention implementations; and the SimPO trainer itself was updated for better reproducibility after initial release, meaning runs against an older commit may not match current code.
Watch Out: Do not treat the paper's reported numbers as a floor you can expect to clear by default. The +6.4 AlpacaEval 2 delta and +7.5 Arena-Hard delta are best-case results under the authors' tuned hyperparameters on their hardware. Out-of-the-box runs with DPO-scale beta values will underperform because SimPO requires a much larger beta than DPO — the reward scaling is fundamentally different.
Pro Tip: Before starting a fresh SimPO run, load the authors' released checkpoints and evaluate them in your environment using the repo's eval scripts. If their checkpoint does not reproduce their reported scores on your stack, the gap is infrastructure — not hyperparameters. Fix the environment first. The repo ships conda environment files specifically for this diagnostic step. TRL also supports the SimPO loss (through its CPO trainer); verify which code path matches the released checkpoints before tuning.
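A minimal environment sanity check along these lines, assuming the transformers library (plus accelerate for `device_map="auto"`) and enough GPU memory for a 9B model in bf16; the checkpoint ID below is the one advertised in the SimPO README at the time of writing and should be verified against the repo's release list. Full benchmark scores should still come from the repo's eval scripts, not this snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ckpt = "princeton-nlp/gemma-2-9b-it-SimPO"   # released SimPO checkpoint per the README; verify the exact ID
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain length normalization in preference optimization in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=200, do_sample=False)   # greedy decode as a smoke test
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```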
How SimPO compares with DPO, ORPO, and KTO in practice
The SimPO paper explicitly compares against DPO and its variants. Direct benchmark comparisons against ORPO and KTO do not appear in the primary SimPO sources, so the table below uses general engineering characterization for those two methods rather than specific scores.
| Method | Objective form | Reference model | Margin term | Tuning burden | Memory vs. SFT |
|---|---|---|---|---|---|
| DPO | Log-ratio of policy vs. ref | Required (frozen copy) | None | Moderate (lr, beta) | ~2× |
| SimPO | Avg log-prob difference + margin | Not required | $\gamma$ required | High (lr, beta, gamma) | ~1× |
| ORPO | Reference-free preference objective | Not required | None | Moderate | Lower than reference-based methods |
| KTO | Preference-style objective for non-paired signals | Not required in the SimPO paper context | None | Moderate | Lower than reference-based methods |
ORPO is closest in spirit to SimPO in that both are reference-free, but ORPO integrates a supervised fine-tuning objective directly into its loss. KTO addresses a different data regime than SimPO targets, so it is better treated as a separate option rather than a drop-in baseline.
When SimPO is the better choice
- Choose SimPO when you have a well-curated paired preference dataset (chosen/rejected), your team can run a proper hyperparameter sweep over learning_rate, beta, and gamma, and you want to eliminate the memory overhead of a frozen reference model on H100-class hardware where that overhead is the limiting constraint.
- Choose SimPO when you are optimizing for AlpacaEval 2 or Arena-Hard as primary eval targets — the reported gains are largest on these benchmarks.
- Choose SimPO when your base model is a strong instruction-tuned checkpoint (e.g., Llama 3 8B Instruct or Gemma-2-9B-it) that already generates coherent responses; the length-normalized reward signal works best when the policy is not in a degenerate initial state.
Pro Tip: TRL exposes the SimPO objective through its CPO trainer (loss_type="simpo"), and the Princeton NLP repo ships its own SimPO trainer. Start with the authors' recommended total batch size of 128, then grid-search beta in the range 2–20 (much higher than typical DPO beta of 0.1–0.5) and gamma in the range 0.5–1.5 before touching learning_rate.
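A hedged configuration sketch using TRL's CPO trainer with the SimPO loss. The parameter names (loss_type, cpo_alpha, simpo_gamma, beta) follow the TRL documentation at the time of writing and should be checked against your installed version; the model and dataset IDs, learning rate, and batch-size split are placeholders rather than the authors' released setup.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"        # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any (prompt, chosen, rejected) preference dataset in a format your TRL version accepts.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # placeholder dataset

config = CPOConfig(
    output_dir="simpo-llama3-8b",
    loss_type="simpo",                 # SimPO objective
    cpo_alpha=0.0,                     # disable the CPO SFT/NLL term -> pure SimPO
    beta=2.0,                          # much larger than typical DPO beta (0.1-0.5)
    simpo_gamma=1.0,                   # target reward margin gamma
    learning_rate=6e-7,                # illustrative starting point, not a released setting
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,    # 2 x 64 = total batch size 128, assuming a single GPU
    bf16=True,
)

trainer = CPOTrainer(model=model, args=config, train_dataset=train_dataset,
                     processing_class=tokenizer)          # older TRL releases use tokenizer= instead
trainer.train()
```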
| Scenario | Recommended method |
|---|---|
| Paired dataset, budget for tuning sweep, memory-constrained | SimPO |
| Paired dataset, no tuning budget, H100 cluster available | DPO |
| Unpaired labels (thumbs up/down per response) | KTO |
| Want SFT + preference in one pass, reference-free | ORPO |
When DPO, ORPO, or KTO may still be safer
- Choose DPO when you need reproducible baseline behavior with minimal tuning. DPO's two-hyperparameter surface (lr, beta) is well-documented across the community, and the reference model provides an implicit stability floor that SimPO's margin term only partially replaces.
- Choose ORPO when you want to combine supervised fine-tuning and preference optimization in a single training pass without a reference model and without a dedicated hyperparameter sweep for a margin term.
- Choose KTO when your annotation pipeline produces per-response quality signals rather than pairwise comparisons — KTO is designed for this data regime and DPO and SimPO are not.
| Scenario | Avoid SimPO because... |
|---|---|
| No tuning compute budget | Requires sweep over 3 coupled hyperparameters |
| Unpaired preference data | Objective requires explicit (chosen, rejected) pairs |
| Need cross-lab reproducibility guarantee | Hardware/CUDA sensitivity; no universal defaults |
| Short, low-diversity response distribution | Length normalization advantage is marginal |
Reproduction notes from the released SimPO repository
The SimPO repository provides the practical scaffolding needed to attempt reproduction: released model checkpoints, training curves, evaluation scripts, and conda environment files. The eval README documents benchmark sizes — AlpacaEval 2 at 805 questions from 5 datasets, MT-Bench at 80 questions across 8 categories, Arena-Hard at 500 queries — and gives evaluation commands for each.
The repo's own trainer was updated for improved reproducibility after initial release, which means the current codebase may not produce identical training curves to earlier public runs. Any reproduction attempt should pin to a specific commit and verify against a released checkpoint before sweeping hyperparameters.
| Resource | What it covers |
|---|---|
| Released checkpoints | Released SimPO checkpoints linked from the README |
| Training curves | Loss and reward margin curves for reported runs |
| Environment files | Conda YAML files matching the authors' hardware setup |
| eval/README.md | Benchmark evaluation commands for AlpacaEval 2, MT-Bench, Arena-Hard |
Pro Tip: Match the CUDA and driver version from the repo's environment file before running evals, not just the Python package versions. This is a known source of the reproduction variance the repo warns about. When using TRL's SimPO implementation, confirm it matches the Princeton NLP trainer's reward computation exactly — particularly how it excludes padding tokens from the average log-probability calculation.
What the repo provides for reproducibility
The repo releases: model checkpoints (accessible via the link in the README), training curves for the main reported experiments, conda environment files for the authors' hardware configuration, and an evaluation directory with scripts for all three benchmarks. As the README states: "Please find all the released model checkpoints at this link."
Production Note: Checkpoint files are the most reliable reproduction entry point. Evaluate the released checkpoint first to establish a ground-truth score on your infrastructure. Only after confirming alignment between the checkpoint's score on your eval stack and the paper's reported score should you proceed to train from scratch. Discrepancy at the checkpoint-eval stage signals an infrastructure or benchmark-version mismatch, not a training problem.
The evaluation README confirms: "We provide details on the evaluation of the models in this directory." The benchmark scripts cover AlpacaEval 2 (including the judge model call), MT-Bench (GPT-4 judge over 8 categories), and Arena-Hard (pairwise against GPT-4 Turbo baseline).
What the repo does not solve for you
The repo's guidance is honest about its scope. As the README states: "Hyperparameter tuning is crucial for SimPO (and other preference optimization algorithms in general)." What the repo does not provide is a set of universally valid hyperparameters — the released settings were tuned for the authors' specific model families, dataset, and hardware.
Watch Out: Neither SimPO nor DPO ships with hyperparameter defaults that generalize across model families. The SimPO beta values that work for Gemma-2-9B-it will not work for a 1B or 70B model, and may not work for Llama-family models even at the same scale. The repo's tuning guidance is a starting point, not a recipe. Plan for at minimum a 2×3 grid search (two learning rates × three beta values at fixed gamma) before drawing conclusions about whether SimPO is helping on your specific setup.
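A trivial sketch of that minimum grid, with illustrative values rather than the repo's released settings:

```python
from itertools import product

learning_rates = (6e-7, 1e-6)      # two learning rates (illustrative values)
betas = (2.0, 2.5, 10.0)           # three beta values (illustrative values)
gamma = 1.0                        # held fixed for the first pass

for lr, beta in product(learning_rates, betas):
    run_name = f"simpo_lr{lr:.0e}_beta{beta}_gamma{gamma}"
    print(run_name)                # launch one training run per configuration via your own job runner
```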
What practitioners should take away from SimPO
SimPO establishes that the reference model is not a required component of preference optimization — the average log-probability of a sequence, normalized by length and separated by a margin, can substitute both the KL anchor and the relative scoring function that the reference-log-ratio term provided in DPO. The empirical case is credible: the paper reports consistent improvements over DPO and its variants across three alignment benchmarks under various settings, with headline deltas of +6.4 AlpacaEval 2 points and +7.5 Arena-Hard points, and the top Gemma-2-9B-it model achieving a 72.4% length-controlled win rate on AlpacaEval 2. As the paper states: "Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length."
The practitioner's decision is not whether SimPO is theoretically better, but whether the tuning cost of its three-dimensional hyperparameter space (learning_rate, beta, gamma) is worth the potential gain over a well-tuned DPO baseline in their specific lab setup.
| Condition | Recommendation |
|---|---|
| Memory is the bottleneck, team can run sweeps | Test SimPO first |
| No sweep budget, need reliable baseline | Use DPO |
| Unpaired preference data | Use KTO |
| Want SFT + alignment in one stage | Use ORPO |
| Reproducing the paper's exact results | Match CUDA, environment file, and eval version |
Bottom Line: SimPO is a credible upgrade path from DPO for preference optimization on paired datasets when you have the compute budget to tune beta and gamma properly. Its reference-free design saves memory and closes the training-generation gap, and the alignment benchmark results are the strongest reported for the <10B class. Do not adopt it as a drop-in replacement without a tuning run — the method's sensitivity to beta (which must be set much higher than DPO's beta) makes default-parameter comparisons meaningless.
A pragmatic adoption rule for alignment runs
Run SimPO if and only if you can evaluate on MT-Bench and Arena-Hard after every hyperparameter configuration, not just at the end of training. MT-Bench's 80-question, 8-category structure catches regressions on multi-turn reasoning that single-turn benchmarks miss. Arena-Hard's 500 technical queries are the strongest proxy for hard instruction-following quality in the <10B regime.
Pro Tip: Use MT-Bench as your fast-feedback signal during the beta/gamma sweep (it runs in minutes on a single GPU with a judge call budget under $5) and Arena-Hard as the final gate before committing to a full training run. If a SimPO configuration beats your DPO baseline on MT-Bench but loses on Arena-Hard, the gamma margin is likely too small — increase it before concluding SimPO underperforms.
Open questions for follow-up experiments
The verified sources establish benchmark superiority within the paper's tested configurations, but several questions remain open:
Watch Out: The SimPO paper does not demonstrate robustness across model scales beyond the tested families (primarily 7B–9B). Scaling behavior — whether the length-normalized reward and the margin term interact differently with 70B-scale models or with smaller 1B models — is not established. Similarly, the benefit of removing the reference model has only been demonstrated on specific hardware (NVIDIA H100, authors' CUDA version). Whether the gradient dynamics reported in the paper hold on A100 clusters or mixed-precision training with bf16 vs. fp16 remains unverified.
Open questions worth running:
- Does SimPO's advantage over DPO hold at 70B scale with LoRA-only fine-tuning, or does the missing KL anchor cause instability at larger parameter counts?
- How sensitive is the Arena-Hard win rate to the choice of $\gamma$ across different dataset sources (UltraFeedback vs. Anthropic HH vs. domain-specific preference data)?
- Does the length-normalization advantage diminish on datasets where chosen and rejected responses have similar lengths?
- Can SimPO and DPO be ensembled (e.g., train two checkpoints, merge via model averaging) to capture both the reference anchor stability and the generation-aligned reward?
FAQ
What is SimPO in machine learning?
SimPO (Simple Preference Optimization with a Reference-Free Reward) is a preference optimization algorithm for fine-tuning language models on paired preference data. It replaces the reference model used in DPO with a length-normalized average log-probability reward and a target reward margin, eliminating the need for a second forward pass during training.
How is SimPO different from DPO?
DPO computes implicit rewards as log-ratios between the policy and a frozen reference model: $\beta [\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)]$. SimPO drops the reference model entirely, using $\frac{\beta}{|y|} \log \pi_\theta(y|x)$ as the reward and adding a margin term $\gamma$ that requires the winning reward to exceed the losing reward by a minimum gap. This removes the memory cost of storing a frozen model and aligns the training reward with how the model actually generates at inference.
Does SimPO outperform DPO on AlpacaEval 2 and MT-Bench?
According to the arXiv paper (2405.14734), yes — SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and reports consistent improvements on MT-Bench under the authors' experimental setup. These results are benchmark- and hardware-specific; the repo explicitly warns that results vary with CUDA version and hyperparameter tuning.
Is SimPO reference-free?
Yes. SimPO does not require a reference model at any stage of training. The reward is computed entirely from the policy being trained.
What are the limitations of SimPO?
SimPO requires careful tuning of three coupled hyperparameters (learning_rate, beta, gamma), and beta must be set much higher than typical DPO values. Results do not transfer automatically across hardware configurations or CUDA versions. The paper does not demonstrate robustness at scales above ~9B parameters or across all preference dataset types. There is no universal hyperparameter default that works across model families.
Sources and references
- SimPO: Simple Preference Optimization with a Reference-Free Reward — Princeton NLP GitHub Repository — Primary source for training objective, benchmark results, hyperparameter guidance, environment files, and released checkpoints
- arXiv:2405.14734 — SimPO paper — The NeurIPS 2024 paper; source for reported benchmark numbers, abstract claims, and methodology
- SimPO eval/README.md — Documents benchmark suite sizes and evaluation commands for AlpacaEval 2, MT-Bench, and Arena-Hard
- TRL (Transformer Reinforcement Learning) — Hugging Face — Library whose CPO trainer provides a SimPO loss option for training with SimPO
- DPO: Direct Preference Optimization — arXiv:2305.18290 — Original DPO paper, baseline method SimPO is designed to improve upon
Pro Tip: When attempting reproduction, start with the released checkpoints and the repo's conda environment files rather than installing dependencies independently. Confirm your eval numbers match the paper's reported scores for the checkpoint before running any new training — this validates your infrastructure before you attribute any performance gap to SimPO versus your tuning choices. TRL's SimPO support is the most accessible starting point for teams already using the Hugging Face ecosystem.
Keywords: SimPO, DPO, ORPO, KTO, TRL, OpenRLHF, Axolotl, AlpacaEval 2, MT-Bench, Arena-Hard, Llama 3 8B Instruct, Flash Attention 2, NVIDIA H100, CUDA, NeurIPS 2024



