What SimPO changes in preference optimization
SimPO (Simple Preference Optimization with a Reference-Free Reward) removes the reference model entirely from the preference optimization loop — a change that reduces memory and compute overhead while, according to the arXiv paper (2405.14734), consistently outperforming DPO and its recent variants across AlpacaEval 2, MT-Bench, and Arena-Hard. The Princeton NLP team frames this as a fundamental redesign of the reward signal, not a superficial engineering shortcut: the reference-log-ratio term that anchors DPO's implicit reward is replaced by an average log-probability score computed directly from the policy being trained.
Bottom Line: SimPO eliminates the reference model from preference optimization by replacing DPO's reference-log-ratio reward with a length-normalized average log-probability reward, reporting gains of up to +6.4 points on AlpacaEval 2 and +7.5 points on Arena-Hard — but those numbers depend heavily on tuning learning_rate, beta, and gamma, and do not transfer automatically across hardware stacks.
The paper was presented at NeurIPS 2024, as reflected in the arXiv record for SimPO. The benchmark claims are empirically supported on the authors' tested configurations; the repo explicitly cautions that results may vary with hardware and CUDA versions, and that the method is not plug-and-play.
Why DPO's reference model matters in the first place
DPO derives a closed-form preference objective by treating the reference model's log-probability as a normalizing baseline. Given a winning response $y_w$ and a losing response $y_l$ for prompt $x$, the DPO implicit reward for response $y$ is:
$$r_{\text{DPO}}(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right]$$
The reference term $\log \pi_{\text{ref}}(y \mid x)$ keeps the policy from drifting too far from the supervised fine-tuned checkpoint — it acts as a KL-divergence anchor in reward space. Without it, nothing prevents the policy from assigning arbitrarily high log-probability to any response, regardless of quality. The reference model also provides the relative scoring baseline: a response is rewarded not for its absolute probability but for how much more probable the policy finds it than the frozen reference does.
This design has a practical cost: every DPO training step requires a forward pass through both the live policy and the reference model, roughly doubling peak GPU memory relative to standard SFT.
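To make the double forward pass concrete, here is a minimal sketch of DPO's implicit reward, assuming a Hugging Face-style causal LM whose labels mark prompt and padding positions with -100. The helper names (`sequence_logprob`, `dpo_implicit_reward`) are illustrative, not the repo's API, and attention masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of the response tokens (prompt/padding labels set to -100)."""
    logits = model(input_ids).logits[:, :-1, :]          # position t predicts token t+1
    targets = labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, 2, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != -100).float()                      # keep response tokens only
    return (token_logp * mask).sum(dim=-1)

def dpo_implicit_reward(policy, ref_model, input_ids, labels, beta=0.1):
    """r_DPO = beta * [log pi_theta(y|x) - log pi_ref(y|x)] -- requires TWO forward passes."""
    policy_lp = sequence_logprob(policy, input_ids, labels)
    with torch.no_grad():                                 # frozen reference: extra memory and compute
        ref_lp = sequence_logprob(ref_model, input_ids, labels)
    return beta * (policy_lp - ref_lp)
```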
Pro Tip: The reference-log-ratio term in DPO does two jobs simultaneously — KL regularization and relative scoring. SimPO must replace both functions without a reference model. Understanding this dual role clarifies why SimPO cannot simply drop the term and call it done; it needs a structural substitute.
Why a reference-free reward is the core design choice
SimPO is reference-free by construction. As the paper abstract states: "The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient."
The key insight is that average log-probability — sequence log-likelihood divided by sequence length — is already well-calibrated to what the model will actually generate at inference time. Greedy decoding and beam search both implicitly favor sequences with high average log-probability per token. DPO's reward, by contrast, scores relative to a frozen checkpoint that no longer participates in generation after SFT. Using average log-probability as the reward closes the gap between training objective and generation behavior without requiring a second model in memory.
For SimPO, the implicit reward for a response $y$ of length $|y|$ is:
$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
This eliminates the reference forward pass entirely.
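Under the same assumptions, SimPO's reward is the same policy log-probability, divided by the number of response tokens instead of compared against a frozen reference. The sketch below reuses the illustrative `sequence_logprob` helper from earlier; the -100 label mask is what makes the per-token average well-defined.

```python
def simpo_implicit_reward(policy, input_ids, labels, beta=2.0):
    """r_SimPO = (beta / |y|) * sum_t log pi_theta(y_t | x, y_<t) -- one forward pass, no reference."""
    targets = labels[:, 1:]
    mask = (targets != -100).float()
    response_len = mask.sum(dim=-1).clamp(min=1)          # |y|: number of response tokens per example
    summed_lp = sequence_logprob(policy, input_ids, labels)
    return beta * summed_lp / response_len
```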
The SimPO objective, written step by step
The full SimPO loss introduces one additional term beyond the reference-free reward: a target reward margin $\gamma > 0$ that pushes the winning response's reward to exceed the losing response's reward by at least $\gamma$ — the loss keeps supplying gradient until that margin is cleared. The objective is:
$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right]$$
where:
- $\frac{\beta}{|y|} \log \pi_\theta(y \mid x)$ is the length-normalized average log-probability reward for response $y$
- $\beta > 0$ scales the gap between winning and losing rewards (the "sharpness" of the sigmoid)
- $\gamma > 0$ is the target margin — the minimum reward gap the optimization pushes the model to achieve; the loss stays large, and the gradient stays strong, until the gap reaches it
- $\sigma(\cdot)$ is the logistic sigmoid function
- $\mathcal{D}$ is the preference dataset of (prompt, chosen, rejected) triples
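Putting the pieces together, a minimal loss sketch under the same assumptions as the reward snippets above; the tensor names (`chosen_ids`, `rejected_labels`, and so on) are hypothetical batch fields, and beta and gamma follow the paper's notation.

```python
def simpo_loss(policy, chosen_ids, chosen_labels, rejected_ids, rejected_labels,
               beta=2.0, gamma=1.0):
    """L_SimPO = -log sigmoid(r_w - r_l - gamma), averaged over the batch."""
    r_w = simpo_implicit_reward(policy, chosen_ids, chosen_labels, beta)      # winning reward
    r_l = simpo_implicit_reward(policy, rejected_ids, rejected_labels, beta)  # losing reward
    # Gradient keeps flowing until r_w - r_l clears the target margin gamma.
    return -F.logsigmoid(r_w - r_l - gamma).mean()
```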
The SimPO README identifies learning_rate, beta, and gamma as the three hyperparameters requiring the most attention, and recommends holding the total batch size fixed at 128 while sweeping those three.
| Benchmark | Concrete figure | Source context |
|---|---|---|
| AlpacaEval 2 | 805 questions from 5 datasets | Benchmark size used in the released evaluation setup |
| MT-Bench | 80 questions across 8 categories | Multi-turn judging setup used in the released evaluation setup |
| Arena-Hard | 500 technical queries | Pairwise hard-instruction benchmark used in the released evaluation setup |
From DPO's log-ratio to SimPO's average log-probability score
The structural difference between the two objectives becomes clear when you write them side by side. The DPO loss is:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma \left( \beta \underbrace{\left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right]}_{\text{reference-log-ratio term}} \right) \right]$$
SimPO removes the reference-log-ratio term and replaces it with length-normalized average log-probabilities plus the margin $\gamma$:

$$\mathcal{L}_{\text{SimPO}} = -\mathbb{E} \left[ \log \sigma \left( \underbrace{\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)}_{\text{reference-free, length-normalized reward}} - \gamma \right) \right]$$

The removed term $\log \pi_{\text{ref}}(y \mid x)$ is what required the second model forward pass. Once dropped, the optimizer sees only the policy's own log-probabilities — no frozen baseline in the gradient graph. The margin $\gamma$ partially compensates for the missing KL anchor: the loss keeps pushing until the reward gap exceeds a non-trivial threshold, rather than saturating as soon as the winning response is merely more probable than the losing one.
What the beta hyperparameter controls
In the SimPO objective, $\beta$ controls the reward scaling between the winning and losing responses. As the SimPO repo notes: "beta controls the reward scaling between winning and losing responses" and "SimPO requires a much larger beta than DPO."
Formally, the pre-sigmoid argument is:
$$z = \beta \left( \frac{\log \pi_\theta(y_w \mid x)}{|y_w|} - \frac{\log \pi_\theta(y_l \mid x)}{|y_l|} \right) - \gamma$$
When $\beta$ is small, even large differences in average log-probability produce a shallow sigmoid argument, and the gradient is weak regardless of how clearly the model distinguishes $y_w$ from $y_l$. When $\beta$ is large, the sigmoid saturates quickly — the model receives near-zero gradient once it has separated winning and losing rewards past the margin.
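A toy calculation (made-up numbers) illustrates the saturation argument: for a fixed average log-probability gap of 0.3 nats between chosen and rejected, the pre-sigmoid argument $z$ and the gradient scale $\sigma(-z)$ behave very differently across beta values.

```python
import math

def grad_scale(z):
    """Gradient magnitude is proportional to sigmoid(-z), and also to beta itself (see the gradient expression below)."""
    return 1.0 / (1.0 + math.exp(z))

gap, gamma = 0.3, 1.0              # assumed average log-prob gap and margin, for illustration only
for beta in (0.5, 2.0, 10.0, 50.0):
    z = beta * gap - gamma         # pre-sigmoid argument
    print(f"beta={beta:5.1f}  z={z:7.2f}  sigmoid(-z)={grad_scale(z):.6f}")
# beta=0.5 or 2.0: z stays negative and sigmoid(-z) stays high, but the update is still
# scaled down by the small beta; beta=50: z >> 0, sigmoid(-z) ~ 0 and the pair stops learning.
```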
Watch Out: Setting $\beta$ too large causes the SimPO loss to saturate early in training. Once the policy separates the average log-probabilities of winning and losing responses by more than $\gamma / \beta$ (equivalently, once the reward gap exceeds $\gamma$), gradients vanish and training stalls. This overconfidence failure mode is especially acute when the preference dataset contains easy pairs where $y_w$ and $y_l$ are clearly distinguishable. The repo warns not to tune $\beta$ in isolation — sweep it jointly with learning_rate and gamma.
Why sequence length normalization changes the gradient signal
Without length normalization, a response's log-probability is a sum over tokens:
$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
This sum scales with sequence length, so a response's unnormalized log-probability reflects how long it is as much as how probable each of its tokens is. In DPO, the reference-log-ratio partially cancels this length dependence because the reference model's log-probability accumulates over the same tokens. Once the reference is removed, a raw summed log-probability reward entangles length with quality: the reward gap between $y_w$ and $y_l$ is driven partly by their length difference rather than by per-token preference, which biases training toward exploiting length rather than improving responses.
SimPO normalizes by sequence length to correct this bias. The gradient of the SimPO loss with respect to a model parameter $\theta$ involves:

$$\frac{\partial \mathcal{L}_{\text{SimPO}}}{\partial \theta} \propto -\sigma(-z) \cdot \beta \left( \frac{1}{|y_w|} \frac{\partial \log \pi_\theta(y_w \mid x)}{\partial \theta} - \frac{1}{|y_l|} \frac{\partial \log \pi_\theta(y_l \mid x)}{\partial \theta} \right)$$
The $1/|y|$ factor means each token in a longer sequence contributes a proportionally smaller gradient magnitude. This keeps preference optimization length-neutral: a correct short response and a correct long response apply the same per-token update pressure, preventing the model from learning to pad.
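A toy comparison (synthetic numbers) shows why summed log-probabilities are not comparable across lengths while per-token averages are:

```python
short_resp = [-0.2] * 20      # 20 response tokens, all at the same per-token log-prob
long_resp  = [-0.2] * 120     # 120 response tokens, same per-token quality

for name, token_logps in (("short", short_resp), ("long", long_resp)):
    total = sum(token_logps)
    avg = total / len(token_logps)
    print(f"{name:5s}  summed log-prob = {total:7.1f}   average log-prob = {avg:5.2f}")
# summed:  short = -4.0 vs long = -24.0 -> identical per-token quality, very different scores
# average: both  = -0.20               -> the length-normalized reward treats them the same
```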
Reported benchmark results from the SimPO repo and paper
The arXiv paper (2405.14734) reports SimPO outperforming DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard. The top-performing model — Gemma-2-9B-it fine-tuned with SimPO — achieves a 72.4% length-controlled win rate on AlpacaEval 2 and a 59.1% win rate on Arena-Hard.
The evaluation suite used in the repo covers three benchmarks with different coverage:
| Benchmark | Format | Size | What it tests |
|---|---|---|---|
| AlpacaEval 2 | LLM-as-judge vs. GPT-4 | 805 questions from 5 datasets | Instruction following, LC win rate |
| MT-Bench | 8-category judge scoring | 80 questions | Multi-turn reasoning, writing, coding |
| Arena-Hard | Pairwise vs. GPT-4 Turbo | 500 technical queries | Hard problem-solving, STEM |
The paper's headline deltas against DPO baselines, reported across tested model families:
| Model base | Benchmark | SimPO | DPO | Delta |
|---|---|---|---|---|
| Gemma-2-9B-it | AlpacaEval 2 LC win rate | 72.4% | — | best reported model under 10B |
| Gemma-2-9B-it | Arena-Hard win rate | 59.1% | — | best reported model under 10B |
| Various tested configs | AlpacaEval 2 | — | baseline | up to +6.4 pts over DPO |
| Various tested configs | Arena-Hard | — | baseline | up to +7.5 pts over DPO |
Watch Out: The AlpacaEval 2 evaluation pipeline has undergone a major revision for vLLM decoding since version 0.6.3. The SimPO repo explicitly notes this creates a discrepancy relative to the authors' original experiments. If you run AlpacaEval 2 with a newer vLLM version, numbers will not match the paper directly.
Where SimPO looks strongest relative to DPO variants
As the SimPO README states: "SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings." The strongest reported deltas occur on Arena-Hard (+7.5 pts) and AlpacaEval 2 (+6.4 pts), which are the two benchmarks most sensitive to instruction-following quality and response calibration. MT-Bench shows improvements but smaller absolute margins given its smaller question pool and the noisier judge signal from 80 items.
| Setting | Metric | SimPO result |
|---|---|---|
| Best overall (Gemma-2-9B-it) | AlpacaEval 2 LC win rate | 72.4% (absolute) |
| Best overall (Gemma-2-9B-it) | Arena-Hard win rate | 59.1% (absolute) |
| Across all tested configs | AlpacaEval 2 delta vs. DPO | up to +6.4 pts |
| Across all tested configs | Arena-Hard delta vs. DPO | up to +7.5 pts |
SimPO's margin is largest when the preference dataset contains high-quality chosen/rejected pairs with non-trivial length variation — exactly the regime where length normalization makes the reward signal cleaner than DPO's unnormalized log-ratio.
Why benchmark wins do not automatically transfer across labs
The Princeton NLP team is explicit about this: "Hyperparameter tuning is crucial for SimPO (and other preference optimization algorithms in general)." The three primary knobs are learning_rate, beta, and gamma. Beyond hyperparameters, the repo states that "results may still vary due to differences in hardware configurations and CUDA versions, etc."
The sources of cross-lab variance are cumulative: different GPU generations (H100 vs. A100 vs. consumer hardware) change numerics slightly through floating-point accumulation order in FlashAttention 2; CUDA versions affect kernel selection in fused attention implementations; and the SimPO trainer itself was updated for better reproducibility after initial release, meaning runs against an older commit may not match current code.
Watch Out: Do not treat the paper's reported numbers as a floor you can expect to clear by default. The +6.4 AlpacaEval 2 delta and +7.5 Arena-Hard delta are best-case results under the authors' tuned hyperparameters on their hardware. Out-of-the-box runs with DPO-scale beta values will underperform because SimPO requires a much larger beta than DPO — the reward scaling is fundamentally different.
Pro Tip: Before starting a fresh SimPO run, load the authors' released checkpoints and evaluate them in your environment using the repo's eval scripts. If their checkpoint does not reproduce their reported scores on your stack, the gap is infrastructure — not hyperparameters. Fix the environment first. The repo ships conda environment files specifically for this diagnostic step. TRL also supports the SimPO loss (through its CPO trainer); verify which code path matches the released checkpoints before tuning.
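A minimal environment sanity check along these lines, assuming the transformers library (plus accelerate for `device_map="auto"`) and enough GPU memory for a 9B model in bf16; the checkpoint ID below is the one advertised in the SimPO README at the time of writing and should be verified against the repo's release list. Full benchmark scores should still come from the repo's eval scripts, not this snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ckpt = "princeton-nlp/gemma-2-9b-it-SimPO"   # released SimPO checkpoint per the README; verify the exact ID
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain length normalization in preference optimization in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=200, do_sample=False)   # greedy decode as a smoke test
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```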
How SimPO compares with DPO, ORPO, and KTO in practice
The SimPO paper explicitly compares against DPO and its variants. Direct benchmark comparisons against ORPO and KTO do not appear in the primary SimPO sources, so the table below uses general engineering characterization for those two methods rather than specific scores.
| Method | Objective form | Reference model | Margin term | Tuning burden | Memory vs. SFT |
|---|---|---|---|---|---|
| DPO | Log-ratio of policy vs. ref | Required (frozen copy) | None | Moderate (lr, beta) | ~2× |
| SimPO | Avg log-prob difference + margin | Not required | $\gamma$ required | High (lr, beta, gamma) | ~1× |
| ORPO | Reference-free preference objective | Not required | None | Moderate | Lower than reference-based methods |
| KTO | Preference-style objective for non-paired signals | Not required in the SimPO paper context | None | Moderate | Lower than reference-based methods |
ORPO is closest in spirit to SimPO in that both are reference-free, but ORPO integrates a supervised fine-tuning objective directly into its loss. KTO addresses a different data regime than SimPO targets, so it is better treated as a separate option rather than a drop-in baseline.
When SimPO is the better choice
- Choose SimPO when you have a well-curated paired preference dataset (chosen/rejected), your team can run a proper hyperparameter sweep over learning_rate, beta, and gamma, and you want to eliminate the memory overhead of a frozen reference model on H100-class hardware where that overhead is the limiting constraint.
- Choose SimPO when you are optimizing for AlpacaEval 2 or Arena-Hard as primary eval targets — the reported gains are largest on these benchmarks.
- Choose SimPO when your base model is a strong instruction-tuned checkpoint (e.g., Llama 3 8B Instruct or Gemma-2-9B-it) that already generates coherent responses; the length-normalized reward signal works best when the policy is not in a degenerate initial state.
Pro Tip: TRL exposes the SimPO objective through its CPO trainer (loss_type="simpo"), and the Princeton NLP repo ships its own SimPO trainer. Start with the authors' recommended total batch size of 128, then grid-search beta in the range 2–20 (much higher than typical DPO beta of 0.1–0.5) and gamma in the range 0.5–1.5 before touching learning_rate.
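A hedged configuration sketch using TRL's CPO trainer with the SimPO loss. The parameter names (loss_type, cpo_alpha, simpo_gamma, beta) follow the TRL documentation at the time of writing and should be checked against your installed version; the model and dataset IDs, learning rate, and batch-size split are placeholders rather than the authors' released setup.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"        # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any (prompt, chosen, rejected) preference dataset in a format your TRL version accepts.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # placeholder dataset

config = CPOConfig(
    output_dir="simpo-llama3-8b",
    loss_type="simpo",                 # SimPO objective
    cpo_alpha=0.0,                     # disable the CPO SFT/NLL term -> pure SimPO
    beta=2.0,                          # much larger than typical DPO beta (0.1-0.5)
    simpo_gamma=1.0,                   # target reward margin gamma
    learning_rate=6e-7,                # illustrative starting point, not a released setting
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,    # 2 x 64 = total batch size 128, assuming a single GPU
    bf16=True,
)

trainer = CPOTrainer(model=model, args=config, train_dataset=train_dataset,
                     processing_class=tokenizer)          # older TRL releases use tokenizer= instead
trainer.train()
```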
| Scenario | Recommended method |
|---|---|
| Paired dataset, budget for tuning sweep, memory-constrained | SimPO |
| Paired dataset, no tuning budget, H100 cluster available | DPO |
| Unpaired labels (thumbs up/down per response) | KTO |
| Want SFT + preference in one pass, reference-free | ORPO |
When DPO, ORPO, or KTO may still be safer
- Choose DPO when you need reproducible baseline behavior with minimal tuning. DPO's two-hyperparameter surface (lr, beta) is well-documented across the community, and the reference model provides an implicit stability floor that SimPO's margin term only partially replaces.
- Choose ORPO when you want to combine supervised fine-tuning and preference optimization in a single training pass without a reference model and without a dedicated hyperparameter sweep for a margin term.
- Choose KTO when your annotation pipeline produces per-response quality signals rather than pairwise comparisons — KTO is designed for this data regime and DPO and SimPO are not.
| Scenario | Avoid SimPO because... |
|---|---|
| No tuning compute budget | Requires sweep over 3 coupled hyperparameters |
| Unpaired preference data | Objective requires explicit (chosen, rejected) pairs |
| Need cross-lab reproducibility guarantee | Hardware/CUDA sensitivity; no universal defaults |
| Short, low-diversity response distribution | Length normalization advantage is marginal |
Reproduction notes from the released SimPO repository
The SimPO repository provides the practical scaffolding needed to attempt reproduction: released model checkpoints, training curves, evaluation scripts, and conda environment files. The eval README documents benchmark sizes — AlpacaEval 2 at 805 questions from 5 datasets, MT-Bench at 80 questions across 8 categories, Arena-Hard at 500 queries — and gives evaluation commands for each.
The repo's own trainer was updated for improved reproducibility after initial release, which means the current codebase may not produce identical training curves to earlier public runs. Any reproduction attempt should pin to a specific commit and verify against a released checkpoint before sweeping hyperparameters.
| Resource | What it covers |
|---|---|
| Released checkpoints | Released SimPO checkpoints linked from the README |
| Training curves | Loss and reward margin curves for reported runs |
| Environment files | Conda YAML files matching the authors' hardware setup |
| eval/README.md | Benchmark evaluation commands for AlpacaEval 2, MT-Bench, Arena-Hard |
Pro Tip: Match the CUDA and driver version from the repo's environment file before running evals, not just the Python package versions. This is a known source of the reproduction variance the repo warns about. When using TRL's SimPO implementation, confirm it matches the Princeton NLP trainer's reward computation exactly — particularly how it excludes padding tokens from the average log-probability calculation.
What the repo provides for reproducibility
The repo releases: model checkpoints (accessible via the link in the README), training curves for the main reported experiments, conda environment files for the authors' hardware configuration, and an evaluation directory with scripts for all three benchmarks. As the README states: "Please find all the released model checkpoints at this link."
Production Note: Checkpoint files are the most reliable reproduction entry point. Evaluate the released checkpoint first to establish a ground-truth score on your infrastructure. Only after confirming alignment between the checkpoint's score on your eval stack and the paper's reported score should you proceed to train from scratch. Discrepancy at the checkpoint-eval stage signals an infrastructure or benchmark-version mismatch, not a training problem.
The evaluation README confirms: "We provide details on the evaluation of the models in this directory." The benchmark scripts cover AlpacaEval 2 (including the judge model call), MT-Bench (GPT-4 judge over 8 categories), and Arena-Hard (pairwise against GPT-4 Turbo baseline).
What the repo does not solve for you
The repo's guidance is honest about its scope. As the README states: "Hyperparameter tuning is crucial for SimPO (and other preference optimization algorithms in general)." What the repo does not provide is a set of universally valid hyperparameters — the released settings were tuned for the authors' specific model families, dataset, and hardware.
Watch Out: Neither SimPO nor DPO ships with hyperparameter defaults that generalize across model families. The SimPO beta values that work for Gemma-2-9B-it will not work for a 1B or 70B model, and may not work for Llama-family models even at the same scale. The repo's tuning guidance is a starting point, not a recipe. Plan for at minimum a 2×3 grid search (two learning rates × three beta values at fixed gamma) before drawing conclusions about whether SimPO is helping on your specific setup.
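A trivial sketch of that minimum grid, with illustrative values rather than the repo's released settings:

```python
from itertools import product

learning_rates = (6e-7, 1e-6)      # two learning rates (illustrative values)
betas = (2.0, 2.5, 10.0)           # three beta values (illustrative values)
gamma = 1.0                        # held fixed for the first pass

for lr, beta in product(learning_rates, betas):
    run_name = f"simpo_lr{lr:.0e}_beta{beta}_gamma{gamma}"
    print(run_name)                # launch one training run per configuration via your own job runner
```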
What practitioners should take away from SimPO
SimPO establishes that the reference model is not a required component of preference optimization — the average log-probability of a sequence, normalized by length and separated by a margin, can substitute both the KL anchor and the relative scoring function that the reference-log-ratio term provided in DPO. The empirical case is credible: the paper reports consistent improvements over DPO and its variants across three alignment benchmarks under various settings, with headline deltas of +6.4 AlpacaEval 2 points and +7.5 Arena-Hard points, and the top Gemma-2-9B-it model achieving a 72.4% length-controlled win rate on AlpacaEval 2. As the paper states: "Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length."
The practitioner's decision is not whether SimPO is theoretically better, but whether the tuning cost of its three-dimensional hyperparameter space (learning_rate, beta, gamma) is worth the potential gain over a well-tuned DPO baseline in their specific lab setup.
| Condition | Recommendation |
|---|---|
| Memory is the bottleneck, team can run sweeps | Test SimPO first |
| No sweep budget, need reliable baseline | Use DPO |
| Unpaired preference data | Use KTO |
| Want SFT + alignment in one stage | Use ORPO |
| Reproducing the paper's exact results | Match CUDA, environment file, and eval version |
Bottom Line: SimPO is a credible upgrade path from DPO for preference optimization on paired datasets when you have the compute budget to tune beta and gamma properly. Its reference-free design saves memory and closes the training-generation gap, and the alignment benchmark results are the strongest reported for the <10B class. Do not adopt it as a drop-in replacement without a tuning run — the method's sensitivity to beta (which must be set much higher than DPO's beta) makes default-parameter comparisons meaningless.
A pragmatic adoption rule for alignment runs
Run SimPO if and only if you can evaluate on MT-Bench and Arena-Hard after every hyperparameter configuration, not just at the end of training. MT-Bench's 80-question, 8-category structure catches regressions on multi-turn reasoning that single-turn benchmarks miss. Arena-Hard's 500 technical queries are the strongest proxy for hard instruction-following quality in the <10B regime.
Pro Tip: Use MT-Bench as your fast-feedback signal during the beta/gamma sweep (it runs in minutes on a single GPU with a judge call budget under $5) and Arena-Hard as the final gate before committing to a full training run. If a SimPO configuration beats your DPO baseline on MT-Bench but loses on Arena-Hard, the gamma margin is likely too small — increase it before concluding SimPO underperforms.
Open questions for follow-up experiments
The verified sources establish benchmark superiority within the paper's tested configurations, but several questions remain open:
Watch Out: The SimPO paper does not demonstrate robustness across model scales beyond the tested families (primarily 7B–9B). Scaling behavior — whether the length-normalized reward and the margin term interact differently with 70B-scale models or with smaller 1B models — is not established. Similarly, the benefit of removing the reference model has only been demonstrated on specific hardware (NVIDIA H100, authors' CUDA version). Whether the gradient dynamics reported in the paper hold on A100 clusters or mixed-precision training with bf16 vs. fp16 remains unverified.
Open questions worth running:
- Does SimPO's advantage over DPO hold at 70B scale with LoRA-only fine-tuning, or does the missing KL anchor cause instability at larger parameter counts?
- How sensitive is the Arena-Hard win rate to the choice of $\gamma$ across different dataset sources (UltraFeedback vs. Anthropic HH vs. domain-specific preference data)?
- Does the length-normalization advantage diminish on datasets where chosen and rejected responses have similar lengths?
- Can SimPO and DPO be ensembled (e.g., train two checkpoints, merge via model averaging) to capture both the reference anchor stability and the generation-aligned reward?
FAQ
What is SimPO in machine learning?
SimPO (Simple Preference Optimization with a Reference-Free Reward) is a preference optimization algorithm for fine-tuning language models on paired preference data. It replaces the reference model used in DPO with a length-normalized average log-probability reward and a target reward margin, eliminating the need for a second forward pass during training.
How is SimPO different from DPO?
DPO computes implicit rewards as log-ratios between the policy and a frozen reference model: $\beta [\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)]$. SimPO drops the reference model entirely, using $\frac{\beta}{|y|} \log \pi_\theta(y|x)$ as the reward and adding a margin term $\gamma$ that requires the winning reward to exceed the losing reward by a minimum gap. This removes the memory cost of storing a frozen model and aligns the training reward with how the model actually generates at inference.
Does SimPO outperform DPO on AlpacaEval 2 and MT-Bench?
According to the arXiv paper (2405.14734), yes — SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and reports consistent improvements on MT-Bench under the authors' experimental setup. These results are benchmark- and hardware-specific; the repo explicitly warns that results vary with CUDA version and hyperparameter tuning.
Is SimPO reference-free?
Yes. SimPO does not require a reference model at any stage of training. The reward is computed entirely from the policy being trained.
What are the limitations of SimPO?
SimPO requires careful tuning of three coupled hyperparameters (learning_rate, beta, gamma), and beta must be set much higher than typical DPO values. Results do not transfer automatically across hardware configurations or CUDA versions. The paper does not demonstrate robustness at scales above ~9B parameters or across all preference dataset types. There is no universal hyperparameter default that works across model families.
Sources and references
- SimPO: Simple Preference Optimization with a Reference-Free Reward — Princeton NLP GitHub Repository — Primary source for training objective, benchmark results, hyperparameter guidance, environment files, and released checkpoints
- arXiv:2405.14734 — SimPO paper — The NeurIPS 2024 paper; source for reported benchmark numbers, abstract claims, and methodology
- SimPO eval/README.md — Documents benchmark suite sizes and evaluation commands for AlpacaEval 2, MT-Bench, and Arena-Hard
- TRL (Transformer Reinforcement Learning) — Hugging Face — Library whose CPO trainer provides a SimPO loss option for training with SimPO
- DPO: Direct Preference Optimization — arXiv:2305.18290 — Original DPO paper, baseline method SimPO is designed to improve upon
Pro Tip: When attempting reproduction, start with the released checkpoints and the repo's conda environment files rather than installing dependencies independently. Confirm your eval numbers match the paper's reported scores for the checkpoint before running any new training — this validates your infrastructure before you attribute any performance gap to SimPO versus your tuning choices. TRL's SimPO support is the most accessible starting point for teams already using the Hugging Face ecosystem.
Keywords: SimPO, DPO, ORPO, KTO, TRL, OpenRLHF, Axolotl, AlpacaEval 2, MT-Bench, Arena-Hard, Llama 3 8B Instruct, Flash Attention 2, NVIDIA H100, CUDA, NeurIPS 2024



