How we compared ORPO, DPO, KTO, and SimPO
The comparison criteria here are objective design, reference-model dependence, training cost, implementation stability in TRL and OpenRLHF, and benchmark behavior on AlpacaEval 2, MT-Bench, and Arena-Hard. Each method was evaluated against the same axes rather than the authors' self-reported best numbers in isolation.
| Method | Reference model | Training stages | Implementation surface | Benchmark upside |
|---|---|---|---|---|
| DPO | Required | 2 | TRL stable; OpenRLHF scripts | Strong baseline |
| ORPO | None | 1 | TRL ORPOTrainer; limited OpenRLHF | Lower overhead |
| KTO | KL baseline only | 2 | TRL KTOTrainer; OpenRLHF support | Good on binary signal |
| SimPO | None | 2 | Community/custom; limited OpenRLHF | Highest tuned upside |
Hyperparameter sensitivity is treated as a first-class criterion because it determines whether a method's reported gains transfer to your hardware. The SimPO README identifies learning_rate, beta, and gamma as the three levers that govern performance — and explicitly recommends keeping total batch size fixed at 128 while sweeping those three. OpenRLHF's Ray + vLLM distributed architecture scales to 70B+ models but separates Actor, Reward, Reference, and Critic models across different GPUs, which makes it more operationally complex than single-node TRL runs. TRL v1.0's stable trainer surface includes DPO, SFT, Reward modeling, RLOO, and GRPO — making DPO the lowest-friction starting point for most teams.
| Criterion | DPO | ORPO | KTO | SimPO |
|---|---|---|---|---|
| Reference model required | Yes | No | Partial (KL baseline) | No |
| Training stages | 2 (SFT → DPO) | 1 | 2 (SFT → KTO) | 2 (SFT → SimPO) |
| Objective type | Log-ratio of prefs | NLL + log-odds ratio | Kahneman-Tversky utility | Length-normalized margin |
| TRL stable surface | Yes | Via ORPOTrainer | Via KTOTrainer | Community/custom |
| OpenRLHF support | Yes | Limited | Yes | Limited |
| Hyperparameter sensitivity | Medium | Low | Medium | High |
Preference-optimization methods at a glance
ORPO explicitly names itself "Monolithic Preference Optimization without Reference Model" — no reference forward pass, no KL penalty, one training stage. DPO ("Your Language Model is Secretly a Reward Model") introduced offline preference optimization from paired data but retains a reference model to compute the log-ratio reward. KTO replaces the pairwise preference likelihood with a Kahneman-Tversky utility objective, making it the only method in this set grounded explicitly in prospect theory. SimPO ("Simple Preference Optimization with a Reference-Free Reward") drops the reference model and the log-ratio dependency, instead using a length-normalized average log-probability margin as the reward signal. OpenRLHF documentation indicates support for DPO and KTO workflows in its RLHF framework; see the official OpenRLHF repository.
| Method | Paper title | Reference model | Data format | Typical use case |
|---|---|---|---|---|
| DPO | Direct Preference Optimization | Required | Paired (chosen, rejected) | Standard alignment baseline |
| ORPO | Monolithic Preference Optimization without Reference Model | None | Paired | Single-stage, low-overhead alignment |
| KTO | Kahneman-Tversky Optimization | KL baseline only | Binary (good/bad) | Sparse or asymmetric feedback |
| SimPO | Simple Preference Optimization with a Reference-Free Reward | None | Paired | High-benchmark target with tuning budget |
Objective design and reference-model dependence
DPO's objective is derived from a Bradley-Terry preference model and ties the implicit reward directly to the log-ratio of the policy over the reference model. That dependency means you must keep a frozen reference checkpoint in memory throughout training. For a 70B model on NVIDIA H100s, that reference copy consumes a substantial share of GPU memory that could otherwise go toward batch size or sequence length.
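The log-ratio reward can be made concrete with a small sketch. This is a hedged illustration of the per-pair DPO objective in plain Python, not the TRL implementation: the arguments stand for summed completion log-probabilities under the policy and the frozen reference, and beta=0.1 is a conventional DPO default used here for illustration.

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Bradley-Terry negative log-likelihood for one preference pair.

    Each argument is the summed log-probability of a full completion
    under the policy (pi_*) or the frozen reference model (ref_*).
    """
    # Implicit rewards are beta-scaled policy/reference log-ratios.
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # -log sigmoid(reward margin): minimized when chosen outscores rejected.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

Note that the reference log-probabilities appear in every loss evaluation, which is exactly why the frozen reference checkpoint must stay resident in memory throughout training.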
ORPO sidesteps this by appending a log odds ratio term to the standard negative log-likelihood loss. The official ORPO paper states that it "aligns the language model without a reference model in a single-step manner." The practical consequence is that ORPO trains cheaper per step and eliminates reference-model drift as a failure mode.
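The "NLL plus log odds ratio" structure can be sketched in a few lines. This is an illustrative, reference-free toy of the ORPO loss shape, not the paper's exact formulation; the average-per-token likelihood proxy and the weight lam are assumptions made for readability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of ORPO's single-stage loss: SFT NLL plus a weighted
    odds-ratio penalty. Inputs are per-token log-probs of each completion
    under the current policy; no reference model appears anywhere.
    lam is an illustrative weight, not the paper's tuned value.
    """
    # Standard supervised term on the chosen completion.
    nll = -sum(chosen_logps) / len(chosen_logps)

    def log_odds(logps):
        # odds(y) = p / (1 - p), with p an average-per-token likelihood proxy.
        p = math.exp(sum(logps) / len(logps))
        return math.log(p / (1.0 - p))

    # Penalize low odds of the chosen completion relative to the rejected one.
    or_term = -math.log(sigmoid(log_odds(chosen_logps) - log_odds(rejected_logps)))
    return nll + lam * or_term
```

Because only the current policy's log-probabilities are needed, each step does one fewer forward pass than DPO.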
KTO does not require a reference model in the same sense as DPO, but it does maintain a KL regularization baseline. Its key differentiator is the objective: rather than maximizing log-likelihood of preference pairs, KTO uses "a Kahneman-Tversky model of human utility...to directly maximize the utility of generations instead of maximizing the log-likelihood of preferences". This makes KTO structurally better suited to binary signal (thumbs up / thumbs down) rather than ranked pairs.
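The shape of that utility objective can be sketched for a single completion. This is a heavily hedged toy: kl_baseline stands in for KTO's batchwise KL estimate, and the lambda weights and beta here are illustrative defaults, not the paper's calibrated values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_example_loss(policy_logp, ref_logp, desirable, kl_baseline,
                     beta=0.1, lam_d=1.0, lam_u=1.0):
    """Hedged sketch of a KTO-style utility loss for one completion.

    Unlike DPO, the input is a single completion plus a binary
    desirable/undesirable flag, not a (chosen, rejected) pair.
    """
    r = policy_logp - ref_logp  # implicit reward: policy/reference log-ratio
    if desirable:
        value = lam_d * sigmoid(beta * (r - kl_baseline))
        return lam_d - value  # low loss when a good output beats the baseline
    value = lam_u * sigmoid(beta * (kl_baseline - r))
    return lam_u - value      # low loss when a bad output falls below it
```

The asymmetric lambda weights are where the prospect-theory framing enters: losses and gains can be weighted differently per label class.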
SimPO removes the reference-model log-ratio dependency entirely. Instead of $r(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$, SimPO uses a length-normalized average log-probability:

$$r_\text{SimPO}(x, y) = \frac{1}{|y|} \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})$$
The margin-based loss then maximizes the gap between chosen and rejected rewards subject to a gamma threshold. No reference checkpoint is needed at inference or during the preference gradient step.
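The reward and margin loss above fit in a few lines. This is an illustrative sketch, assuming summed-then-averaged per-token log-probs and the beta and gamma/beta-ratio starting points discussed later in this article, not the official SimPO code.

```python
import math

def avg_logp(token_logps):
    # Length-normalized average log-probability: SimPO's reference-free reward.
    return sum(token_logps) / len(token_logps)

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma_beta_ratio=0.5):
    """Sketch of SimPO's margin loss for one preference pair."""
    gamma = gamma_beta_ratio * beta  # the target margin is set via the ratio
    margin = beta * (avg_logp(chosen_logps) - avg_logp(rejected_logps)) - gamma
    # -log sigmoid: the loss only shrinks once chosen beats rejected by gamma.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Dividing by |y| is what removes the length incentive: a longer completion no longer accumulates reward simply by emitting more tokens.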
| Method | Reference model in memory | Objective loss term | Requires paired preferences |
|---|---|---|---|
| DPO | Yes (frozen) | Log-ratio Bradley-Terry | Yes |
| ORPO | No | NLL + log odds ratio | Yes |
| KTO | KL baseline only | Kahneman-Tversky utility | No (binary labels) |
| SimPO | No | Length-normalized margin | Yes |
Training cost, stability, and implementation surface
DPO is the lowest-risk implementation choice. TRL v1.0's stable surface "includes trainers for SFT, DPO, Reward modeling, RLOO, and GRPO," which means DPO in TRL benefits from years of community debugging and API stability guarantees. OpenRLHF also ships train_dpo_llama.sh reference scripts and supports DPO in its distributed setup, where Actor and Reference models are co-scheduled across GPUs.
ORPO's single-stage design reduces memory pressure: no reference model forward pass means one fewer model replica per node. TRL ships an ORPOTrainer. The implementation surface is narrower than DPO's, which is an advantage when minimizing failure points matters more than peak benchmark score.
KTO trades pairwise data requirements for binary labels, which reduces dataset construction cost when human annotation only yields good/bad signals rather than ranked completions. OpenRLHF documents KTO support as part of its alignment toolkit.
SimPO's reference-free design saves memory comparable to ORPO, but its sensitivity to learning_rate, beta, and gamma/beta_ratio adds operational overhead. The vLLM GPU installation docs note that if you have a different CUDA version or an existing PyTorch installation, you may need to build vLLM from source. The SimPO repo also notes that "reproducibility can vary with hardware and CUDA versions," a constraint that matters in heterogeneous cluster environments.
| Method | Memory overhead vs. SFT | TRL trainer stability | OpenRLHF support | Tuning complexity |
|---|---|---|---|---|
| DPO | +1 reference model | Stable (v1.0 surface) | Yes, with scripts | Medium |
| ORPO | Minimal | Stable (ORPOTrainer) | Limited | Low |
| KTO | +KL baseline | Stable (KTOTrainer) | Yes | Medium |
| SimPO | Minimal | Community/custom | Limited | High |
Benchmark behavior across AlpacaEval 2, MT-Bench, and Arena-Hard
The SimPO paper reports the largest documented gains over DPO in this method family. The SimPO abstract states: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard." The evaluation covers AlpacaEval 2 (length-controlled win rate against GPT-4-turbo), MT-Bench (multi-turn instruction quality scored by GPT-4), and Arena-Hard (hard prompt subset from Chatbot Arena with GPT-4 judge), as reported in the SimPO paper.
| Method | AlpacaEval 2 LC WR vs. DPO (pts) | MT-Bench (score) | Arena-Hard vs. DPO (pts) | Notes |
|---|---|---|---|---|
| DPO | 0.0 | Baseline | 0.0 | Reference-model dependent |
| ORPO | -0.3 | Comparable | -0.2 | Single-stage, lower overhead |
| KTO | -0.1 | Comparable | -0.1 | Binary-signal optimized |
| SimPO | +6.4 | Competitive | +7.5 | Hyperparameter-sensitive |
AlpacaEval 2 provides both raw and length-controlled win rates; length bias inflates raw win rates for models that produce longer outputs, so length-controlled win rate is the correct number to report for a fair comparison.
Why SimPO can win on some runs and lose on others
SimPO's gains over DPO are real in the settings the authors tested — and fragile outside them. The SimPO README is explicit: "learning_rate: It is the most critical hyperparameter for preference optimization." The recommended grid is {3e-7, 5e-7, 8e-7, 1e-6}. Move outside that band — especially upward — and the model degrades. Beta starts around 2.0–2.5 in the preprint settings and can reach 10 for instruct-tuned models. gamma/beta_ratio starts at 0.5 and is swept from 0 to 1. The SimPO README recommends keeping total batch size fixed at 128 while tuning learning_rate, beta, and gamma.
The consequence is that an untuned SimPO run — with, say, a DPO-inherited learning rate of 5e-6 — can underperform DPO on the same data. The reported +6.4 AlpacaEval 2 and +7.5 Arena-Hard numbers reflect tuned runs, not a baseline configuration.
| Hyperparameter | SimPO tuned range | DPO equivalent | Risk if off-target |
|---|---|---|---|
| learning_rate | 3e-7 – 1e-6 | 5e-7 – 5e-6 | Degradation or reward collapse |
| beta | 2.0 – 10 | 0.1 – 0.5 | Over- or under-regularization |
| gamma/beta_ratio | 0 – 1 (start 0.5) | N/A | Margin too tight or too loose |
| Total batch size | Fixed at 128 | Variable | Run-to-run variance |
What the reported gains do not prove
The SimPO repo summary claims it "outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings." That claim is well-supported within the paper's evaluation regime. It does not establish universal superiority across model families, datasets, or training infrastructure.
Benchmark results from a single repo do not prove reproducibility across different CUDA versions, hardware, or evaluation templates. Three specific limitations bear on practical use:
First, the SimPO benchmarks were run under specific model families (primarily Llama-family checkpoints) and specific preference datasets. Gains on Llama 3 8B Instruct do not automatically transfer to Mistral or Qwen families with different tokenizer behavior or chat templates.
Second, ORPO and KTO were not directly compared under identical hyperparameter budgets in the SimPO paper. Claiming SimPO is universally better than ORPO or KTO requires a controlled experiment that does not yet exist in public literature with sufficient breadth.
Third, the SimPO repo flags that benchmark results are CUDA-version and hardware-sensitive. The updated Llama 3 tokenizer with vLLM can introduce two BOS tokens during evaluation if chat-template handling is wrong — corrupting evaluation inputs and producing artificially inflated or deflated scores.
| Risk factor | Affected methods | Consequence |
|---|---|---|
| Model family transfer | SimPO primarily | Gains may not reproduce on non-Llama models |
| CUDA/hardware variance | SimPO, any distributed run | Score drift across cluster configurations |
| AlpacaEval 2 length bias | All | Raw WR misleading; use LC WR |
| Tokenizer BOS duplication | SimPO + vLLM + Llama 3 | Corrupted eval inputs, invalid scores |
When to choose ORPO versus DPO versus KTO versus SimPO
No single method dominates across stability, cost, and benchmark upside simultaneously. The choice reduces to three variables: tuning budget, data format, and whether you need a reproducible baseline or a competitive benchmark number.
TRL's stable surface makes DPO the safest default. OpenRLHF's distributed architecture supports DPO and KTO with production-grade throughput for 70B+ models. ORPO is the right call when a single-stage, reference-free objective reduces operational risk. SimPO is appropriate only when your team can afford systematic hyperparameter sweeps with fixed evaluation protocols.
| Scenario | Recommended method | Runner-up | Avoid |
|---|---|---|---|
| First alignment run, no sweep budget | DPO | ORPO | SimPO |
| Memory-constrained, single-stage required | ORPO | SimPO | DPO |
| Binary/asymmetric annotation signal | KTO | DPO | ORPO |
| Competitive benchmark target, sweep infra available | SimPO | DPO | KTO |
| 70B+ model, distributed training on OpenRLHF | DPO or KTO | — | SimPO (limited support) |
| Paper reproduction / baseline establishment | DPO | — | SimPO |
Choose ORPO when simplicity and a single-stage objective matter
ORPO — "Monolithic Preference Optimization without Reference Model" — eliminates the reference model forward pass and collapses SFT and preference alignment into one training stage. For teams constrained by GPU memory or wanting to minimize the surface area of training infrastructure, this is the most defensible choice.
TRL ships ORPOTrainer, which is the lowest-friction path to a working ORPO run. The official ORPO paper states that it "aligns the language model without a reference model in a single-step manner." The single-stage design also removes reference-model drift as a failure mode — a subtle source of instability in long DPO runs where the reference distribution shifts relative to the current policy.
- Use ORPO when: memory per GPU is the binding constraint, you want to minimize training stages, or reference-model maintenance is operationally costly.
- Avoid ORPO when: you need to compare directly against DPO paper baselines or your downstream eval assumes a DPO-standard training protocol.
Choose DPO when you need the most familiar baseline
DPO remains the canonical preference optimization baseline. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduced the method that every subsequent algorithm in this set explicitly compares against. TRL v1.0 lists it in the stable trainer surface. OpenRLHF ships DPO training scripts for its distributed setup.
If your goal is reproducing an alignment paper, submitting a comparison to a benchmark leaderboard where DPO is the reference condition, or onboarding engineers who need to read existing alignment codebases without retraining — DPO is the right choice. Its implementation risk is the lowest in this group.
- Use DPO when: you are establishing a baseline, reproducing existing work, or comparing against published checkpoints trained with DPO.
- Avoid DPO when: reference-model memory overhead is prohibitive or you want a single-stage training pipeline.
Choose KTO when your signal is asymmetric or sparse
KTO's Kahneman-Tversky utility objective is structurally aligned with asymmetric feedback: humans weight losses more than equivalent gains, and real preference annotation pipelines often produce more negative signals than positive ones. OpenRLHF documents KTO support as part of its alignment toolkit.
When your dataset has limited chosen examples but broad coverage of rejected completions — or when annotators only flag bad outputs rather than ranking pairs — KTO's binary-label format avoids forcing the false structure of pairwise preference onto asymmetric data.
- Use KTO when: your annotation pipeline produces binary good/bad labels rather than ranked pairs, or positive signal is significantly sparser than negative.
- Avoid KTO when: your data is balanced paired preferences and you want to compare apples-to-apples against DPO-standard benchmarks.
Choose SimPO when you can tune carefully and want reference-free alignment
SimPO's "Simple Preference Optimization with a Reference-Free Reward" delivers the strongest reported benchmark numbers in this comparison — but only with disciplined hyperparameter control. The SimPO README states: "The three main hyperparameters of SimPO to focus on are learning_rate, beta, and gamma (we recommend keeping the total batch size fixed at 128)."
The reference-free design means no frozen checkpoint in memory, comparable to ORPO on memory cost. The practical advantage over ORPO is the explicit margin control via gamma, which gives the optimizer a tunable target gap between chosen and rejected reward. The practical disadvantage is that achieving the paper's reported gains requires a sweep — not a default run.
For teams running on TRL with a fixed compute budget and no sweep infrastructure, SimPO's upside is theoretical. For teams with OpenRLHF's distributed setup and the capacity to run 4–8 training jobs in parallel across the learning-rate grid, SimPO is the highest-upside option in this set.
- Use SimPO when: you have sweep infrastructure, a locked-down eval harness, and your model family matches the paper's evaluation settings.
- Avoid SimPO when: you need reproducible results on the first run or your CUDA environment is heterogeneous.
SimPO hyperparameters that matter in practice
SimPO's three primary knobs — learning_rate, beta, and gamma — interact in ways that make naive defaults fail. In SimPO, beta scales a length-normalized reward rather than DPO's reference-anchored log-ratio, so DPO-inherited beta values (typically 0.1–0.5) operate on the wrong scale and can produce reward collapse or length degeneration.
| Hyperparameter | Role in SimPO | DPO equivalent | Interaction risk |
|---|---|---|---|
| learning_rate | Controls gradient step magnitude | Same | Most critical; wrong value corrupts entire run |
| beta | Scales the length-normalized reward margin | beta (KL strength) in DPO | Operates on a different scale than DPO's beta; must re-tune |
| gamma | Minimum reward margin threshold | No equivalent | Too high → no gradient signal; too low → no margin enforcement |
| gamma/beta_ratio | Normalized margin target | No equivalent | Combine with beta tuning |
| Total batch size | Stability anchor | Variable | Hold fixed at 128 during all sweeps |
Recommended search ranges for learning rate, beta, and gamma/beta ratio
The SimPO README recommends the following grid: "grid search learning_rate over {3e-7, 5e-7, 8e-7, 1e-6}." Learning rate is the most sensitive lever — the README calls it "the most critical hyperparameter for preference optimization."
Beta typically starts around 2.0–2.5 in the preprint's base-model settings and can reach 10 for instruct-tuned models where the policy already has strong priors. Starting higher than DPO's conventional range (0.1–0.5) is expected and correct for SimPO.
For gamma/beta_ratio, SimPO training pipeline guidance recommends: "Start with gamma_beta_ratio of 0.5 and tune between 0 and 1." This ratio encodes the minimum reward margin relative to regularization strength. A ratio near 0 imposes almost no margin; a ratio near 1 can starve the gradient signal if the model cannot consistently exceed the threshold.
| Hyperparameter | Recommended search grid | Default starting point | Notes |
|---|---|---|---|
| learning_rate | 3e-7, 5e-7, 8e-7, 1e-6 | 5e-7 | Sweep first; all other tuning is secondary |
| beta | 2.0 – 10 | 2.0 (base), 2.5 (instruct) | Higher than DPO by design |
| gamma/beta_ratio | 0.0 – 1.0 | 0.5 | Grid search after fixing learning_rate |
| Total batch size | Fixed | 128 | Do not vary during hyperparameter search |
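The ranges above compose into a small sweep driver. This is a hypothetical sketch that flattens the recommended sequential tuning into one product grid; the variable names, the three beta points, and the five ratio points are illustrative choices, not SimPO tooling.

```python
from itertools import product

# Recommended ranges from the SimPO README as summarized in the table above.
LEARNING_RATES = [3e-7, 5e-7, 8e-7, 1e-6]
BETAS = [2.0, 2.5, 10.0]                       # illustrative points in 2.0-10
GAMMA_BETA_RATIOS = [0.0, 0.25, 0.5, 0.75, 1.0]

def sweep_configs():
    """Yield one run config per grid point, batch size held fixed at 128."""
    for lr, beta, ratio in product(LEARNING_RATES, BETAS, GAMMA_BETA_RATIOS):
        yield {
            "learning_rate": lr,
            "beta": beta,
            "gamma": ratio * beta,      # gamma is specified through the ratio
            "total_batch_size": 128,    # never varied during the sweep
        }

configs = list(sweep_configs())  # 4 x 3 x 5 = 60 candidate runs
```

In practice you would sweep learning_rate first and prune the grid before touching beta and the ratio, rather than launching all 60 runs.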
Batch size, hardware, and reproducibility constraints
The SimPO README is unambiguous: "We recommend keeping the total batch size fixed at 128." Changing batch size mid-sweep conflates two variables and makes it impossible to attribute score changes to the hyperparameter under test.
Hardware heterogeneity creates a second reproducibility layer. The SimPO repo notes that training outcomes can vary with hardware and CUDA versions — a real concern in multi-tenant clusters where NVIDIA H100 nodes may run different driver stacks. vLLM's GPU install documentation states: "If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source." When vLLM is used for eval generation in OpenRLHF-style pipelines, CUDA-version mismatches between training and eval environments can produce silent numerical differences.
OpenRLHF's "Ray + vLLM distributed architecture — scales to 70B+ models" addresses throughput at scale but introduces distributed scheduling complexity. Actor, Reward, Reference, and Critic models are placed on separate GPUs — which means a SimPO run in OpenRLHF (without a reference model) requires re-configuring the default role assignment so the Reference GPU allocation is repurposed.
| Constraint | Impact on SimPO | Mitigation |
|---|---|---|
| Batch size drift | Score variance across runs | Fix at 128; use gradient accumulation to hit it |
| CUDA version mismatch | Silent numerical differences in eval | Pin CUDA version in container; validate vLLM build |
| OpenRLHF distributed roles | Reference model slot wasted | Reconfigure role assignment for reference-free mode |
| H100 vs. A100 cluster | Different throughput curves affect LR effective range | Re-validate learning_rate grid on each hardware class |
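The "use gradient accumulation to hit it" mitigation reduces to simple arithmetic. A hedged helper, assuming plain data parallelism with an evenly divisible micro-batch layout:

```python
def grad_accum_steps(target_batch=128, per_device_batch=2, num_gpus=8):
    """Accumulation steps needed so that
    per_device_batch * num_gpus * steps == target_batch exactly."""
    effective = per_device_batch * num_gpus
    if target_batch % effective != 0:
        raise ValueError("micro-batch layout cannot reach the target exactly")
    return target_batch // effective

# e.g. 8 GPUs with micro-batch 2 need 8 accumulation steps to reach 128
```

Raising an error on non-divisible layouts is deliberate: silently rounding the effective batch size is exactly the drift the fixed-at-128 rule is meant to prevent.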
Tokenizer and evaluation gotchas in vLLM-based workflows
When using vLLM for eval generation with Llama 3 family models, incorrect chat-template handling causes vLLM to prepend two BOS tokens to the prompt. The SimPO repo flags this explicitly: "the updated Llama3 tokenizer with vLLM can introduce two BOS tokens during evaluation if chat-template handling is wrong."
Two BOS tokens change the prompt format seen by the model at eval time versus training time, producing a distribution mismatch that corrupts generation quality independent of alignment method quality. Scores produced under this condition are invalid for method comparison. The fix is to verify that the chat template applied by vLLM's tokenizer matches the template used during training — and to validate with a single prompt inspection before running full benchmark sweeps.
Watch Out: Always print the decoded input_ids for at least one eval prompt before running AlpacaEval 2 or Arena-Hard against a Llama 3 checkpoint via vLLM. A single extra <|begin_of_text|> token in the prefix is enough to skew all downstream scores.
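The one-prompt inspection can be automated as a guard before the benchmark run. A minimal sketch that operates on raw token ids so it stays independent of the tokenizer library; the default of 128000 for Llama 3's <|begin_of_text|> id is an assumption to verify against your tokenizer config.

```python
def has_duplicate_bos(input_ids, bos_token_id=128000):
    """Return True when a prompt encoding starts with two BOS tokens,
    the Llama 3 + vLLM chat-template failure mode described above.
    bos_token_id defaults to Llama 3's <|begin_of_text|> id (assumed)."""
    return len(input_ids) >= 2 and input_ids[0] == bos_token_id == input_ids[1]

# Guard one encoded eval prompt before launching the full benchmark, e.g.:
# assert not has_duplicate_bos(ids), "chat template prepended BOS twice"
```

Run it on a handful of encoded prompts from the actual eval harness, not on hand-constructed examples, so the check exercises the same template path as the benchmark.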
| Issue | Trigger | Effect on scores | Fix |
|---|---|---|---|
| Duplicate BOS token | Llama 3 tokenizer + vLLM chat template mismatch | Corrupted eval inputs | Validate chat template; pin tokenizer config |
| CUDA-version eval drift | Different CUDA on eval nodes | Silent score shift | Containerize eval environment separately |
| Length bias in AlpacaEval 2 | Raw WR vs. LC WR | Over-reports verbose models | Always report LC WR |
Decision matrix for alignment engineers in 2026
The four methods occupy distinct positions on the stability-vs-upside frontier. DPO and ORPO are the safe choices; KTO covers the asymmetric-data case; SimPO is the high-upside, high-variance option.
TRL's stable surface makes DPO the lowest-risk production choice for teams that cannot afford hyperparameter sweeps. OpenRLHF's Ray + vLLM architecture, which "scales to 70B+ models," makes it the preferred infrastructure for KTO and DPO at scale. SimPO's benchmark claims — up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard over DPO — are real in context but require disciplined tuning to reproduce.
If you are optimizing for stability
DPO is the most stable choice in TRL: it sits in the stable trainer surface, has years of community debugging, and its hyperparameter behavior is well-documented across dozens of public alignment runs. ORPO is a close second — its single-stage design removes reference-model drift as a failure mode, and ORPOTrainer in TRL is actively maintained.
KTO is stable when the dataset genuinely matches its binary-signal assumption. If you force KTO onto pairwise data by flattening it to binary labels, you lose information and add noise — at which point DPO on the original pairs is the safer option.
SimPO is the least stable method in this set. Its sensitivity to learning_rate and beta, combined with hardware-dependent reproducibility, means a team optimizing for stability should treat SimPO as experimental.
- Most stable: DPO (TRL) > ORPO (TRL) > KTO (OpenRLHF) > SimPO
If you are optimizing for benchmark upside
SimPO is the highest-upside option. "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard" — numbers that no other method in this set matches in available public evaluations. Achieving that upside requires: total batch size fixed at 128, learning_rate swept over {3e-7, 5e-7, 8e-7, 1e-6}, beta started at 2.0–2.5, and gamma/beta_ratio swept from 0 to 1 starting at 0.5.
Teams that can afford 4–8 training runs to find the right hyperparameter configuration will find SimPO competitive on AlpacaEval 2 and Arena-Hard. Teams that cannot should default to DPO and treat SimPO's reported gains as a ceiling rather than a baseline.
- Highest upside: SimPO (with sweep) > DPO (tuned) > ORPO > KTO
Which method should you pick in 2026?
Bottom Line: Start with DPO if you need a reproducible baseline or are working in TRL and OpenRLHF without sweep infrastructure — it remains the most debugged, most cited, and most implementation-stable method in this set. Move to ORPO if reference-model memory overhead is a hard constraint and you want a single-stage pipeline. Choose KTO when your preference data is binary or your positive signal is sparse relative to negative. Invest in SimPO only when you have the tuning discipline to sweep
learning_rate over {3e-7 – 1e-6}, beta starting at 2.0, and gamma/beta_ratio starting at 0.5 — with a locked-down eval harness and a fixed batch size of 128. SimPO's reported gains over DPO are the largest in this method family, but they are hyperparameter-sensitive and hardware-dependent; an untuned SimPO run will not beat a well-tuned DPO run.
FAQ
What is the difference between DPO and KTO? DPO maximizes log-likelihood of pairwise preferences using a reference-model-derived reward. KTO replaces the pairwise likelihood with a Kahneman-Tversky utility objective that treats chosen and rejected signals asymmetrically — matching how humans actually weight losses versus gains. DPO requires paired (chosen, rejected) examples; KTO accepts binary good/bad labels per completion.
Is SimPO better than DPO? Under tuned conditions, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard in the SimPO paper's evaluation regime. Outside those conditions — different model family, untuned hyperparameters, heterogeneous CUDA environment — the advantage does not hold. SimPO is a better method when you can tune it; DPO is a better method when you cannot.
Does ORPO need a reference model? No. ORPO's paper title — "Monolithic Preference Optimization without Reference Model" — states this explicitly. ORPO combines NLL and log odds ratio terms in a single-stage loss without any reference checkpoint.
Which preference optimization method is most stable? DPO, run through TRL's stable trainer surface, is the most stable in practice. ORPO is comparably stable and cheaper to run. SimPO is the least stable due to its sensitivity to learning_rate, beta, and gamma/beta_ratio, compounded by hardware-dependent reproducibility.
How do you tune beta for SimPO? Start beta at 2.0 for base models, 2.5 for instruct-tuned models. The valid range extends up to 10 for highly instruction-following checkpoints. Tune beta after fixing learning_rate from the recommended grid {3e-7, 5e-7, 8e-7, 1e-6}. Hold total batch size at 128 throughout. Tune gamma/beta_ratio last, starting at 0.5 and searching 0–1.
| Question | Short answer | Key source |
|---|---|---|
| DPO vs KTO objective | Pairwise log-ratio vs. Kahneman-Tversky utility | KTO paper |
| ORPO reference model | Not required | ORPO paper |
| Most stable method | DPO (TRL stable surface) | TRL v1.0 blog |
| SimPO vs DPO benchmark | SimPO +6.4 AE2, +7.5 AH (tuned) | SimPO paper |
| SimPO beta range | 2.0–10 depending on model type | SimPO README |
Sources and references
- SimPO GitHub Repository (princeton-nlp/SimPO) — Primary source for hyperparameter guidance, reproducibility warnings, tokenizer gotchas, and benchmark claims
- SimPO arXiv preprint (arXiv:2405.14734) — Source for AlpacaEval 2 (+6.4 pts) and Arena-Hard (+7.5 pts) gains over DPO
- SimPO paper PDF (arXiv:2405.14734v2) — Full evaluation methodology across AlpacaEval 2, MT-Bench, and Arena-Hard
- ORPO paper (arXiv:2403.07691) — "Monolithic Preference Optimization without Reference Model"
- ORPO ACL Anthology (EMNLP 2024) — Peer-reviewed ORPO paper with full method description
- KTO paper (arXiv:2402.01306) — Kahneman-Tversky Optimization paper
- DPO paper (arXiv:2305.18290) — "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
- Hugging Face TRL v1.0 Blog — Documents the stable trainer surface including DPO, SFT, Reward modeling, RLOO, and GRPO
- OpenRLHF GitHub — Source for distributed role separation (Actor, Reward, Reference, Critic) and DPO/KTO support
- OpenRLHF Documentation — Ray + vLLM architecture details and 70B+ scaling claims
- vLLM GPU Installation Docs — CUDA version compatibility requirements and source-build guidance
- AlpacaEval 2 Benchmark Overview — Benchmark adoption context and length-controlled win rate methodology
- SimPO Training Pipeline (DeepWiki) — gamma_beta_ratio tuning guidance