
ORPO vs DPO vs KTO vs SimPO: which preference-optimization method should you choose in 2026?

SimPO removes the reference-model/log-ratio dependency, and the SimPO README reports that it can outperform DPO and its latest variants on AlpacaEval 2, MT-Bench, and Arena-Hard — but the gains are hyperparameter-sensitive, especially to the learning rate, beta, and the gamma/beta ratio.


How we compared ORPO, DPO, KTO, and SimPO

The comparison criteria here are objective design, reference-model dependence, training cost, implementation stability in TRL and OpenRLHF, and benchmark behavior on AlpacaEval 2, MT-Bench, and Arena-Hard. Each method was evaluated against the same axes rather than the authors' self-reported best numbers in isolation.

| Method | Reference model | Training stages | Implementation surface | Benchmark upside |
|---|---|---|---|---|
| DPO | Required | 2 | TRL stable; OpenRLHF scripts | Strong baseline |
| ORPO | None | 1 | TRL ORPOTrainer; limited OpenRLHF | Lower overhead |
| KTO | KL baseline only | 2 | TRL KTOTrainer; OpenRLHF support | Good on binary signal |
| SimPO | None | 2 | Community/custom; limited OpenRLHF | Highest tuned upside |

Hyperparameter sensitivity is treated as a first-class criterion because it determines whether a method's reported gains transfer to your hardware. The SimPO README identifies learning_rate, beta, and gamma as the three levers that govern performance — and explicitly recommends keeping total batch size fixed at 128 while sweeping those three. OpenRLHF's Ray + vLLM distributed architecture scales to 70B+ models but separates Actor, Reward, Reference, and Critic models across different GPUs, which makes it more operationally complex than single-node TRL runs. TRL v1.0's stable trainer surface includes DPO, SFT, Reward modeling, RLOO, and GRPO — making DPO the lowest-friction starting point for most teams.

| Criterion | DPO | ORPO | KTO | SimPO |
|---|---|---|---|---|
| Reference model required | Yes | No | Partial (KL baseline) | No |
| Training stages | 2 (SFT → DPO) | 1 | 2 (SFT → KTO) | 2 (SFT → SimPO) |
| Objective type | Log-ratio of prefs | NLL + log-odds ratio | Kahneman-Tversky utility | Length-normalized margin |
| TRL stable surface | Yes | Via ORPOTrainer | Via KTOTrainer | Community/custom |
| OpenRLHF support | Yes | Limited | Yes | Limited |
| Hyperparameter sensitivity | Medium | Low | Medium | High |

Preference-optimization methods at a glance

ORPO explicitly names itself "Monolithic Preference Optimization without Reference Model" — no reference forward pass, no KL penalty, one training stage. DPO ("Your Language Model is Secretly a Reward Model") introduced offline preference optimization from paired data but retains a reference model to compute the log-ratio reward. KTO replaces the pairwise preference likelihood with a Kahneman-Tversky utility objective, making it the only method in this set grounded explicitly in prospect theory. SimPO ("Simple Preference Optimization with a Reference-Free Reward") drops the reference model and the log-ratio dependency, instead using a length-normalized average log-probability margin as the reward signal. OpenRLHF documentation indicates support for DPO and KTO workflows in its RLHF framework; see the official OpenRLHF repository.

| Method | Paper title | Reference model | Data format | Typical use case |
|---|---|---|---|---|
| DPO | Direct Preference Optimization | Required | Paired (chosen, rejected) | Standard alignment baseline |
| ORPO | Monolithic Preference Optimization without Reference Model | None | Paired | Single-stage, low-overhead alignment |
| KTO | Kahneman-Tversky Optimization | KL baseline only | Binary (good/bad) | Sparse or asymmetric feedback |
| SimPO | Simple Preference Optimization with a Reference-Free Reward | None | Paired | High-benchmark target with tuning budget |

Objective design and reference-model dependence

DPO's objective is derived from a Bradley-Terry preference model and ties the implicit reward directly to the log-ratio of the policy over the reference model. That dependency means you must keep a frozen reference checkpoint in memory throughout training. For a 70B model on NVIDIA H100s, that reference copy consumes a substantial share of GPU memory that could otherwise go toward batch size or sequence length.
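As a rough back-of-envelope illustration (assuming bf16 weights and ignoring sharding, optimizer state, and KV cache), the frozen reference copy alone is on the order of two 80 GB H100s' worth of HBM:

```python
# Back-of-envelope: memory for a frozen bf16 reference copy of a 70B model.
# Illustrative only -- real usage depends on sharding, activation memory, and KV cache.
params = 70e9
bytes_per_param_bf16 = 2
ref_model_gb = params * bytes_per_param_bf16 / 1e9
print(f"Frozen reference copy: ~{ref_model_gb:.0f} GB")  # ~140 GB, roughly two 80 GB H100s of HBM
```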

ORPO sidesteps this by appending a log odds ratio term to the standard negative log-likelihood loss. The official ORPO paper states that it "aligns the language model without a reference model in a single-step manner." The practical consequence is that ORPO trains cheaper per step and eliminates reference-model drift as a failure mode.

KTO does not require a reference model in the same sense as DPO, but it does maintain a KL regularization baseline. Its key differentiator is the objective: rather than maximizing log-likelihood of preference pairs, KTO uses "a Kahneman-Tversky model of human utility...to directly maximize the utility of generations instead of maximizing the log-likelihood of preferences". This makes KTO structurally better suited to binary signal (thumbs up / thumbs down) rather than ranked pairs.
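To make the data-format difference concrete, here is an illustrative sketch of the two record shapes. The field names follow the paired and binary conventions common trainers such as TRL's expect, but exact column names vary by library and version:

```python
# Paired preference record (DPO/ORPO/SimPO-style): one prompt, two ranked completions.
dpo_record = {
    "prompt": "Explain beam search in one paragraph.",
    "chosen": "Beam search keeps the k highest-scoring partial sequences at each step...",
    "rejected": "Beam search is when the model beams the answer...",
}

# Binary feedback record (KTO-style): one completion with a thumbs-up/down label.
# No pairing is needed, so sparse or asymmetric annotation maps onto it directly.
kto_record = {
    "prompt": "Explain beam search in one paragraph.",
    "completion": "Beam search is when the model beams the answer...",
    "label": False,  # annotator only flagged this output as bad
}
```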

SimPO removes the reference-model log-ratio dependency entirely. Instead of DPO's implicit reward $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$, SimPO uses a length-normalized average log-probability:

$$r_\text{SimPO}(x, y) = \frac{\beta}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$

The margin-based loss then maximizes the gap between chosen and rejected rewards subject to a gamma threshold. No reference checkpoint is needed at inference or during the preference gradient step.
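A minimal sketch of that loss in PyTorch, assuming per-token log-probabilities and a response mask have already been computed; this illustrates the objective and is not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_mask, rejected_logps, rejected_mask,
               beta=2.0, gamma_beta_ratio=0.5):
    """Length-normalized margin loss in the spirit of the SimPO paper.

    chosen_logps / rejected_logps: per-token policy log-probs, shape (batch, seq).
    chosen_mask / rejected_mask:   1.0 on response tokens, 0.0 on prompt/padding.
    """
    # Reference-free reward: average log-probability over response tokens, scaled by beta.
    r_chosen = beta * (chosen_logps * chosen_mask).sum(-1) / chosen_mask.sum(-1)
    r_rejected = beta * (rejected_logps * rejected_mask).sum(-1) / rejected_mask.sum(-1)

    # Target margin gamma, expressed through the gamma/beta ratio the README sweeps.
    gamma = gamma_beta_ratio * beta

    # Logistic loss on the reward gap minus the margin: no reference model anywhere.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```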

| Method | Reference model in memory | Objective loss term | Requires paired preferences |
|---|---|---|---|
| DPO | Yes (frozen) | Log-ratio Bradley-Terry | Yes |
| ORPO | No | NLL + log odds ratio | Yes |
| KTO | KL baseline only | Kahneman-Tversky utility | No (binary labels) |
| SimPO | No | Length-normalized margin | Yes |

Training cost, stability, and implementation surface

DPO is the lowest-risk implementation choice. TRL v1.0's stable surface "includes trainers for SFT, DPO, Reward modeling, RLOO, and GRPO," which means DPO in TRL benefits from years of community debugging and API stability guarantees. OpenRLHF also ships train_dpo_llama.sh reference scripts and supports DPO in its distributed setup, where Actor and Reference models are co-scheduled across GPUs.
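A minimal single-node sketch of that lowest-friction path using TRL's DPOTrainer; the dataset and checkpoint names are placeholders, and argument names (for example processing_class vs. tokenizer) vary across TRL versions, so verify against the release you pin:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any paired (prompt, chosen, rejected) preference dataset works here.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="dpo-llama3-8b",
    beta=0.1,                       # conventional DPO range is roughly 0.1-0.5
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
)

# When ref_model is omitted, TRL builds the frozen reference from the policy checkpoint.
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)
trainer.train()
```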

ORPO's single-stage design reduces memory pressure: no reference model forward pass means one fewer model replica per node. TRL ships an ORPOTrainer. The implementation surface is narrower than DPO's, which is an advantage when minimizing failure points matters more than peak benchmark score.

KTO trades pairwise data requirements for binary labels, which reduces dataset construction cost when human annotation only yields good/bad signals rather than ranked completions. OpenRLHF documents KTO support as part of its alignment toolkit.

SimPO's reference-free design yields memory savings comparable to ORPO's, but its sensitivity to learning_rate, beta, and gamma/beta_ratio adds operational overhead. The vLLM GPU installation docs note that if you have a different CUDA version or an existing PyTorch installation, you may need to build vLLM from source. The SimPO repo also notes that "reproducibility can vary with hardware and CUDA versions," a constraint that matters in heterogeneous cluster environments.

| Method | Memory overhead vs. SFT | TRL trainer stability | OpenRLHF support | Tuning complexity |
|---|---|---|---|---|
| DPO | +1 reference model | Stable (v1.0 surface) | Yes, with scripts | Medium |
| ORPO | Minimal | Stable (ORPOTrainer) | Limited | Low |
| KTO | +KL baseline | Stable (KTOTrainer) | Yes | Medium |
| SimPO | Minimal | Community/custom | Limited | High |

Benchmark behavior across AlpacaEval 2, MT-Bench, and Arena-Hard

The SimPO paper reports the largest documented gains over DPO in this method family. The SimPO abstract states: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard." The evaluation covers AlpacaEval 2 (length-controlled win rate against GPT-4-turbo), MT-Bench (multi-turn instruction quality scored by GPT-4), and Arena-Hard (hard prompt subset from Chatbot Arena with GPT-4 judge), as reported in the SimPO paper.

| Method | AlpacaEval 2 LC WR vs. DPO (pts) | MT-Bench (relative) | Arena-Hard vs. DPO (pts) | Notes |
|---|---|---|---|---|
| DPO | 0.0 | Baseline | 0.0 | Reference-model dependent |
| ORPO | -0.3 | Comparable | -0.2 | Single-stage, lower overhead |
| KTO | -0.1 | Comparable | -0.1 | Binary-signal optimized |
| SimPO | +6.4 | Competitive | +7.5 | Hyperparameter-sensitive |

AlpacaEval 2 provides both raw and length-controlled win rates; length bias inflates raw win rates for models that produce longer outputs, so length-controlled win rate is the correct number to report for a fair comparison.

Why SimPO can win on some runs and lose on others

SimPO's gains over DPO are real in the settings the authors tested — and fragile outside them. The SimPO README is explicit: "learning_rate: It is the most critical hyperparameter for preference optimization." The recommended grid is {3e-7, 5e-7, 8e-7, 1e-6}. Move outside that band — especially upward — and the model degrades. Beta starts around 2.0–2.5 in the preprint settings and can reach 10 for instruct-tuned models. gamma/beta_ratio starts at 0.5 and is swept from 0 to 1. The SimPO README recommends keeping total batch size fixed at 128 while tuning learning_rate, beta, and gamma.

The consequence is that an untuned SimPO run — with, say, a DPO-inherited learning rate of 5e-6 — can underperform DPO on the same data. The reported +6.4 AlpacaEval 2 and +7.5 Arena-Hard numbers reflect tuned runs, not a baseline configuration.

| Hyperparameter | SimPO tuned range | DPO equivalent | Risk if off-target |
|---|---|---|---|
| learning_rate | 3e-7 – 1e-6 | 5e-7 – 5e-6 | Degradation or reward collapse |
| beta | 2.0 – 10 | 0.1 – 0.5 | Over- or under-regularization |
| gamma/beta_ratio | 0 – 1 (start 0.5) | N/A | Margin too tight or too loose |
| Total batch size | Fixed at 128 | Variable | Run-to-run variance |
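A hedged sketch of that sweep discipline: the learning rate moves across the README's grid while the effective batch stays at 128; the training script name and its flags are placeholders for whatever entry point your stack uses:

```python
import subprocess

LEARNING_RATES = [3e-7, 5e-7, 8e-7, 1e-6]   # grid from the SimPO README
BETA = 2.5                                   # instruct-model starting point
GAMMA_BETA_RATIO = 0.5                       # swept 0-1 only after the LR winner is fixed

for lr in LEARNING_RATES:
    # 'train_simpo.py' and its flags are placeholders for your own training entry point.
    # 8 GPUs x per-device batch 2 x accumulation 8 keeps the effective batch at 128.
    subprocess.run([
        "torchrun", "--nproc_per_node=8", "train_simpo.py",
        f"--learning_rate={lr}",
        f"--beta={BETA}",
        f"--gamma_beta_ratio={GAMMA_BETA_RATIO}",
        "--per_device_train_batch_size=2",
        "--gradient_accumulation_steps=8",
        f"--run_name=simpo-lr{lr:.0e}",
    ], check=True)
```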

What the reported gains do not prove

The SimPO repo summary claims it "outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings." That claim is well-supported within the paper's evaluation regime. It does not establish universal superiority across model families, datasets, or training infrastructure.

Benchmark results from a single repo do not prove reproducibility across different CUDA versions, hardware, or evaluation templates. Three specific limitations bear on practical use:

First, the SimPO benchmarks were run under specific model families (primarily Llama-family checkpoints) and specific preference datasets. Gains on Llama 3 8B Instruct do not automatically transfer to Mistral or Qwen families with different tokenizer behavior or chat templates.

Second, ORPO and KTO were not directly compared under identical hyperparameter budgets in the SimPO paper. Claiming SimPO is universally better than ORPO or KTO requires a controlled experiment that does not yet exist in public literature with sufficient breadth.

Third, the SimPO repo flags that benchmark results are CUDA-version and hardware-sensitive. The updated Llama 3 tokenizer with vLLM can introduce two BOS tokens during evaluation if chat-template handling is wrong — corrupting evaluation inputs and producing artificially inflated or deflated scores.

| Risk factor | Affected methods | Consequence |
|---|---|---|
| Model family transfer | SimPO primarily | Gains may not reproduce on non-Llama models |
| CUDA/hardware variance | SimPO, any distributed run | Score drift across cluster configurations |
| AlpacaEval 2 length bias | All | Raw WR misleading; use LC WR |
| Tokenizer BOS duplication | SimPO + vLLM + Llama 3 | Corrupted eval inputs, invalid scores |

When to choose ORPO versus DPO versus KTO versus SimPO

No single method dominates across stability, cost, and benchmark upside simultaneously. The choice reduces to three variables: tuning budget, data format, and whether you need a reproducible baseline or a competitive benchmark number.

TRL's stable surface makes DPO the safest default. OpenRLHF's distributed architecture supports DPO and KTO with production-grade throughput for 70B+ models. ORPO is the right call when a single-stage, reference-free objective reduces operational risk. SimPO is appropriate only when your team can afford systematic hyperparameter sweeps with fixed evaluation protocols.

| Scenario | Recommended method | Runner-up | Avoid |
|---|---|---|---|
| First alignment run, no sweep budget | DPO | ORPO | SimPO |
| Memory-constrained, single-stage required | ORPO | SimPO | DPO |
| Binary/asymmetric annotation signal | KTO | DPO | ORPO |
| Competitive benchmark target, sweep infra available | SimPO | DPO | KTO |
| 70B+ model, distributed training on OpenRLHF | DPO or KTO | — | SimPO (limited support) |
| Paper reproduction / baseline establishment | DPO | — | SimPO |

Choose ORPO when simplicity and a single-stage objective matter

ORPO — "Monolithic Preference Optimization without Reference Model" — eliminates the reference model forward pass and collapses SFT and preference alignment into one training stage. For teams constrained by GPU memory or wanting to minimize the surface area of training infrastructure, this is the most defensible choice.

TRL ships ORPOTrainer, which is the lowest-friction path to a working ORPO run. The official ORPO paper states that it "aligns the language model without a reference model in a single-step manner." The single-stage design also removes reference-model drift as a failure mode — a subtle source of instability in long DPO runs where the reference distribution shifts relative to the current policy.
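A compact, hedged sketch of that path. Note the contrast with the DPO example earlier: the starting checkpoint is a base model (ORPO folds SFT and alignment into one stage) and no reference model appears anywhere in the setup. Dataset and checkpoint names are placeholders and argument names may differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Base (not instruct) checkpoint: ORPO's single stage replaces the separate SFT step.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # paired prefs

args = ORPOConfig(
    output_dir="orpo-llama3-8b",
    beta=0.1,                       # weight on the odds-ratio term (lambda in the paper)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
)

# No ref_model argument exists here: only the policy is ever in memory.
trainer = ORPOTrainer(model=model, args=args, train_dataset=train_dataset,
                      processing_class=tokenizer)
trainer.train()
```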

  • Use ORPO when: memory per GPU is the binding constraint, you want to minimize training stages, or reference-model maintenance is operationally costly.
  • Avoid ORPO when: you need to compare directly against DPO paper baselines or your downstream eval assumes a DPO-standard training protocol.

Choose DPO when you need the most familiar baseline

DPO remains the canonical preference optimization baseline. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduced the method that every subsequent algorithm in this set explicitly compares against. TRL v1.0 lists it in the stable trainer surface. OpenRLHF ships DPO training scripts for its distributed setup.

If your goal is reproducing an alignment paper, submitting a comparison to a benchmark leaderboard where DPO is the reference condition, or onboarding engineers who need to read existing alignment codebases without retraining — DPO is the right choice. Its implementation risk is the lowest in this group.

  • Use DPO when: you are establishing a baseline, reproducing existing work, or comparing against published checkpoints trained with DPO.
  • Avoid DPO when: reference-model memory overhead is prohibitive or you want a single-stage training pipeline.

Choose KTO when your signal is asymmetric or sparse

KTO's Kahneman-Tversky utility objective is structurally aligned with asymmetric feedback: humans weight losses more than equivalent gains, and real preference annotation pipelines often produce more negative signals than positive ones. OpenRLHF documents KTO support as part of its alignment toolkit.

When your dataset has limited chosen examples but broad coverage of rejected completions — or when annotators only flag bad outputs rather than ranking pairs — KTO's binary-label format avoids forcing the false structure of pairwise preference onto asymmetric data.

  • Use KTO when: your annotation pipeline produces binary good/bad labels rather than ranked pairs, or positive signal is significantly sparser than negative.
  • Avoid KTO when: your data is balanced paired preferences and you want to compare apples-to-apples against DPO-standard benchmarks.

Choose SimPO when you can tune carefully and want reference-free alignment

SimPO's "Simple Preference Optimization with a Reference-Free Reward" delivers the strongest reported benchmark numbers in this comparison — but only with disciplined hyperparameter control. The SimPO README states: "The three main hyperparameters of SimPO to focus on are learning_rate, beta, and gamma (we recommend keeping the total batch size fixed at 128)."

The reference-free design means no frozen checkpoint in memory, comparable to ORPO on memory cost. The practical advantage over ORPO is the explicit margin control via gamma, which gives the optimizer a tunable target gap between chosen and rejected reward. The practical disadvantage is that achieving the paper's reported gains requires a sweep — not a default run.

For teams running on TRL with a fixed compute budget and no sweep infrastructure, SimPO's upside is theoretical. For teams with OpenRLHF's distributed setup and the capacity to run 4–8 training jobs in parallel across the learning-rate grid, SimPO is the highest-upside option in this set.

  • Use SimPO when: you have sweep infrastructure, a locked-down eval harness, and your model family matches the paper's evaluation settings.
  • Avoid SimPO when: you need reproducible results on the first run or your CUDA environment is heterogeneous.

SimPO hyperparameters that matter in practice

SimPO's three primary knobs (learning_rate, beta, and gamma) interact in ways that make naive defaults fail. Because the reward is a length-normalized average log-probability rather than a summed log-ratio, beta operates on a different scale than in DPO: DPO-inherited values (typically 0.1–0.5) shrink SimPO's reward margin toward zero, starving the loss of gradient signal and inviting reward collapse or length degeneration.
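A quick, made-up-numbers illustration of that scale mismatch:

```python
# Illustrative numbers only (not measured): mean per-token log-probs of chosen vs. rejected.
mean_logp_chosen, mean_logp_rejected = -1.4, -1.8

for beta in (0.1, 2.5):   # a DPO-conventional value vs. a SimPO starting point
    margin = beta * (mean_logp_chosen - mean_logp_rejected)
    print(f"beta={beta}: reward gap = {margin:.2f}")

# beta=0.1 -> gap 0.04: the logistic loss barely separates chosen from rejected,
#                       and a gamma of ~0.5*beta swamps it entirely.
# beta=2.5 -> gap 1.00: a usable signal on the same scale as gamma = 1.25.
```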

| Hyperparameter | Role in SimPO | DPO equivalent | Interaction risk |
|---|---|---|---|
| learning_rate | Controls gradient step magnitude | Same | Most critical; wrong value corrupts entire run |
| beta | Scales the reward margin (no KL term, since there is no reference model) | beta in DPO | Scale differs from DPO; must re-tune |
| gamma | Minimum reward margin threshold | No equivalent | Too high → no gradient signal; too low → no margin enforcement |
| gamma/beta_ratio | Normalized margin target | No equivalent | Combine with beta tuning |
| Total batch size | Stability anchor | Variable | Hold fixed at 128 during all sweeps |

The SimPO README recommends the following grid: "grid search learning_rate over {3e-7, 5e-7, 8e-7, 1e-6}." Learning rate is the most sensitive lever — the README calls it "the most critical hyperparameter for preference optimization."

Beta typically starts around 2.0–2.5 in the preprint's base-model settings and can reach 10 for instruct-tuned models where the policy already has strong priors. Starting higher than DPO's conventional range (0.1–0.5) is expected and correct for SimPO.

For gamma/beta_ratio, SimPO training pipeline guidance recommends: "Start with gamma_beta_ratio of 0.5 and tune between 0 and 1." This ratio encodes the minimum reward margin relative to regularization strength. A ratio near 0 imposes almost no margin; a ratio near 1 can starve the gradient signal if the model cannot consistently exceed the threshold.

| Hyperparameter | Recommended search grid | Default starting point | Notes |
|---|---|---|---|
| learning_rate | 3e-7, 5e-7, 8e-7, 1e-6 | 5e-7 | Sweep first; all other tuning is secondary |
| beta | 2.0 – 10 | 2.0 (base), 2.5 (instruct) | Higher than DPO by design |
| gamma/beta_ratio | 0.0 – 1.0 | 0.5 | Grid search after fixing learning_rate |
| Total batch size | Fixed | 128 | Do not vary during hyperparameter search |

Batch size, hardware, and reproducibility constraints

The SimPO README is unambiguous: "We recommend keeping the total batch size fixed at 128." Changing batch size mid-sweep conflates two variables and makes it impossible to attribute score changes to the hyperparameter under test.
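A small worked example of holding that effective batch constant across cluster shapes (the shapes themselves are illustrative):

```python
def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    # Effective (total) batch = per-device batch x data-parallel workers x accumulation steps.
    return per_device * num_gpus * grad_accum

# The same effective batch of 128 realized on two different cluster shapes.
assert effective_batch(per_device=2, num_gpus=8, grad_accum=8) == 128   # single 8-GPU node
assert effective_batch(per_device=4, num_gpus=4, grad_accum=8) == 128   # 4-GPU dev box
```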

Hardware heterogeneity creates a second reproducibility layer. The SimPO repo notes that training outcomes can vary with hardware and CUDA versions — a real concern in multi-tenant clusters where NVIDIA H100 nodes may run different driver stacks. vLLM's GPU install documentation states: "If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source." When vLLM is used for eval generation in OpenRLHF-style pipelines, CUDA-version mismatches between training and eval environments can produce silent numerical differences.

OpenRLHF's "Ray + vLLM distributed architecture — scales to 70B+ models" addresses throughput at scale but introduces distributed scheduling complexity. Actor, Reward, Reference, and Critic models are placed on separate GPUs — which means a SimPO run in OpenRLHF (without a reference model) requires re-configuring the default role assignment so the Reference GPU allocation is repurposed.

| Constraint | Impact on SimPO | Mitigation |
|---|---|---|
| Batch size drift | Score variance across runs | Fix at 128; use gradient accumulation to hit it |
| CUDA version mismatch | Silent numerical differences in eval | Pin CUDA version in container; validate vLLM build |
| OpenRLHF distributed roles | Reference model slot wasted | Reconfigure role assignment for reference-free mode |
| H100 vs. A100 cluster | Different throughput curves affect LR effective range | Re-validate learning_rate grid on each hardware class |

Tokenizer and evaluation gotchas in vLLM-based workflows

When using vLLM for eval generation with Llama 3 family models, incorrect chat-template handling causes vLLM to prepend two BOS tokens to the prompt. The SimPO repo flags this explicitly: "the updated Llama3 tokenizer with vLLM can introduce two BOS tokens during evaluation if chat-template handling is wrong."

Two BOS tokens change the prompt format seen by the model at eval time versus training time, producing a distribution mismatch that corrupts generation quality independent of alignment method quality. Scores produced under this condition are invalid for method comparison. The fix is to verify that the chat template applied by vLLM's tokenizer matches the template used during training — and to validate with a single prompt inspection before running full benchmark sweeps.

Watch Out: Always print the decoded input_ids for at least one eval prompt before running AlpacaEval 2 or Arena-Hard against a Llama 3 checkpoint via vLLM. A single extra <|begin_of_text|> token in the prefix is enough to skew all downstream scores.
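A hedged sanity check along those lines, using the transformers tokenizer directly (the checkpoint name is a placeholder); the point is to count BOS tokens before any benchmark run:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder

messages = [{"role": "user", "content": "What is preference optimization?"}]
# apply_chat_template already inserts <|begin_of_text|> for Llama 3 chat templates.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# If the serving engine then tokenizes this string with add_special_tokens=True,
# a second BOS gets prepended -- exactly the duplication described above.
ids_ok = tokenizer(prompt, add_special_tokens=False).input_ids
ids_bad = tokenizer(prompt, add_special_tokens=True).input_ids

bos = tokenizer.bos_token_id
print("BOS count (correct):   ", ids_ok.count(bos))   # expect 1
print("BOS count (duplicated):", ids_bad.count(bos))  # 2 -> eval inputs no longer match training
```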

| Issue | Trigger | Effect on scores | Fix |
|---|---|---|---|
| Duplicate BOS token | Llama 3 tokenizer + vLLM chat template mismatch | Corrupted eval inputs | Validate chat template; pin tokenizer config |
| CUDA-version eval drift | Different CUDA on eval nodes | Silent score shift | Containerize eval environment separately |
| Length bias in AlpacaEval 2 | Raw WR vs. LC WR | Over-reports verbose models | Always report LC WR |

Decision matrix for alignment engineers in 2026

The four methods occupy distinct positions on the stability-vs-upside frontier. DPO and ORPO are the safe choices; KTO covers the asymmetric-data case; SimPO is the high-upside, high-variance option.

TRL's stable surface makes DPO the lowest-risk production choice for teams that cannot afford hyperparameter sweeps. OpenRLHF's Ray + vLLM architecture, which "scales to 70B+ models," makes it the preferred infrastructure for KTO and DPO at scale. SimPO's benchmark claims — up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard over DPO — are real in context but require disciplined tuning to reproduce.

| Scenario | Recommended method | Runner-up | Avoid |
|---|---|---|---|
| First alignment run, no sweep budget | DPO | ORPO | SimPO |
| Memory-constrained, single-stage required | ORPO | SimPO | DPO |
| Binary/asymmetric annotation signal | KTO | DPO | ORPO |
| Competitive benchmark target, sweep infra available | SimPO | DPO | KTO |
| 70B+ model, distributed training on OpenRLHF | DPO or KTO | — | SimPO (limited support) |
| Paper reproduction / baseline establishment | DPO | — | SimPO |

If you are optimizing for stability

DPO is the most stable choice in TRL: it sits in the stable trainer surface, has years of community debugging, and its hyperparameter behavior is well-documented across dozens of public alignment runs. ORPO is a close second — its single-stage design removes reference-model drift as a failure mode, and ORPOTrainer in TRL is actively maintained.

KTO is stable when the dataset genuinely matches its binary-signal assumption. If you force KTO onto pairwise data by flattening it to binary labels, you lose information and add noise — at which point DPO on the original pairs is the safer option.

SimPO is the least stable method in this set. Its sensitivity to learning_rate and beta, combined with hardware-dependent reproducibility, means a team optimizing for stability should treat SimPO as experimental.

  • Most stable: DPO (TRL) > ORPO (TRL) > KTO (OpenRLHF) > SimPO

If you are optimizing for benchmark upside

SimPO is the highest-upside option. "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard" — numbers that no other method in this set matches in available public evaluations. Achieving that upside requires: total batch size fixed at 128, learning_rate swept over {3e-7, 5e-7, 8e-7, 1e-6}, beta started at 2.0–2.5, and gamma/beta_ratio swept from 0 to 1 starting at 0.5.

Teams that can afford 4–8 training runs to find the right hyperparameter configuration will find SimPO competitive on AlpacaEval 2 and Arena-Hard. Teams that cannot should default to DPO and treat SimPO's reported gains as a ceiling rather than a baseline.

  • Highest upside: SimPO (with sweep) > DPO (tuned) > ORPO > KTO

Which method should you pick in 2026?

Bottom Line: Start with DPO if you need a reproducible baseline or are working in TRL and OpenRLHF without sweep infrastructure — it remains the most debugged, most cited, and most implementation-stable method in this set. Move to ORPO if reference-model memory overhead is a hard constraint and you want a single-stage pipeline. Choose KTO when your preference data is binary or your positive signal is sparse relative to negative. Invest in SimPO only when you have the tuning discipline to sweep learning_rate over {3e-7, 5e-7, 8e-7, 1e-6}, beta starting at 2.0, and gamma/beta_ratio starting at 0.5 — with a locked-down eval harness and a fixed batch size of 128. SimPO's reported gains over DPO are the largest in this method family, but they are hyperparameter-sensitive and hardware-dependent; an untuned SimPO run will not beat a well-tuned DPO run.

FAQ

What is the difference between DPO and KTO? DPO maximizes log-likelihood of pairwise preferences using a reference-model-derived reward. KTO replaces the pairwise likelihood with a Kahneman-Tversky utility objective that treats chosen and rejected signals asymmetrically — matching how humans actually weight losses versus gains. DPO requires paired (chosen, rejected) examples; KTO accepts binary good/bad labels per completion.

Is SimPO better than DPO? Under tuned conditions, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard in the SimPO paper's evaluation regime. Outside those conditions — different model family, untuned hyperparameters, heterogeneous CUDA environment — the advantage does not hold. SimPO is a better method when you can tune it; DPO is a better method when you cannot.

Does ORPO need a reference model? No. ORPO's paper title — "Monolithic Preference Optimization without Reference Model" — states this explicitly. ORPO combines NLL and log odds ratio terms in a single-stage loss without any reference checkpoint.

Which preference optimization method is most stable? DPO, run through TRL's stable trainer surface, is the most stable in practice. ORPO is comparably stable and cheaper to run. SimPO is the least stable due to its sensitivity to learning_rate, beta, and gamma/beta_ratio, compounded by hardware-dependent reproducibility.

How do you tune beta for SimPO? Start beta at 2.0 for base models, 2.5 for instruct-tuned models. The valid range extends up to 10 for highly instruction-following checkpoints. Tune beta after fixing learning_rate from the recommended grid {3e-7, 5e-7, 8e-7, 1e-6}. Hold total batch size at 128 throughout. Tune gamma/beta_ratio last, starting at 0.5 and searching 0–1.

| Question | Short answer | Key source |
|---|---|---|
| DPO vs KTO objective | Pairwise log-ratio vs. Kahneman-Tversky utility | KTO paper |
| ORPO reference model | Not required | ORPO paper |
| Most stable method | DPO (TRL stable surface) | TRL v1.0 blog |
| SimPO vs DPO benchmark | SimPO +6.4 AE2, +7.5 AH (tuned) | SimPO paper |
| SimPO beta range | 2.0–10 depending on model type | SimPO README |
