What ORPO changes in preference alignment
ORPO — Odds Ratio Preference Optimization — eliminates the frozen reference network that DPO-style methods keep running throughout training. The paper describes it as a "reference model-free monolithic odds ratio preference optimization algorithm" in arXiv:2403.07691, and that single design choice restructures the entire post-training stack. Instead of a two-model forward pass per batch (policy + reference), ORPO folds preference learning directly into the supervised fine-tuning loss as a log odds-ratio term. One model, one optimizer state, one backward pass.
The headline — "no reference model" — is true but undersells the architectural consequence. Reference-model-based methods use the frozen policy as a KL anchor: the preference gradient is regularized by how much the live policy diverges from where it started. Remove that anchor and the gradient coupling, calibration, and stability properties of the optimization all shift. The method becomes simpler to operate but harder to make stable from first principles, because the stabilizing pressure that the reference model provided must now come entirely from loss design and dataset quality.
Bottom Line: ORPO combines the supervised fine-tuning objective and the preference penalty into a single training pass by replacing the reference-model divergence term with an odds-ratio signal computed from the live policy alone. This removes a second model from the stack — reducing memory pressure and eliminating the SFT-then-alignment pipeline boundary — but it transfers the stability burden from model anchoring to loss weighting, learning rate, and data quality. The tradeoff is real: ORPO is simpler to operate and faster to iterate on, but it offers fewer guardrails when preference data are noisy or when the likelihood and preference gradients conflict.
The released Mistral-ORPO checkpoints (arXiv:2403.07691) demonstrate the method is viable at 7B scale: MT-Bench scores of 7.24 and 7.32, with IFEval instruction-level loose accuracy of 61.63% and 66.19%, respectively. Those numbers are competitive with DPO-aligned models of similar size, which is the paper's empirical claim.
How ORPO folds supervised and preference learning into one objective
The combined loss is the mechanism that makes ORPO monolithic. Hugging Face TRL documentation describes ORPO as appending "a simple log odds ratio term to the NLL loss" — which is accurate but collapses the mechanics. The full objective is:
$$ \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}} $$
where $\mathcal{L}_{\text{SFT}}$ is the standard negative log-likelihood on the chosen response, $\mathcal{L}_{\text{OR}}$ is the odds-ratio preference penalty, and $\lambda$ (implemented as beta in TRL's ORPOConfig) controls the relative weighting. The Hugging Face TRL ORPO Trainer docs set beta=0.1 as the default, reflecting that the SFT term should dominate in magnitude for most runs.
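As a concrete anchor, here is a minimal sketch of how that weighting surfaces in TRL's config. The field names are real ORPOConfig/TrainingArguments parameters; the values are illustrative, and the only ORPO-specific field here is beta:

```python
from trl import ORPOConfig

config = ORPOConfig(
    output_dir="orpo-run",   # illustrative path
    beta=0.1,                # the lambda weight on the odds-ratio term (TRL default)
    learning_rate=1e-5,      # illustrative; see the tuning discussion later in this section
)
```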
What DPO does differently: its preference term computes the log-ratio of policy probabilities relative to a frozen reference:
$$ \mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right) $$
ORPO replaces $\pi_{\text{ref}}$ entirely. The odds-ratio term is constructed from the live policy only, making the preference signal self-referential within a single forward pass.
Pro Tip: The odds-ratio term is not just an implementation convenience — it changes what the gradient measures. In DPO, the gradient encodes how much more the policy prefers the winner than the reference did. In ORPO, it encodes how strongly the policy's current output distribution distinguishes chosen from rejected. That distinction matters when you are deciding whether you need the extra regularization that a reference model provides.
What gets merged from SFT and preference optimization
ORPO retains two specific signals from the two training stages it collapses. From SFT, it keeps token-level likelihood training on the chosen response — TRL's documentation describes this as "a strong adaptation signal to the chosen response." From preference optimization, it keeps the contrastive pairwise signal that distinguishes chosen from rejected — Hugging Face TRL ORPO Trainer docs describe it as "a minor penalty for the disfavored generation together with a strong adaption signal to the chosen response."
This asymmetry is intentional. The NLL term drives fluency, instruction following, and format learning. The odds-ratio term drives separation between what the model should and should not generate. Because both run in the same forward pass, the gradients interact directly rather than being applied sequentially across a pipeline boundary.
Pro Tip: Eliminating the SFT-then-DPO pipeline boundary removes a real operational burden. In a two-stage stack, the SFT checkpoint must be saved, validated, converted if needed for the alignment trainer, and used as both the policy initialization and the reference model. With ORPO, you load one checkpoint, configure one trainer, and run one job. For teams with limited orchestration infrastructure, that reduction in moving parts compounds across many experiments.
The data contract stays preference-pair oriented. ORPO does not work on SFT-only corpora — the trainer needs chosen and rejected responses to construct the odds-ratio signal. If your corpus only has positive demonstrations, ORPO cannot replace an SFT run; it builds on top of one.
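For concreteness, a minimal sketch of that data contract using the datasets library; the rows are invented placeholders, and real data should pass SFT-grade quality filters:

```python
from datasets import Dataset

# Invented placeholder rows illustrating the prompt/chosen/rejected contract.
pairs = [
    {
        "prompt": "Summarize the following paragraph in one sentence: ...",
        "chosen": "The paragraph argues that regular backups reduce recovery time.",  # preferred response, also the NLL target
        "rejected": "idk, something about computers.",                                # dispreferred response, OR penalty target
    },
]
train_dataset = Dataset.from_list(pairs)
```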
Why the reference model disappears in the ORPO formulation
In DPO, the reference model serves as a probability anchor: the preference term measures how much the policy has moved relative to a fixed point. Without that anchor, you need a different way to express "prefer chosen over rejected" that does not require a second model.
ORPO (arXiv:2403.07691) uses the odds ratio of the live policy's output distributions directly. For a given prompt $x$, chosen response $y_w$, and rejected response $y_l$, the log-probability under the current policy $\pi_\theta$ is computed for both. The odds ratio is:
$$ \text{OR}_\theta(y_w, y_l \mid x) = \frac{\pi_\theta(y_w \mid x) \,/\, \bigl(1 - \pi_\theta(y_w \mid x)\bigr)}{\pi_\theta(y_l \mid x) \,/\, \bigl(1 - \pi_\theta(y_l \mid x)\bigr)} $$
The preference loss maximizes $\log \text{OR}_\theta(y_w, y_l \mid x)$, which pushes the policy to assign higher relative probability to $y_w$ than to $y_l$ — without any reference to where the model started. The ORPO paper is explicit: the method is designed for "eliminating the necessity for an additional preference alignment phase".
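To pin down the sign conventions, here is a small PyTorch sketch (not TRL's implementation) that computes the log odds ratio from sequence-level log-probabilities under the live policy. The toy values are illustrative; the actual objective works with length-normalized token log-probs:

```python
import torch
import torch.nn.functional as F

def log_odds_ratio(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor) -> torch.Tensor:
    """Log odds ratio of chosen vs. rejected under the current policy.

    Inputs are sequence log-probabilities log pi_theta(y|x).
    log(p / (1 - p)) is computed in log space as log p - log(1 - p),
    using log1p(-exp(log p)) for numerical stability.
    """
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    return log_odds_chosen - log_odds_rejected

# Toy values: the chosen sequence is more likely than the rejected one.
lp_w = torch.tensor(-0.7)   # log pi_theta(y_w | x), illustrative
lp_l = torch.tensor(-2.3)   # log pi_theta(y_l | x), illustrative
penalty = -F.logsigmoid(log_odds_ratio(lp_w, lp_l))   # the OR loss term defined below
print(penalty)
```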
OpenRLHF's DPO documentation still exposes --ref_offload and other reference-model controls, which illustrates by contrast how much operational surface the reference model occupies in a standard DPO workflow. ORPO removes that surface entirely — at the cost of removing the stability guarantee it provided.
Objective decomposition and the odds-ratio signal
The ORPO loss "adds an odds-ratio-based penalty to the conventional negative log-likelihood (NLL) loss" for differentiating favored and disfavored responses. Written in full:
$$ \mathcal{L}_{\text{ORPO}} = \underbrace{-\frac{1}{|y_w|} \sum_t \log \pi_\theta(y_{w,t} \mid x, y_{w,<t})}_{\mathcal{L}_{\text{SFT}}} \;+\; \lambda \cdot \underbrace{\left(-\log \sigma\!\left(\log \text{OR}_\theta(y_w, y_l \mid x)\right)\right)}_{\mathcal{L}_{\text{OR}}} $$
The SFT term is a standard token-averaged NLL on the chosen sequence. The OR term is a sigmoid-transformed log odds ratio, which keeps gradients bounded. Both terms operate on the same model parameters in the same backward pass.
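A hedged end-to-end sketch of that decomposition, written against per-token log-probabilities rather than TRL internals; the tensor names and shapes are assumptions for illustration, and ORPOTrainer implements the equivalent computation internally:

```python
import torch
import torch.nn.functional as F

def orpo_loss(logps_w, mask_w, logps_l, mask_l, beta=0.1):
    """logps_*: (batch, seq) per-token log pi_theta; mask_*: 1.0 on response tokens."""
    # Length-averaged sequence log-probabilities for chosen and rejected.
    avg_logp_w = (logps_w * mask_w).sum(-1) / mask_w.sum(-1)
    avg_logp_l = (logps_l * mask_l).sum(-1) / mask_l.sum(-1)

    # SFT term: token-averaged NLL on the chosen response.
    loss_sft = -avg_logp_w.mean()

    # OR term: sigmoid-transformed log odds ratio of chosen vs. rejected.
    log_odds = (
        (avg_logp_w - torch.log1p(-torch.exp(avg_logp_w)))
        - (avg_logp_l - torch.log1p(-torch.exp(avg_logp_l)))
    )
    loss_or = -F.logsigmoid(log_odds).mean()

    total = loss_sft + beta * loss_or
    # Returning per-term values enables the monitoring discussed later in this section.
    return total, {"loss_sft": loss_sft.item(), "loss_or": loss_or.item()}
```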
Watch Out: Over-weighting beta (the $\lambda$ coefficient) causes the OR term to dominate the NLL term. When that happens, the model's generation quality can degrade: fluency and format conformance drop because the likelihood signal is being crowded out by pairwise preference pressure. In practice, beta values above 0.3–0.5 frequently produce this failure mode. Start at the TRL default of 0.1 and increase only if preference separation on your eval set is clearly insufficient.
The supervised term and what it preserves
The SFT component of the ORPO loss is:
$$ \mathcal{L}_{\text{SFT}} = -\frac{1}{|y_w|} \sum_{t=1}^{|y_w|} \log \pi_\theta(y_{w,t} \mid x, y_{w,<t}) $$
This is identical to a standard causal language model fine-tuning loss over the chosen sequence, token-averaged to normalize for length. The term preserves the instruction-following signal that a separate SFT stage would have provided, and TRL's documentation characterizes it as providing "preference-aligned SFT" rather than alignment as a post-processing step.
Pro Tip: Because the SFT term operates at the token level, sequence formatting and chat templates remain load-bearing. A model trained with ORPO on data formatted with Llama's chat template will degrade on inference if the prompt template changes. This is not specific to ORPO, but it matters more here because the SFT and preference objectives share one training run — a chat-template mismatch corrupts both signals simultaneously rather than one stage at a time.
If the chosen-response data are weak — low-quality instructions, inconsistent format, or domain mismatch — the NLL term preserves that weakness at full gradient strength. ORPO does not sanitize input quality; it amplifies whatever is in the chosen column.
The odds-ratio penalty and how it reshapes ranking
The OR penalty term is:
$$ \mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\pi_\theta(y_w \mid x)}{1 - \pi_\theta(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{1 - \pi_\theta(y_l \mid x)}\right) $$
Minimizing this loss pushes the policy to assign higher odds to $y_w$ relative to $y_l$ on every training step. In arXiv:2403.07691, the evaluated Mistral-ORPO checkpoints also reached 11.33% and 12.20% on AlpacaEval 2.0, alongside 61.63% and 66.19% on IFEval instruction-level loose accuracy for the same pair of checkpoints. The ranking penalty materially separates generation styles while preserving instruction-following gains — but that result depends on clean pairwise labels.
Watch Out: The OR penalty is pairwise and sensitive to label consistency. When chosen and rejected responses are close in quality — annotation disagreement, near-tie preference pairs, or automatically labeled data with low confidence — the odds-ratio signal is noisy relative to its gradient magnitude. The model receives conflicting pressure: maximize separation where the signal is uncertain. This is a more acute problem for ORPO than for DPO with a strong reference model, because DPO's reference anchor damps the gradient when the policy is near the reference, providing a natural buffer against label noise.
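One mitigation is to drop near-tie pairs before training. The sketch below assumes hypothetical per-response quality scores (chosen_score, rejected_score) attached by your labeling pipeline; substitute whatever confidence signal you actually have:

```python
# Hypothetical score fields -- these are not standard dataset columns.
def keep_confident_pairs(example, min_margin=0.15):
    # Keep only pairs where the chosen response wins by a clear margin.
    return (example["chosen_score"] - example["rejected_score"]) >= min_margin

# train_dataset = train_dataset.filter(keep_confident_pairs)  # datasets.Dataset.filter
```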
Why removing the reference model changes optimization dynamics
In DPO, the frozen reference model functions as a KL regularizer embedded in the loss. The gradient of the preference term with respect to policy parameters is modulated by the policy-reference divergence: as the policy moves away from the reference, the gradient from the preference term is implicitly damped. This gives DPO a self-correcting property — aggressive preference updates that would destabilize the model are partially resisted by the reference anchor.
ORPO removes that correction mechanism. The gradient of $\mathcal{L}_{\text{OR}}$ with respect to $\pi_\theta$ depends only on the current policy's output on the chosen and rejected responses. There is no signal from where the model was at initialization. The SFT term provides some stability — because it keeps the model anchored to high-likelihood chosen responses — but that anchoring is qualitatively different from reference-model regularization. The SFT term pulls the policy toward the data distribution; the reference model pulls it toward its initialization. When preference data diverge from the SFT data distribution, these two pressures conflict differently in ORPO than in DPO.
The ORPO paper reports competitive benchmark results using a reference-free formulation, which demonstrates that alignment quality can be maintained without a frozen anchor model. But "competitive" is not "universally superior" — the dynamics are different and the conditions for stability are different.
Pro Tip: Because ORPO's only soft constraint on policy movement is the NLL term on chosen responses, its behavior is more sensitive to learning rate than DPO's. A learning rate appropriate for DPO may be too aggressive for ORPO — consider starting half an order of magnitude lower and monitoring the ratio of $\mathcal{L}_{\text{SFT}}$ to $\mathcal{L}_{\text{OR}}$ during training. If the OR loss dominates by an order of magnitude within the first few hundred steps, reduce beta before reducing learning rate.
Watch Out: Gradient coupling between the SFT and OR terms is the primary calibration risk in ORPO. When the two terms are poorly scaled, the combined gradient can oscillate: the SFT term increases the probability of the chosen response while the OR term simultaneously tries to decrease the relative probability of the rejected response, and the interaction can be non-monotone if chosen and rejected responses share long common prefixes. Monitoring per-term losses separately during training is not optional — it is the primary diagnostic surface.
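A minimal way to get that per-term visibility, without depending on any particular TRL release, is a generic transformers TrainerCallback that surfaces loss-like entries from the trainer's logs. The exact metric keys ORPOTrainer emits vary by version, so the filter below is deliberately loose:

```python
from transformers import TrainerCallback

class PerTermLossLogger(TrainerCallback):
    """Print any loss- or odds-ratio-related metrics the trainer logs."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        loss_keys = {k: v for k, v in logs.items() if "loss" in k or "log_odds" in k}
        print(f"step {state.global_step}: {loss_keys}")

# trainer.add_callback(PerTermLossLogger())  # attach to an ORPOTrainer instance
```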
Memory, throughput, and orchestration consequences
Removing the reference model has direct consequences at the infrastructure layer. In a standard DPO or RLHF stack on OpenRLHF, the reference model occupies a separate model slot in the Ray actor pool. OpenRLHF's Ray + vLLM distributed architecture schedules the reference model as a distinct process group, which means it consumes GPU memory — or requires offload to CPU with the --ref_offload option — even when it is not computing gradients. For a 70B model like Llama 3.1 70B in BF16, that reference copy costs roughly 140 GB of GPU memory before activations, quantization, or LoRA.
ORPO eliminates that allocation entirely. The saving is most visible in memory-constrained multi-GPU setups where the reference model would otherwise occupy an entire node or require CPU offloading that throttles throughput.
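The back-of-envelope arithmetic behind the figure above is worth keeping as a sanity check; parameter count and dtype width are the only inputs, and activations, optimizer state, and KV caches are excluded:

```python
# Approximate memory cost of a frozen BF16 reference copy of a 70B model.
params = 70e9            # Llama 3.1 70B, approximate parameter count
bytes_per_param = 2      # BF16
ref_copy_gb = params * bytes_per_param / 1e9
print(f"frozen reference copy: ~{ref_copy_gb:.0f} GB")   # ~140 GB before activations
```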
Bottom Line: ORPO's reference-free design removes one model instance from the training stack — eliminating both the GPU memory allocation and the orchestration layer that schedules, synchronizes, and checkpoints a second model. In TRL, this means the training script is a single-model job with no reference-model config surface. In OpenRLHF, it removes one actor from the Ray pool and one set of --ref_* parameters. The practical benefit scales with model size: for 7B models the saving is modest; for 70B+ models or LoRA runs on limited H100 capacity, removing the reference copy can be the difference between a run that fits and one that does not.
Where stability burden moves when the reference model is gone
Stability in DPO-style training has two sources: the reference model's KL regularization and the beta hyperparameter that controls how strongly that regularization is weighted. ORPO retains beta — TRL's ORPOConfig exposes it as the primary loss-weighting knob, defaulting to 0.1 — but removes the reference model that beta was originally modulating. In ORPO, beta controls the relative weight of the OR penalty against the NLL term, not the strength of KL regularization.
Watch Out: ORPO's beta and DPO's beta are not the same parameter despite sharing a name in TRL's API. DPO's beta controls policy-reference divergence; ORPO's beta controls OR-to-NLL weighting. Porting beta values directly from a DPO config to an ORPO config will produce a miscalibrated run. OpenRLHF's DPO documentation keeps beta explicitly tied to the KL penalty toward the reference, which makes the semantic difference visible. Start ORPO experiments at beta=0.1 and treat it as an independent hyperparameter requiring its own sweep.
If beta is too low, the OR penalty exerts negligible gradient force and the model trains essentially as a pure SFT run — preference separation will be weak. If beta is too high, the OR term dominates and generation quality degrades. The absence of a reference model narrows the stable range of beta because there is no secondary regularizer to absorb miscalibration.
Implementation implications in TRL and OpenRLHF
| Dimension | TRL ORPOTrainer | OpenRLHF |
|---|---|---|
| ORPO support status | Native, production-ready | Via non-RL DPO-style path; reference-free variant available |
| Reference model config | None required | --ref_offload and reference controls present for DPO; ORPO removes them |
| Primary config knob | ORPOConfig(beta=0.1) | Distributed training args + loss weight config |
| Dataset format | prompt/chosen/rejected preferred; implicit prompt extraction supported | Preference-pair format required |
| LoRA/QLoRA support | Yes, via PEFT integration | Yes, with tensor parallelism |
| Orchestration complexity | Single-process or multi-GPU DDP | Ray actor pool, vLLM rollout coordination |
| Best-fit scale | 7B–13B on single node | 30B–70B+ across node boundaries |
TRL and OpenRLHF address different operational scales. TRL's ORPOTrainer is designed for single-node or DDP multi-GPU runs with a Hugging Face-native dataset and model pipeline. OpenRLHF targets production-scale distributed training where tensor parallelism, pipeline parallelism, and vLLM-backed rollout matter.
What the TRL ORPO trainer expects from your dataset
TRL v0.27.0's ORPOTrainer supports both explicit and implicit prompts, meaning it can extract the prompt from the shared prefix of chosen and rejected columns when a prompt field is absent. In practice, rely on explicit formatting: a prompt column with the instruction, a chosen column with the preferred response, and a rejected column with the dispreferred response. Implicit extraction works but is opaque — the extracted prompt depends on the trainer's tokenization behavior and can be difficult to audit when debugging label noise issues.
Pro Tip: Because the ORPO NLL term trains on the chosen response as if it were a supervised fine-tuning target, the data quality bar for the chosen column is equivalent to what you would set for an SFT dataset — not just "better than rejected." Weak chosen responses (incomplete answers, format violations, hallucinations) will be reinforced by the NLL term even as the OR term tries to push the model away from rejected. This dual-signal sensitivity is why paired preference data for ORPO should pass the same quality filters you would apply to an SFT corpus.
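A minimal sketch of applying SFT-grade filters to the chosen column; the specific checks here are illustrative stand-ins for whatever filters you already run on SFT corpora:

```python
# Illustrative quality gates for the chosen column; tune or replace per corpus.
def chosen_passes_sft_bar(example, min_chars=32):
    chosen = example["chosen"].strip()
    return (
        len(chosen) >= min_chars                 # drop trivially short answers
        and not chosen.endswith("...")           # crude truncation heuristic
        and chosen != example["rejected"]        # drop degenerate pairs
    )

# train_dataset = train_dataset.filter(chosen_passes_sft_bar)
```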
How OpenRLHF changes the operational picture
OpenRLHF positions itself as a "production-ready open-source RLHF framework" built on Ray and vLLM. For ORPO experiments at 30B–70B+ parameter scale, it offers tensor parallelism and distributed batching that TRL does not natively handle. The reference-model absence simplifies one component of the actor pool, but the Ray orchestration layer — including actor scheduling, fault tolerance, and throughput tuning via micro-batch size and sample packing — remains non-trivial.
Watch Out: Moving to OpenRLHF for ORPO does not mean moving to a simpler system overall. It means trading reference-model complexity for distributed-system complexity. The Ray + vLLM stack requires careful configuration of tensor parallel degree, micro-batch sizes, and rollout buffer sizes, and the failure modes (OOM on the vLLM workers, Ray actor crashes, NCCL timeouts) are distinct from single-node TRL failures. Choose OpenRLHF when model size genuinely requires it, not as a default path for 7B experiments.
When ORPO is a better fit than DPO or KTO
Decision matrix:
- Choose ORPO when you are running post-training on an already instruction-tuned model, your preference dataset is high-quality and consistently labeled, your orchestration budget is limited, and you want to iterate quickly across many preference experiments without managing a two-model stack.
- Choose DPO when you need reference-anchored regularization for stability, your preference data have uncertain quality, you require explicit KL-divergence control, or you are working in a regime where eval regressions are hard to diagnose and the reference model provides a stable debugging anchor.
- Choose KTO when you have unpaired preference data (single responses labeled as good or bad rather than pairwise chosen/rejected), or when you want separate control over desirability and undesirability weights — KTO exposes desirable_weight and undesirable_weight as distinct hyperparameters, which gives a different loss surface than ORPO's coupled NLL + OR objective.
| Criterion | ORPO | DPO | KTO |
|---|---|---|---|
| Reference model required | No | Yes | No |
| Data format needed | Paired (chosen/rejected) | Paired (chosen/rejected) | Unpaired or paired |
| Primary stability control | beta (OR weight) | beta (KL weight) + ref model | beta + desirability weights |
| Pipeline stages | 1 (SFT + preference combined) | 2 (SFT then DPO) or 1 with init | 1 |
| Label noise sensitivity | High | Moderate | Moderate |
| Orchestration overhead | Low | Moderate | Low |
Cases where ORPO is the right default
ORPO is most compelling when you start from a strong instruction-tuned checkpoint — a model that already follows chat format reliably — and have a clean pairwise preference dataset. The paper's 7B-scale Mistral-ORPO checkpoints achieved MT-Bench scores above 7.2 and IFEval loose accuracy above 61%, which demonstrates that the monolithic objective can reach DPO-competitive performance under favorable data conditions.
Pro Tip: ORPO is especially well-suited when limited H100 capacity prevents running a full reference model alongside the policy. For LoRA or QLoRA runs where VRAM is the binding constraint, removing the reference model can allow a larger effective batch size or longer sequence length in the same hardware envelope — and both of those factors improve ORPO stability more reliably than increasing beta.
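A hedged sketch of what such a run looks like with TRL's PEFT integration; the model name, LoRA ranks, and dataset are placeholders, and the tokenizer argument name differs across TRL releases:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "your-org/your-instruction-tuned-7b"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
args = ORPOConfig(output_dir="orpo-lora-run", beta=0.1, learning_rate=1e-5)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # prompt/chosen/rejected pairs as shown earlier
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
    peft_config=peft_config,         # train LoRA adapters instead of full weights
)
trainer.train()
```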
Cases where DPO or KTO may still win
DPO retains an advantage when dataset noise is a known risk. If your preference pairs were labeled by a weak reward model, crowd-sourced with low agreement, or synthesized automatically, the reference model's regularization provides a buffer that ORPO does not.
Watch Out: ORPO's tight coupling of SFT and preference gradients makes it brittle when chosen and rejected responses are weakly distinguished. In that regime, the OR penalty receives noisy gradient signal while the NLL term continues pushing the model toward the (potentially mediocre) chosen responses. The result can be a model that degrades on MT-Bench or Arena-Hard style evals while showing improved pairwise preference accuracy on the training distribution — a classic preference overfitting signature. If you see benchmark regressions alongside training-set preference wins, switch to DPO with a conservative beta before tuning further. KTO may also win in this scenario because its utility-based objective is less sensitive to within-pair noise when labels are provided at the response level rather than the pair level.
Failure modes, tuning traps, and evaluation signals
| Benchmark | Mistral-ORPO-α | Mistral-ORPO-β | Notes |
|---|---|---|---|
| MT-Bench | 7.24 | 7.32 | Conversational quality, multi-turn |
| IFEval (instr. loose) | 61.63% | 66.19% | Instruction-following compliance |
| AlpacaEval 2.0 | 11.33% | 12.20% | Pairwise preference vs. GPT-4 |
Source: arXiv:2403.07691. The contrast between a strong MT-Bench score (7.32) and a modest AlpacaEval 2.0 win rate (12.20%) for the same checkpoint illustrates that benchmark choice materially affects perceived performance. A run that looks like a win on MT-Bench may look flat or regressive on AlpacaEval 2.0, and vice versa.
Hyperparameters that most affect ORPO stability
The four most sensitive knobs, in order of impact (a starting-point config encoding them follows the list):
1. beta (OR loss weight). TRL defaults this to 0.1 in ORPOConfig examples. Values above 0.3 frequently cause generation quality degradation by crowding out the NLL term. Values below 0.01 make ORPO functionally equivalent to SFT.
2. Learning rate. Without a reference model anchoring the policy, the effective update magnitude per step is higher at the same learning rate compared to DPO. Start at 1×10⁻⁵ or lower for 7B models; 5×10⁻⁶ is a reasonable starting point for 13B+.
3. Max sequence length. Longer sequences amplify memory pressure and can cause the NLL and OR terms to compete on batches with variable chosen/rejected lengths. Truncate consistently — asymmetric truncation between chosen and rejected is a common silent failure mode.
4. Batch size and gradient accumulation. The OR signal is computed within each batch's chosen/rejected pairs. Small effective batch sizes increase variance in the odds-ratio gradient, which compounds instability when beta is nonzero.
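As promised above, a starting-point config that encodes those recommendations for a 7B run; all values are illustrative and should be swept per dataset:

```python
from trl import ORPOConfig

args = ORPOConfig(
    output_dir="orpo-7b-baseline",
    beta=0.1,                          # TRL default; raise only if preference separation is weak
    learning_rate=1e-5,                # lower than a typical DPO setting
    max_length=2048,                   # applied consistently to chosen and rejected
    max_prompt_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # keep the effective batch large enough for a stable OR signal
)
```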
Watch Out: Loss scaling interactions can be subtle. If you use mixed-precision training with gradient scaling (standard for H100 BF16 runs), the combined ORPO loss can hit a different scale from what a pure SFT loss would produce, causing the gradient scaler to behave differently than expected. Monitor gradient norm per term, not just combined loss, during the first 200–500 steps of any new run.
How to read eval regressions after an ORPO run
A gain on one benchmark alongside a loss on another is diagnostic information, not just noise. The ORPO paper's cross-benchmark spread — where IFEval (66.19%) and AlpacaEval 2.0 (12.20%) measure qualitatively different capabilities on the same checkpoint — confirms that instruction-following and conversational preference metrics are not interchangeable.
Pro Tip: Run all three benchmark categories after every ORPO experiment: a multi-turn conversational quality benchmark (MT-Bench), an instruction-compliance suite (IFEval), and a pairwise preference benchmark (Arena-Hard or AlpacaEval 2.0). Movement patterns tell you what went wrong. If MT-Bench drops but IFEval holds, the model lost multi-turn coherence — likely a sequence-length or chat-template issue. If IFEval drops but MT-Bench holds, instruction-following compliance degraded — likely preference overfitting to style rather than content. If both drop, beta is too high or the learning rate is too large.
Use benchmark triage across at least two suites before concluding that an ORPO run is better or worse than baseline. Single-metric decisions on preference-aligned models are notoriously unreliable — a model can overfit the preference distribution and score higher on pairwise evals while degrading on held-out instruction tasks.
FAQ
How does ORPO differ from DPO?
ORPO combines NLL on the chosen response with an odds-ratio penalty on chosen versus rejected in one loss, while DPO computes a contrastive preference loss relative to a frozen reference model. The practical difference is that ORPO removes the reference-model anchor and the extra forward pass, so the training stack is simpler but the optimization is less buffered against noisy pairs.
Why remove the reference model?
ORPO computes the odds ratio from the live policy only, so no frozen anchor is needed. That makes the method lighter to run and easier to fit into a single training job, while DPO keeps the reference model because it provides KL regularization and a stability buffer.
Is ORPO better than DPO?
ORPO is competitive at 7B scale with clean data, but it is generally more brittle when labels are noisy. DPO remains the safer default when reference-anchored stability matters more than reducing memory overhead.
What is the primary config knob?
For ORPO, the primary knob is beta, which weights the OR term against the NLL term. In DPO, beta instead controls KL penalty strength relative to the frozen reference; in KTO, beta is combined with separate desirability weights.
What data format does ORPO require?
ORPO expects paired preference data in prompt/chosen/rejected form. TRL can infer an implicit prompt from shared prefixes, but explicit columns are easier to audit and less error-prone when you are debugging label noise.
How much orchestration complexity does ORPO add?
ORPO keeps orchestration overhead low because it runs as a one-model job. DPO adds a reference model slot, extra synchronization, and more config surface, while KTO also stays one-model but solves a different data problem.
Sources and references
- ORPO paper (arXiv:2403.07691) — primary algorithmic reference for the objective formulation, benchmark results, and the reference-free design.
- Hugging Face TRL ORPO Trainer docs — ORPOTrainer API, ORPOConfig parameters, dataset format, and the quoted description of the SFT and preference signals.
- TRL v0.27.0 ORPO Trainer docs — explicit and implicit prompt handling, and dataset extraction behavior.
- TRL v0.11.0 ORPO Trainer docs — ORPOConfig beta=0.1 default and loss-weighting examples.
- OpenRLHF documentation — distributed architecture, DPO reference-model controls, and the operational comparison surface.
- OpenRLHF non-RL training docs — DPO beta and reference-model config surface.
- OpenRLHF common options — --ref_offload and reference-model handling options.
- TRL additional training methods (DeepWiki summary) — KTO hyperparameter surface: desirable_weight and undesirable_weight.
Keywords: ORPO, DPO, KTO, TRL v1.0, OpenRLHF, Hugging Face TRL DPOTrainer, Llama 3.1 70B, Axolotl, LoRA, QLoRA, MT-Bench, Arena-Hard, IFEval, arXiv 2403.07691, H100

