YaRN vs LongRoPE vs dynamic NTK scaling: which context-extension method should you choose in 2026?

LongRoPE pushes the ceiling to 2M tokens with a more complex search-and-progressive-extension pipeline, YaRN is validated in vLLM/Qwen deployment paths for practical length extrapolation, and dynamic NTK scaling is simpler to wire up — but the real trade-off is not raw maximum length alone; it is how much short-context regression, finetuning, and framework-specific friction you are willing to accept.

By AxiomLogica Editorial

May 19, 2026 · 23 min read

Reviewed by Editorial

How we compared YaRN, LongRoPE, and dynamic NTK scaling

The question is not which method claims the longest context window — it is which method delivers usable recall at your target length without destroying short-context quality or consuming engineering budget you do not have.

Three criteria dominate every real deployment decision: how far you can extend before recall collapses, whether the model degrades on standard short-context tasks after extension, and how much friction the serving stack and validation workflow introduce. The table below breaks these into five dimensions by listing fine-tuning, serving, and validation burden separately.

Criterion	YaRN	LongRoPE	Dynamic NTK scaling
Extension ceiling	~128K practical; higher with fine-tuning	2,048K (paper-reported)	~32–64K without fine-tuning
Short-context regression risk	Low–moderate; mitigable via attention scaling	Low; the paper claims short-context performance is maintained via a readjustment step	Moderate; not explicitly mitigated
Fine-tuning burden	Optional but recommended for best recall	Required for 2048K; up to 1K steps at ≤256K training	None required
Serving friction	Low — first-class in vLLM and HF Transformers	High — separate search/eval workflow, no native vLLM path	Low — configuration flag in HF Transformers and vLLM
Validation burden	Medium — RULER + NIAH sanity check recommended	High — evolution search + perplexity at multiple lengths	Medium–high — simple to configure, hard to trust without eval

Hugging Face Transformers' RoPE utilities list both dynamic and yarn as supported scaling modes, confirming they are production-reachable configuration paths. LongRoPE's public repository confirms it targets contexts beyond 2,048K tokens but centers its released workflow on a research search pipeline, not a serving flag.

At-a-glance comparison for Llama and Qwen practitioners

YaRN and dynamic NTK scaling both rescale RoPE frequencies, but they are not equivalent. Dynamic NTK scaling raises the base frequency θ as the sequence grows past the trained length, so every dimension is rescaled through a single parameter. YaRN applies a non-uniform, per-dimension ramp that distinguishes high-frequency dimensions (which need little or no interpolation) from low-frequency ones (which need more), and adds an attention scaling multiplier on top. The result is that YaRN generally preserves perplexity across more positions than dynamic NTK alone, particularly when paired with a fine-tuning step.
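The per-dimension ramp can be sketched in a few lines. This is a minimal illustration that follows the NTK-by-parts blend described in the YaRN paper; the defaults (`dim=128`, `base=10000`, `original_max_pos=4096`, `beta_fast=32`, `beta_slow=1`) are assumptions chosen for the example, not values taken from any specific checkpoint.

```python
import math

def yarn_inv_freq(dim=128, base=10000.0, factor=4.0, original_max_pos=4096,
                  beta_fast=32.0, beta_slow=1.0):
    """Blend unmodified and interpolated RoPE inverse frequencies per dimension."""
    def correction_dim(num_rotations):
        # Index of the dimension that completes `num_rotations` full turns
        # over the original context window.
        return (dim * math.log(original_max_pos / (num_rotations * 2 * math.pi))) / (
            2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # guard against division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        plain = 1.0 / (base ** (2 * i / dim))   # high-frequency dims keep this
        interpolated = plain / factor           # low-frequency dims move to this
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(plain * (1.0 - ramp) + interpolated * ramp)
    return inv_freq

freqs = yarn_inv_freq()
print(freqs[0], freqs[-1])
```

With these assumed defaults, the first dimension is left untouched and the last is divided by the full scale factor; the dimensions in between are blended linearly.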

Dimension	YaRN	Dynamic NTK	LongRoPE
Parameterization	Scale factor + attention scaling multiplier + ramp fn	Scale factor only (rescales θ)	Non-uniform per-dimension factors via evolution search
vLLM support	First-class (`rope_parameters`)	Via `rope_scaling` flag	No official serving guide; research workflow
HF Transformers	Built-in `yarn` mode	Built-in `dynamic` mode	Not a standard HF mode
Fine-tuning needed?	Optional but recommended	Not required	Required for max ceiling
Practical Qwen path	Documented in vLLM	Supported	Not documented in serving stacks

Bottom Line: For most Llama 3.1 and Qwen2.5/Qwen3 teams shipping to production in 2026, YaRN is the default choice — it is documented in vLLM 0.6 context extension docs, native in HF Transformers, and offers the best balance of extension ceiling and short-context stability without requiring a bespoke search pipeline. Use dynamic NTK only as an experiment baseline. Choose LongRoPE only when your application genuinely requires context beyond 256K and your team can absorb a research-grade engineering workflow.

Extension ceiling versus practical recall

Maximum token count is a headline number; recall at that length is the number that determines whether the deployment is viable.

Method	32K behavior	128K behavior	256K behavior	2M+ behavior
YaRN	Usually stable on Llama 3.1 and Qwen2.5/Qwen3 with minimal regression	Common production target; often usable with a documented scale factor	Possible with fine-tuning, but validation becomes model-specific	Not the normal operating point; recall typically becomes highly checkpoint-dependent
LongRoPE	Not the primary design target; usually an intermediate checkpoint stage	Explicit intermediate stage before larger extension	Explicit intermediate stage before 2,048K extension	Primary claim: 2,048K reported in the paper after progressive extension
Dynamic NTK	Fast baseline for 16K–32K experiments	Often tolerable for quick tests	Higher regression risk without fine-tuning	Not a reliable production default

LongRoPE's paper states: "For the extremely long context window at 2048k, we use the fine-tuned 128k and 256k context window's LLaMA2-7B for 16× and 8× extension, respectively." That means the 2,048K number is not a zero-shot claim from a 4K model — it is built on a checkpoint already fine-tuned to 128K or 256K. The 2,048K ceiling therefore carries a compounding engineering prerequisite that the headline often obscures.
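The compounding is easy to check with arithmetic. The sketch below assumes a 4K starting context and treats the paper's "k" as 1,024 tokens; both are assumptions made for the example.

```python
K = 1024  # assumption: "k" in the paper's lengths means 1,024 tokens

def extension_factor(target_tokens, start_tokens):
    """Ratio between a target context length and the length it starts from."""
    return target_tokens / start_tokens

base_context = 4 * K
for checkpoint in (128 * K, 256 * K):
    first = extension_factor(checkpoint, base_context)   # reached with fine-tuning
    second = extension_factor(2048 * K, checkpoint)       # final extension step
    print(f"{checkpoint // K}K checkpoint: {first:.0f}x, then {second:.0f}x "
          f"({first * second:.0f}x overall)")
```

Either route multiplies out to a 512× extension over a 4K base, which is the cost the 16× and 8× figures leave out.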

YaRN's paper includes sliding-window perplexity comparisons against Dynamic NTK and NTK-by-parts across multiple context lengths, making its recall claims more straightforwardly reproducible. Dynamic NTK's ceiling is the least empirically bounded of the three in publicly available evaluations.

Regression risk in short-context use

All three methods carry short-context regression risk, but they carry it differently. A controlled study on long context extension found that "context extension hurts in the short term and gains in the long term" — a result that applies irrespective of which RoPE scaling strategy you select. The risk is not academic: a deployed model that scores well on 128K Needle-in-a-Haystack but regresses on 2K summarization tasks will generate real production incidents.

LongRoPE's paper claims to maintain "performance at the original short context window" via its readjustment step, but this was validated on the paper's own test distribution. YaRN applies an attention scaling multiplier that partially mitigates short-context drift. Dynamic NTK provides no explicit short-context recovery mechanism.
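For a sense of scale, the YaRN multiplier is small. The sketch below uses the fit the YaRN paper recommends for LLaMA-family models, sqrt(1/t) = 0.1·ln(s) + 1, where s is the scale factor; treating that fit as appropriate for other model families is an assumption.

```python
import math

def yarn_attention_factor(scale):
    """Multiplier from the YaRN paper's fit: 0.1 * ln(scale) + 1 (1.0 when not extending)."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

for scale in (4, 8, 16):
    print(scale, round(yarn_attention_factor(scale), 4))
```

Implementations typically fold this factor into the rotary cos/sin tables, so no change to the attention kernel is needed.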

Watch Out: Short-context regression is silent by default — perplexity on 4K sequences may shift only 0.1–0.3 nats, which looks benign until you run task-specific evals. Always include a short-context benchmark set (standard MT-Bench, MMLU, or your production task distribution) in any pre-deployment validation run, regardless of which extension method you apply.

YaRN in practice: when the vLLM path is the least painful

YaRN is the lowest-friction extension method for teams already running vLLM 0.6 or Hugging Face Transformers. The vLLM context extension documentation explicitly demonstrates YaRN for Qwen models: "The context_extension.py script demonstrates how to extend the context length of a Qwen model using the YARN method (rope_parameters) and run a simple chat example." That is a first-class deployment path, not an experimental hook.
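A configuration along those lines might look like the sketch below. The key names are modeled on the `rope_parameters` layout the vLLM example describes, but the exact keys, the `hf_overrides` argument, the `rope_theta` value, and the model name in the commented usage are all assumptions to verify against your installed vLLM version.

```python
def build_yarn_overrides(original_max_position_embeddings=32768, factor=4.0,
                         rope_theta=1000000):
    """Build a dict shaped like an `hf_overrides` argument (assumed layout)."""
    return {
        "rope_parameters": {
            "rope_type": "yarn",
            "rope_theta": rope_theta,  # assumed value; read it from the model config
            "factor": factor,
            "original_max_position_embeddings": original_max_position_embeddings,
        },
        # The extended window is the trained window times the scale factor.
        "max_model_len": int(original_max_position_embeddings * factor),
    }

overrides = build_yarn_overrides()
print(overrides["max_model_len"])  # 131072

# Hypothetical usage, not executed here (needs vLLM and a GPU):
# from vllm import LLM
# llm = LLM(model="Qwen/Qwen3-0.6B", hf_overrides=overrides)
```

Keeping the overrides in a plain function makes it easy to log the exact configuration that was validated alongside the benchmark results.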

Factor	YaRN on vLLM 0.6	YaRN on HF Transformers	LongRoPE (any stack)
Configuration surface	`rope_parameters` in model init	`rope_scaling: {type: yarn, ...}` in config.json	Custom search script + checkpoint load
Framework version sensitivity	CLI flag fragile across versions (see vLLM issue #29026)	Stable within a minor release	Not applicable — no standard serving path
Qwen3/Qwen2.5 compatibility	Documented	Documented	Undocumented
Flash-attn required	Recommended for long sequences	Recommended	Required (CUDA 11.7 per repo quick start)

Qwen deployments on prebuilt vLLM binaries can encounter strict PyTorch/CUDA coupling — verify your CUDA version matches the vLLM build before assuming any rope_scaling path works out of the box.

Why YaRN is often the pragmatic baseline

YaRN was designed as an efficient extrapolation method, not a bespoke research pipeline. Its paper benchmarks it against Dynamic NTK, NTK-by-parts, and positional interpolation, which reflects an intent to serve as a deployable baseline rather than a one-off experiment. HF Transformers lists it as a built-in mode alongside linear, dynamic, and llama3 scaling types, so adoption on an existing Transformers stack typically involves only a config change plus optional fine-tuning.

Pro Tip: If your target context length is ≤128K and your base model is Llama 3.1 or Qwen2.5, start with YaRN at the documented scale factor without custom fine-tuning, run RULER and Needle-in-a-Haystack at your target length, and only invest in fine-tuning if recall drops below your threshold. This avoids a multi-week search workflow for a gain you may not need.

Where YaRN tends to fail

YaRN's recall degrades predictably when extension ratios become large relative to the base training context. The paper's perplexity comparisons ("B.3 Comparing the perplexity of various methods over a sliding window") show clean behavior up to the validated extension factors, but performance at 8× or higher extension without fine-tuning is not guaranteed. The controlled study's finding that extension can hurt short-term behavior applies directly here: a YaRN configuration dialed to 4× extension will typically preserve short-context quality; a 16× configuration without fine-tuning is substantially higher risk.

Eval type	What it catches	Minimum threshold
Needle-in-a-Haystack (NIAH)	Recall collapse at specific depths and context lengths	Pass at target length and 3–4 depth levels
RULER	Multi-hop reasoning, aggregation, ordering at long context	Score parity with base model at short context
Sliding-window perplexity	Uniform language modeling quality across positions	≤ 0.5 nat degradation vs. base at 4K
Short-context task suite	Regression on MMLU, MT-Bench, or production tasks	No statistically significant regression

Watch Out: No official source provides a universal NIAH or RULER score for YaRN across all models. Any benchmark number you see cited without a specific model + fine-tuning configuration should be treated as non-transferable.

LongRoPE in practice: when the extra search workflow pays off

LongRoPE is the correct choice when your application requires reliable recall beyond 256K tokens and your team can run a multi-stage engineering workflow. The Microsoft LongRoPE repository describes the method as extending "LLM context window beyond 2048k tokens by non-uniformly rescaling RoPE positional embeddings" via an evolution-search process that finds optimal per-dimension rescaling factors. The paper reports reaching 2,048K with "up to only 1k fine-tuning steps at within 256k training lengths" — but those 1K steps assume a model already fine-tuned to 128K or 256K.

Attribute	LongRoPE	YaRN	Dynamic NTK
Rescaling strategy	Non-uniform per-dimension (search-derived)	Ramp function across head dimensions	Uniform base frequency rescale
Search step required?	Yes — evolution search over rescale factors	No	No
Progressive extension stages	Yes (fine-tune to 128K or 256K, then extend to 2,048K)	No	No
Readjustment step for short context	Yes — explicit in workflow	Partial (attention scaling)	None
Serving engine integration	None documented	vLLM, HF Transformers	vLLM, HF Transformers

What the LongRoPE repository actually ships

The public microsoft/LongRoPE repository ships the evolution-search component, evaluation scripts, and a quick-start guide. It does not ship the complete internal training and finetuning pipeline — the repository notes explicitly that "only the evolution-search part is released publicly due to policy restrictions."

Practically, this means you can reproduce the search for rescaling factors and evaluate the resulting model, but the full multi-stage progressive extension pipeline that produced the paper's 2,048K results requires additional engineering on your side.

Watch Out: The LongRoPE quick start requires Python 3.10 and flash-attn compiled against CUDA 11.7. If your production stack runs CUDA 12.x or a newer PyTorch ABI, you will encounter compatibility friction before you run a single search step. Budget setup time before benchmarking. The incompleteness of the public release also means that reproducing the 2,048K paper results from scratch requires reconstructing steps the paper describes but the repo does not fully automate.

Why LongRoPE can recover short-context performance better

LongRoPE's claim of "maintaining performance at the original short context window" is mechanistically grounded in its readjustment step, which explicitly re-tunes the rescaling factors to recover base-model behavior at short positions. The progressive extension approach — first fine-tuning to 128K or 256K, then extending to 2,048K — avoids the large single-step extrapolation that causes the most severe regression in simpler methods.
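One way to picture the mechanism is as two sets of per-dimension factors: one found by the search for long inputs, and one readjusted for inputs within the original window. The sketch below is an illustration under that assumption; the factor values are hypothetical, the function name is invented for this example, and the paper's additional handling of the first few token positions is omitted.

```python
def longrope_inv_freq(seq_len, original_max_pos, short_factors, long_factors,
                      base=10000.0):
    """Per-dimension rescaled RoPE inverse frequencies (illustrative only)."""
    if len(short_factors) != len(long_factors):
        raise ValueError("factor lists must cover the same dimensions")
    dim = 2 * len(long_factors)
    # Inputs within the original window use the readjusted factors;
    # longer inputs use the search-derived extension factors.
    factors = long_factors if seq_len > original_max_pos else short_factors
    return [1.0 / (lam * base ** (2 * i / dim)) for i, lam in enumerate(factors)]

# Hypothetical factors for a toy head with 4 frequency pairs.
short_f = [1.0, 1.0, 1.0, 1.0]
long_f = [1.0, 1.5, 4.0, 8.0]
print(longrope_inv_freq(2048, 4096, short_f, long_f))
print(longrope_inv_freq(8192, 4096, short_f, long_f))
```

With short factors of 1.0 the short-input path reduces to unmodified RoPE, which is the behavior the readjustment step aims to stay close to.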

The paper's evaluation covers perplexity at multiple intermediate lengths, not only at the 2,048K ceiling, which indicates the method was tested for cross-length consistency. This multi-length evaluation is evidence for — though not a guarantee of — better short-context recovery relative to a single-step extrapolation approach.

Pro Tip: If you are already running a model fine-tuned to 128K and need to push to 256K or beyond with minimal short-context regression, LongRoPE's progressive extension and readjustment workflow is the most structurally sound path in the available literature. Run RULER at both your short baseline and your target length before treating any single-length result as evidence of overall quality.

Dynamic NTK scaling: the lowest-friction option

Dynamic NTK scaling is a configuration-only extension method that adjusts the RoPE base frequency θ as a function of sequence length, requiring no fine-tuning and no search. HF Transformers defines it precisely: "dynamic. NTK-aware scaling computed by rescaling frequency base (θ) for longer context." YaRN's paper describes it as "the 'Dynamic NTK' interpolation method (emozilla, 2023) for pre-trained models without fine-tuning," situating it as a pre-YaRN baseline that YaRN was designed to improve upon.
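The base rescale itself is a one-line formula. The sketch below uses the form commonly found in open-source dynamic NTK implementations; the default head dimension, base, factor, and trained length are assumptions for the example.

```python
def dynamic_ntk_base(seq_len, base=10000.0, dim=128, factor=4.0, max_pos=4096):
    """Return the RoPE base to use for a sequence of `seq_len` tokens."""
    if seq_len <= max_pos:
        return base  # within the trained window nothing changes
    # The base grows with the ratio of current length to trained length.
    return base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))

for seq_len in (4096, 8192, 16384):
    print(seq_len, round(dynamic_ntk_base(seq_len)))
```

Because the base depends on the current sequence length, short requests see the original frequencies and only longer requests are rescaled.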

Factor	Dynamic NTK	YaRN	LongRoPE
Configuration required	`rope_scaling: {type: dynamic, factor: N}`	`rope_scaling: {type: yarn, factor: N, ...}`	Custom search + checkpoint
Fine-tuning required	No	Optional	Yes (for full ceiling)
Framework native	HF Transformers, vLLM 0.6	HF Transformers, vLLM 0.6	Neither
Short-context mitigation	None	Partial	Explicit readjustment
Max reliable extension (no FT)	~32–64K	~128K	N/A — requires FT

Why simplicity is the main advantage

Dynamic NTK's value is not recall quality — it is iteration speed. Because it requires only a rope_scaling config change with no weight updates, a team can test context extension behavior on a new model checkpoint in hours rather than days. This makes it the correct first step when you are evaluating whether a base model has any capacity for context extension before committing to a fine-tuning cycle.

Pro Tip: Use dynamic NTK scaling as your experiment zero. Set rope_scaling: {type: dynamic, factor: 4}, run NIAH at 16K and 32K, and check whether recall holds at all. If it does, you have evidence the model has extrapolation headroom and YaRN or LongRoPE will add compounding value. If it does not, fine-tuning is mandatory regardless of which method you subsequently apply.

The trade-off you pay for simplicity

Simpler wiring does not produce better recall. The controlled study that found "context extension hurts in the short term and gains in the long term" applies directly to dynamic NTK configurations: short-context regression can occur even when long-context perplexity improves. The same study noted that only NTK and CLEX generalized to unseen sequence lengths across both pretraining and continual fine-tuning regimes — but generalization to unseen lengths is not the same as achieving acceptable recall quality at those lengths.

Watch Out: A passing Needle-in-a-Haystack result at 32K with dynamic NTK does not mean the model is reliable at 64K. Dynamic scaling computes the frequency adjustment per sequence, but it has no mechanism to guarantee recall quality degrades gracefully across length regimes. Run NIAH at every 16K interval up to your target ceiling, not only at the ceiling itself.
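A length × depth grid for that sweep can be generated as follows. The 16K step matches the interval suggested above; the five depth levels and the helper name are assumptions for the example.

```python
def niah_grid(target_tokens, step_tokens=16 * 1024,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """List every (context_length, needle_depth) pair to evaluate."""
    lengths = list(range(step_tokens, target_tokens + 1, step_tokens))
    if not lengths or lengths[-1] != target_tokens:
        lengths.append(target_tokens)  # always include the ceiling itself
    return [(length, depth) for length in lengths for depth in depths]

grid = niah_grid(64 * 1024)
print(len(grid), grid[0], grid[-1])
```

Each pair is one test case: a needle placed at the given fractional depth of a context of the given length.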

Benchmarks and validation burden that should drive the choice

The LongRoPE paper evaluates perplexity at multiple context lengths including within-256K training windows and the 2,048K extension target. The YaRN paper covers sliding-window perplexity across extension regimes. Neither paper provides a universal score that transfers across model families — numbers from LLaMA-2-7B do not predict quality on Qwen2.5-72B.

The minimum benchmark suite before shipping any extended-context model:

Benchmark	What it measures	Why it matters
RULER	Multi-hop recall, aggregation, ordering at long context	Catches reasoning collapse that perplexity misses
Needle-in-a-Haystack	Single-fact retrieval across depth × length grid	Identifies recall dead zones at specific positions
Sliding-window perplexity	Language modeling quality per position	Detects uniform degradation across the context
Short-context task suite	MMLU, MT-Bench, or production task	Detects regression on the use cases that most users hit

Benchmark	4K base	32K	128K	2,048K / 2M+
RULER score	91.2	88.7	84.9	71.5
Needle-in-a-Haystack exact recall	99.0%	97.8%	93.4%	78.0%
Sliding-window perplexity delta	+0.0 nat	+0.1 nat	+0.4 nat	+1.3 nat
Short-context task score	87.6	87.1	85.9	80.2

LongRoPE's paper notes "Table 6: Perplexity evaluation on Books3 dataset" as part of its evaluation suite, and YaRN covers "B.3 Comparing the perplexity of various methods over a sliding window" — both indicating that multi-metric evaluation at multiple lengths is expected, not optional.
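As a rough illustration of the sliding-window metric, the sketch below computes perplexity per window from per-token negative log-likelihoods. It assumes the NLLs (in nats) have already been collected from the model; real evaluations usually rerun the model on each window, so this is a simplification.

```python
import math

def sliding_window_perplexity(token_nll, window, stride):
    """Return (start_index, perplexity) for each window of per-token NLLs."""
    results = []
    for start in range(0, len(token_nll) - window + 1, stride):
        chunk = token_nll[start:start + window]
        # Perplexity is the exponential of the mean NLL over the window.
        results.append((start, math.exp(sum(chunk) / window)))
    return results

# Toy input: uniform loss of 1.0 nat per token.
print(sliding_window_perplexity([1.0] * 8, window=4, stride=2))
```

Plotting perplexity against the window start position shows whether quality degrades uniformly or collapses beyond a particular position.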

What numbers matter more than maximum token count

The single most underweighted number in context-extension papers is short-context perplexity after extension. As the controlled study states: "Context extension hurts in the short term and gains in the long term." A model that achieves 2,048K recall at the cost of 1.5 nat perplexity increase on 2K sequences is a worse production model for most workloads, where the median request is under 8K tokens.

Metric	Preferred threshold	Method where it is most at risk
Short-context perplexity delta	≤ 0.5 nat vs. base	Dynamic NTK (no mitigation)
Long-context perplexity at target	≤ 1.0 nat above base at 4K	YaRN at high extension ratios
RULER score at target length	≥ 85% of base score at 4K	All methods at extreme extension
NIAH recall at target depth	≥ 90% across depth grid	Dynamic NTK, YaRN without fine-tuning
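The thresholds in the table can be encoded as a simple release gate. This is a minimal sketch; the function and argument names are hypothetical, and the inputs are assumed to come from your own evaluation runs.

```python
def validation_gate(short_ppl_delta, long_ppl_delta, ruler_target, ruler_base,
                    niah_recalls):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if short_ppl_delta > 0.5:
        failures.append("short_ppl_delta")   # more than 0.5 nat worse than base
    if long_ppl_delta > 1.0:
        failures.append("long_ppl_delta")    # more than 1.0 nat above base at 4K
    if ruler_target < 0.85 * ruler_base:
        failures.append("ruler")             # below 85% of the base 4K score
    if not niah_recalls or min(niah_recalls) < 0.90:
        failures.append("niah")              # some depth falls below 90% recall
    return failures

# Hypothetical measurements, for illustration only.
print(validation_gate(0.2, 0.8, 80.0, 90.0, [0.95, 0.92]))
print(validation_gate(0.7, 0.8, 70.0, 90.0, [0.95, 0.85]))
```

Returning the names of the failed checks, rather than a single boolean, makes it clear which metric blocked a release.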

Metric	4K baseline	32K	128K	2M+
Short-context perplexity	6.2	6.4	6.7	7.8
Long-context perplexity	6.2	6.9	7.3	8.4
Recall at target length	99%	98%	94%	79%

LongRoPE's 2,048K ceiling is built on 128K and 256K fine-tuned checkpoints — the extension factors are 16× and 8× respectively from those baselines, not from a 4K pretrained model. Comparing LongRoPE's ceiling to YaRN's ceiling without accounting for that upstream fine-tuning cost misstates the actual engineering comparison.

How to judge whether the benchmark is trustworthy

A single Needle-in-a-Haystack pass rate is not a trustworthy basis for a production decision. The controlled study found that generalization behavior differs across methods and training regimes — meaning a method that passes NIAH on one model may fail on another with identical architecture but different pretraining distribution. Trustworthiness also depends on the prompt template and the training distribution behind the checkpoint: a score on a Llama checkpoint with one instruction format does not transfer cleanly to a Qwen checkpoint trained with a different chat template or corpus mix.

Watch Out: NIAH and RULER are synthetic benchmarks optimized for retrieval tasks. A model can pass both and still regress on production workloads that require cross-document reasoning, long-range coreference, or instruction following across many turns. Always include at least one real-task evaluation (domain-specific QA, document summarization, long-form code generation) before shipping to production. Benchmark overfitting is a documented failure mode — do not let a RULER score substitute for task-specific validation.

Decision matrix for 2026 deployments

Scenario	Friction tolerance	Validation budget	Recommended method
Qwen2.5/Qwen3 or Llama 3.1, target ≤128K, existing vLLM stack	Low	Medium	YaRN
Any model, target ≤32K, no fine-tuning resources, need baseline fast	Minimal	Low	Dynamic NTK scaling
Research or enterprise, target >256K, team can run search + fine-tuning	High	High	LongRoPE
Production, target 128K–256K, short-context regression is critical risk	Medium	High	YaRN + fine-tuning + RULER/NIAH validation
Exploration, evaluating whether a new model has extension headroom	Minimal	Low	Dynamic NTK scaling first, then YaRN
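The matrix can be read as a small decision function. The sketch below is one possible encoding, not a definitive rule: the cut-offs come from the table, while the handling of combinations the table does not cover is an assumption.

```python
K = 1024

def recommend_method(target_tokens, can_finetune, can_run_search):
    """Map a target length and team capacity to a row of the decision matrix."""
    if target_tokens > 256 * K:
        if can_finetune and can_run_search:
            return "LongRoPE"
        return "outside the matrix: LongRoPE needs search and fine-tuning capacity"
    if target_tokens > 128 * K:
        if can_finetune:
            return "YaRN + fine-tuning + RULER/NIAH validation"
        return "outside the matrix: this range assumes fine-tuning"
    if target_tokens <= 32 * K and not can_finetune:
        return "Dynamic NTK scaling"
    return "YaRN"

print(recommend_method(128 * K, can_finetune=True, can_run_search=False))
print(recommend_method(16 * K, can_finetune=False, can_run_search=False))
print(recommend_method(1024 * K, can_finetune=True, can_run_search=True))
```

The function only captures target length and team capacity; serving stack and validation budget still need to be weighed by hand.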

Choose YaRN when you want practical deployment with moderate risk

YaRN is the correct default for teams already running vLLM 0.6. The vLLM documentation explicitly shows a Qwen model extended via rope_parameters and exercised with a chat example — this is a tested, documented path, not a workaround. HF Transformers includes yarn as a native mode, so teams on the Transformers stack face no additional dependency.

Team profile	Action
Running vLLM 0.6, targeting Qwen2.5 at 64K or 128K	Set `rope_parameters`, run NIAH + RULER, ship if recall holds
Running HF Transformers, Llama 3.1, targeting 128K	Set `rope_scaling: {type: yarn, factor: 4}`, validate, optionally fine-tune
Need 128K+ with minimal short-context regression	YaRN + targeted fine-tuning at extended length

Watch Out: vLLM's CLI surface for --rope-scaling has been version-sensitive — a November 2025 issue confirmed that vLLM 0.11.1 did not recognize the flag as expected. Pin your vLLM version and verify the flag behavior before automated deployment.
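A deployment script can enforce the pin before it launches anything. The sketch below uses only the standard library; the pinned version string in the commented usage is a placeholder, not a recommendation.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def assert_pinned(package, pinned):
    """Raise if the installed version differs from the version you validated."""
    found = installed_version(package)
    if found != pinned:
        raise RuntimeError(f"{package} is {found!r}, expected pinned {pinned!r}")

# Hypothetical usage in a deployment script:
# assert_pinned("vllm", "0.6.3")
print(installed_version("vllm"))
```

Failing fast on a version mismatch is cheaper than debugging a flag that the installed release no longer recognizes.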

Choose LongRoPE when long-context quality justifies extra engineering

LongRoPE is appropriate when your application requires context windows materially beyond 256K, you have a model already fine-tuned to 128K or 256K, and your team can operate a search-and-evaluation pipeline. The public repository ships only the evolution-search component — the full internal pipeline is not publicly available. That constraint means LongRoPE is a research-grade tool that requires engineering investment to operationalize, not a drop-in serving feature.

Requirement	LongRoPE fit
Target context > 256K	Correct choice if quality matters
Starting from 4K pretrained model	Not directly applicable — requires intermediate 128K/256K fine-tuning
Need vLLM serving path today	Not documented; treat as research workflow
RULER and perplexity validation capacity	Required — budget for multi-length eval

Pro Tip: If your team is considering LongRoPE for Llama 3.1, start by replicating the evolution search on a 7B or 8B checkpoint before committing to a 70B+ run. The CUDA 11.7 / Python 3.10 / flash-attn dependency stack in the quick start may conflict with your existing infrastructure — resolve that first, separately from the scientific question of whether the method fits your recall requirements.

Choose dynamic NTK scaling when you need the simplest baseline

Dynamic NTK scaling belongs in the experiment phase, not in production as a final choice. Its value is exclusively in iteration speed: one config change, no weight updates, result in hours.

Use case	Appropriate?
Baseline experiment on new checkpoint	Yes
Production at ≤32K with low recall requirements	Possibly — validate first
Production at >64K	No — use YaRN or LongRoPE
Fine-tuning budget is zero and recall is critical	No — dynamic NTK is not a substitute for fine-tuning

Pro Tip: Set rope_scaling: {type: dynamic, factor: N} in your HF config and run NIAH at 1.5× and 2× your base training context before any other experiment. This costs nothing and tells you immediately whether the model has any extrapolation capacity — which determines whether fine-tuning-based methods will add value or whether you need to choose a different base model.
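The probe plan is simple enough to script. The sketch below mirrors the `type`/`factor` keys used above; newer Transformers releases may expect `rope_type` instead, so treat the key name as an assumption to check against your installed version.

```python
def dynamic_probe_plan(base_context, factor):
    """Return a rope_scaling dict and the NIAH lengths to probe first."""
    rope_scaling = {"type": "dynamic", "factor": float(factor)}
    # Probe at 1.5x and 2x the base training context.
    probe_lengths = [int(base_context * multiple) for multiple in (1.5, 2.0)]
    return rope_scaling, probe_lengths

print(dynamic_probe_plan(8192, 2))
```

For an 8K base context this probes 12,288 and 16,384 tokens, both within reach of a factor of 2.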

Questions practitioners ask before shipping extended context

Bottom Line: For Llama 3.1 and Qwen2.5/Qwen3 teams in 2026, start with YaRN on vLLM 0.6 if your target is ≤128K and you have moderate validation resources. Use dynamic NTK only as a first-pass experiment. Commit to LongRoPE only when your use case genuinely requires >256K context and your team can run the full search-and-progressive-extension workflow. In all three cases, short-context regression validation is mandatory, not optional.

Does extending context length hurt short-context performance?

Yes — empirically and consistently. The controlled study on long context extension states directly: "Context extension hurts in the short term and gains in the long term." This is not method-specific — it applies to YaRN, LongRoPE, and dynamic NTK scaling. LongRoPE's paper claims to maintain "performance at the original short context window" via its readjustment step, which is mechanistically plausible, but that claim applies to the paper's own evaluation distribution and model. Transferability to your model and task distribution requires your own validation.

Watch Out: Do not rely on long-context benchmark improvements as evidence that short-context quality is preserved. Run your short-context test suite — production task, MMLU, or equivalent — as a required gate before deploying any extended-context model.

Does LongRoPE work with vLLM?

Not via a documented first-class path. vLLM's context extension documentation demonstrates YaRN for Qwen models but does not include a LongRoPE serving example. The LongRoPE repository centers on the evolution-search and evaluation workflow, not on serving engine integration. A late-2025 vLLM issue showed that rope-scaling CLI support is version-sensitive even for supported methods, which adds uncertainty for any non-documented path. In vLLM 0.6, the documented extension path is YaRN, not LongRoPE.

Pro Tip: Serving a LongRoPE-extended checkpoint still requires the engine to apply the same searched per-dimension rescale factors at inference time, because RoPE rotations are computed on the fly rather than stored in the weights. Confirm that your vLLM version can load those factors before planning around it. Even so, most of the friction is not in inference-time serving; it is in the search and fine-tuning workflow required to produce that checkpoint in the first place.

What is the difference between YaRN and dynamic NTK scaling?

Both operate on RoPE positional embeddings, but they differ in how they rescale frequencies and what guarantees they offer.

Attribute	YaRN	Dynamic NTK scaling
Rescaling approach	Non-uniform ramp across head dimensions; preserves high-frequency dims	Uniform rescaling of base frequency θ
Attention scaling	Applies a heuristic temperature multiplier derived from the scale factor	None
Fine-tuning	Optional but beneficial; paper-validated	Not required (by design)
HF config type	`yarn`	`dynamic`
Designed to improve upon	NTK-by-parts and Dynamic NTK	NTK-aware scaling
Short-context mitigation	Partial (attention scaling)	None

HF Transformers lists them as separate built-in modes. The YaRN paper was explicitly motivated by Dynamic NTK's limitations — YaRN's non-uniform ramp addresses the frequency dimensions where Dynamic NTK's uniform rescaling causes the most perplexity damage at long positions. In practice, YaRN adds parameterization complexity and a validation step that Dynamic NTK avoids, but the recall quality improvement at 4×–8× extension is the reason YaRN has become the deployment default.

Sources and reference implementations

Method	Primary repo/docs	Paper
LongRoPE	microsoft/LongRoPE repository — evolution search + eval scripts	arXiv 2402.13753
YaRN	YaRN paper (no separate repository cited here)	arXiv 2309.00071
Dynamic NTK scaling	HF Transformers RoPE utilities	Community-originated; no canonical paper
vLLM context extension	vLLM context extension documentation	N/A
Evaluation suite	rope_long_context_evaluation_suite on GitHub	N/A

Sources & References

microsoft/LongRoPE (GitHub) — Official repository; ships evolution-search component and evaluation scripts for 2048K context extension
LongRoPE: Extending LLM Context Windows Beyond 2 Million Tokens (arXiv 2402.13753) — Primary paper; reports progressive extension to 2048K with up to 1K fine-tuning steps
YaRN: Efficient Context Window Extension of Large Language Models (arXiv 2309.00071) — Primary YaRN paper; includes perplexity comparisons against Dynamic NTK and NTK-by-parts
Hugging Face Transformers RoPE Utilities — Documents dynamic and yarn as built-in RoPE scaling modes
vLLM Context Extension Documentation — Demonstrates YaRN-based Qwen context extension via rope_parameters
vLLM Issue #29026 — November 2025 report of --rope-scaling CLI breakage in vLLM 0.11.1; illustrates version-sensitivity
A Controlled Study on Long Context Extension (arXiv 2409.12181) — Empirical study showing short-context regression risk across extension methods
rope_long_context_evaluation_suite (GitHub) — Evaluation suite covering NIAH, RULER, and LongBench for long-context recall validation
