What LoRA changes in a frozen transformer layer
LoRA solves a specific problem: how to specialize a large pretrained model without touching most of its weights. Instead of updating every parameter in every linear layer, LoRA inserts a parallel adapter path alongside the frozen base weight and trains only that path. The base model's weights are fixed throughout training; gradients flow only through the adapter.
As the Hugging Face PEFT documentation states directly: "LoRA is low-rank decomposition method to reduce the number of trainable parameters which speeds up finetuning large models and uses less memory." That framing is precise — the mechanism is decomposition, and the payoff is reduced parameter and memory footprint.
Bottom Line: LoRA does not modify the pretrained weight matrix. It adds a learned low-rank update alongside it. During training, only the adapter parameters accumulate gradients, which is what makes it tractable on constrained hardware.
How the low-rank update decomposes the base weight matrix
For a frozen base weight $W_0 \in \mathbb{R}^{d \times k}$, the LoRA adapter introduces two matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, where $r \ll \min(d, k)$. The effective weight used during the forward pass is:
$$W = W_0 + \Delta W = W_0 + BA$$
$A$ is initialized with a random Gaussian draw; $B$ is initialized to zero, so $\Delta W = 0$ at the start of training and the adapter begins as a no-op. Only $A$ and $B$ receive gradient updates; $W_0$ never changes.
The scaling factor applied to $\Delta W$ is $\frac{\alpha}{r}$ by default in PEFT, making the effective forward contribution:
$$W = W_0 + \frac{\alpha}{r} \cdot BA$$
This scaling is not incidental — it determines the magnitude of the adapter's influence on every layer's output.
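To make the mechanics concrete, here is a minimal PyTorch sketch of the adapter path. The LoRALinear class name and its initialization constants are illustrative assumptions, not PEFT's internal implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative, not PEFT's internal class)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze W_0; gradients flow only through A and B
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init (scale is an illustrative choice)
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at step 0
        self.scale = alpha / r                           # default LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```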
Why freezing the base model changes the optimization problem
Freezing $W_0$ eliminates gradients for the bulk of the model. In a standard full fine-tune of a 7B-parameter model with Adam, the optimizer must store first and second moment estimates for each parameter. With LoRA, optimizer state exists only for $A$ and $B$ across the targeted layers.
Pro Tip: Optimizer state is often the binding VRAM constraint, not the model weights themselves. A rank-16 LoRA adapter on all linear layers of a 7B model typically adds on the order of 40 million trainable parameters — less than 1% of the base model — cutting Adam's moment tensors from tens of gigabytes to hundreds of megabytes.
The optimization problem also changes in character. You are no longer finding a point in the full weight space; you are searching a much lower-dimensional manifold defined by the product $BA$. This is easier to navigate with limited data — which is why LoRA generalizes well on small task-specific datasets — but it also means the adapter cannot represent arbitrary weight changes. The expressible update space is hard-constrained by rank.
What rank r actually limits in practice
Rank $r$ caps the dimensionality of the subspace that $\Delta W$ can span. Because $\Delta W = BA$ where $B$ and $A$ have inner dimension $r$, the matrix $\Delta W$ has rank at most $r$ — every column of the update lies in the $r$-dimensional subspace of $\mathbb{R}^d$ spanned by the columns of $B$, no matter how large the full $d \times k$ weight space is.
The trainable parameter count for a single LoRA-adapted linear layer is:
$$P_{\text{adapter}} = r \cdot k + d \cdot r = r(d + k)$$
For a layer with $d = k = 4096$ (typical in a 7B-class model) and $r = 16$, that is $16 \times 8192 = 131{,}072$ parameters versus the original $4096 \times 4096 = 16{,}777{,}216$ — a 128× reduction.
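As a quick sanity check, the arithmetic generalizes to a one-line helper; lora_param_count is a hypothetical name, not a library function.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted linear layer: r*k (for A) + d*r (for B)."""
    return r * (d + k)

base_params = 4096 * 4096                           # 16,777,216
adapter_params = lora_param_count(4096, 4096, 16)   # 131,072
print(base_params // adapter_params)                # 128
```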
Higher rank directly increases expressivity: a rank-64 adapter can represent more diverse updates than a rank-8 adapter. But as the rsLoRA paper demonstrates, increasing rank under the original scaling rule does not monotonically improve outcomes. The scaling factor $\alpha/r$ shrinks as $r$ grows, progressively damping the adapter's gradient signal.
Why alpha and rank interact instead of acting independently
The interaction between lora_alpha ($\alpha$) and rank $r$ is one of the most frequently misunderstood aspects of LoRA configuration. They are not independent knobs. Their ratio — $\alpha/r$ under the default rule, or $\alpha/\sqrt{r}$ under rsLoRA — sets the effective learning rate multiplier for the entire adapter path.
$$\text{effective scale} = \frac{\alpha}{r} \quad \text{(default LoRA)}$$
Doubling $r$ while keeping $\alpha$ fixed halves the effective scale. The adapter becomes less influential per gradient step, even though it has more capacity.
Watch Out: Setting $\alpha$ and $r$ independently without checking their ratio is a common source of silent training failure. A run with $r = 64$ and $\alpha = 16$ applies a scale of $0.25$ to every adapter update — four times smaller than a run with $r = 16, \alpha = 16$. The model may appear to train (loss decreases) but converges far more slowly, and the effect is easy to attribute wrongly to rank saturation or data problems.
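A small helper makes this check trivial before launching a run; effective_scale below is a hypothetical utility, not a PEFT API.

```python
import math

def effective_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """alpha/r under the default rule, alpha/sqrt(r) under rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

print(effective_scale(16, 16))                   # 1.0
print(effective_scale(16, 64))                   # 0.25 -- silently 4x weaker at the same alpha
print(effective_scale(16, 64, use_rslora=True))  # 2.0
```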
The original alpha-over-r scaling rule
The canonical LoRA scaling as implemented in PEFT is:
$$s = \frac{\alpha}{r}$$
PEFT's package reference states this explicitly: "Otherwise, it will use the original default value of lora_alpha/r." The original LoRA paper fixed $\alpha$ equal to the first rank tried and kept it constant thereafter, treating $\alpha/r$ as a way to avoid retuning the learning rate every time rank changes. In practice, this means the effective step magnitude the adapter applies to each layer is $(\alpha/r) \cdot \|BA\|$ — a quantity that shrinks whenever you increase rank.
This rule works well at low ranks (4–16) because the scale stays within a reasonable range and gradient magnitudes remain healthy. The problem surfaces when practitioners push rank higher hoping for more expressivity.
Why larger ranks can become harder to train
As rank increases under the original rule, the effective scale $\alpha/r$ falls, shrinking the contribution of each adapter gradient step. The rsLoRA paper characterizes this precisely: "Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters."
The failure mode is optimization dynamics, not representational capacity. A rank-128 adapter has far more expressive power than a rank-16 adapter, but if the effective scale is $\alpha/128$ versus $\alpha/16$, the higher-rank adapter receives much smaller gradients per step. You may observe a loss curve that decreases but plateaus early, with the gap between training and validation loss not improving despite more parameters — a symptom many practitioners incorrectly attribute to overfitting or insufficient data.
Pro Tip: When you increase rank and training slows down or plateaus earlier than the lower-rank baseline, check the effective scale before changing anything else. The problem is likely $\alpha/r$ shrinkage, not an expressivity ceiling. Either scale $\alpha$ proportionally or switch to rsLoRA.
How rsLoRA changes the scaling story
rsLoRA (rank-stabilized LoRA) replaces the $r$ in the scaling denominator with $\sqrt{r}$:
$$s_{\text{rsLoRA}} = \frac{\alpha}{\sqrt{r}}$$
PEFT exposes this through the use_rslora=True flag, which switches the internal scaling from lora_alpha/r to lora_alpha/math.sqrt(r). The effect is that as rank doubles, the effective scale decreases by a factor of $\sqrt{2} \approx 1.41$ rather than $2$.
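A minimal LoraConfig sketch with rsLoRA enabled; the rank and alpha values are examples, not recommendations.

```python
from peft import LoraConfig

config = LoraConfig(
    r=64,
    lora_alpha=16,
    use_rslora=True,              # scale becomes lora_alpha / math.sqrt(r) instead of lora_alpha / r
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```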
The rsLoRA paper frames the practical consequence clearly: "Modifying LoRA with the appropriate scaling factor, which is called the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off." The method allows larger ranks to be used meaningfully without the gradient signal collapsing.
Critically, rsLoRA changes only training dynamics. At inference, the merged weight $W_0 + \frac{\alpha}{\sqrt{r}} \cdot BA$ is identical in form to the standard merged weight — there is no inference-time cost increase.
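For inference, the adapter can be folded into the base weights with PEFT's standard merge call; a one-line sketch, assuming peft_model is a trained PeftModel.

```python
# Fold W_0 + s*BA into a single weight; works the same with or without rsLoRA
merged = peft_model.merge_and_unload()
```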
Bottom Line: rsLoRA is not a different architecture. It is a rescaled version of the same LoRA update that keeps gradient magnitude stable as rank grows. If you are using rank ≥ 32 and the original scaling is not giving you expected training progress, rsLoRA is the first lever to pull — before adding more data or changing target modules.
When rsLoRA is the right default
rsLoRA becomes the right default when the adapter rank is high enough that the original $\alpha/r$ denominator materially damps gradient flow. In practice, that means ranks of 32 and above are strong candidates. The rsLoRA paper explicitly supports using "larger ranks to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost" — a direct argument for rsLoRA in any scenario where you want to push expressivity without paying an inference penalty.
If your task involves a narrow but well-formed dataset and you are keeping rank below 16, the original scaling is typically fine. rsLoRA's benefit grows with rank, so enabling it at rank 8 provides minimal advantage.
Watch Out: rsLoRA stabilizes the scaling math, but it will not rescue a fundamentally flawed setup. If your training data is noisy, your target modules miss the relevant computation paths, or your learning rate is miscalibrated, rsLoRA will produce a faster-training but still-wrong adapter. Scaling fixes are orthogonal to data quality and module selection.
How DoRA reparameterizes updates into magnitude and direction
DoRA (Weight-Decomposed Low-Rank Adaptation) changes what the adapter learns, not just how it scales. Instead of learning a single low-rank correction $\Delta W$, DoRA first decomposes the pretrained weight $W_0$ into a magnitude component and a directional component, then applies LoRA exclusively to the directional update.
The decomposition is:
$$W = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$
where $m \in \mathbb{R}^{1 \times k}$ is a learnable per-column magnitude vector, $\|\cdot\|_c$ denotes the column-wise vector norm, and $BA$ is the standard LoRA update applied to the direction. The DoRA paper states this precisely: "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters."
The magnitude vector $m$ is a small number of additional scalar parameters (one per output column), so the parameter overhead above standard LoRA is minimal. The structural change is that magnitude and direction can move independently — the adapter can rotate the weight direction while holding its scale constant, or scale it while preserving direction.
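In PEFT, DoRA is a single flag on the same config object; a minimal sketch, with module names that assume a Llama-style architecture.

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,               # learn magnitude m and direction separately; LoRA on the direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names, vary by architecture
    task_type="CAUSAL_LM",
)
```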
Why DoRA can outperform plain LoRA at the same rank
Plain LoRA's low-rank update $\Delta W = BA$ modifies magnitude and direction simultaneously, with the low-rank structure constraining both. When the target task requires a rotation in weight space that the low-rank factorization cannot approximate well, plain LoRA hits an expressivity ceiling before its nominal rank capacity is exhausted.
DoRA sidesteps this by separating the two degrees of freedom. The directional component is handled through the LoRA path (still low-rank), while the magnitude component has its own unconstrained per-column parameters. The DoRA paper positions this as a method to "bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits." NVIDIA's developer blog independently presents DoRA as narrowing this gap in empirical evaluations.
The benefit is real but task-dependent. DoRA is not universally superior to LoRA at the same rank — on tasks where the required weight change is already well-approximated by a low-rank matrix (small distributional shift, narrow domain), plain LoRA at the same rank may converge equivalently while being slightly simpler to configure.
Pro Tip: DoRA tends to help most in two situations: when you need to match full fine-tuning quality on a task that plain LoRA consistently underperforms, and when you want to get more out of a fixed rank budget rather than increasing rank (which costs more memory). On creative or highly stylized text generation tasks where weight-space geometry shifts significantly from the base model, the magnitude/direction separation frequently produces measurable improvements.
Where LoftQ fits when the base model is quantized to 4-bit
LoftQ (LoRA-Fine-Tuning-aware Quantization) addresses a specific initialization problem that arises when you quantize the base model before fine-tuning. When a model is quantized to 4-bit, the quantized weights $\hat{W}$ differ from the full-precision weights $W_0$ by a quantization error $E = W_0 - \hat{W}$. Standard LoRA initialization (where $B = 0$ so $\Delta W = 0$) starts training from a point where the adapter contributes nothing, meaning the initial forward pass runs through a degraded weight matrix with no compensating correction. LoftQ solves this by jointly finding a quantized weight $\hat{W}$ and an initial $BA$ such that $\hat{W} + BA$ approximates $W_0$ as closely as possible.
The LoftQ paper describes the framework as "designed specifically for pre-trained models that require quantization and LoRA fine-tuning," framing quantization-aware initialization as the core contribution.
In PEFT, this is configured through LoftQConfig. PEFT's quantization documentation states: "To get started, create a LoftQConfig and set loftq_bits=4 for 4-bit quantization." For 4-bit workflows, PEFT also recommends target_modules="all-linear" — applying LoRA to all linear layers — to maximize the coverage of the quantization-error correction.
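A minimal configuration sketch following that guidance; base_model and the hyperparameter values are placeholders.

```python
from peft import LoftQConfig, LoraConfig, get_peft_model

loftq_config = LoftQConfig(loftq_bits=4)         # 4-bit quantization-aware initialization
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model: full-precision model, on GPU
```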
Why quantization error changes the initialization story
Without LoftQ, the adapter starts at zero contribution and the base model effectively operates with corrupted weights for the first steps of training. The optimizer must first push the adapter to compensate for quantization error before it can make task-specific progress — wasted capacity and wasted steps.
LoftQ's initialization pre-loads that compensation, giving the adapter a head start and allowing task-specific learning to begin from a better-conditioned starting point.
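The idea behind that initialization can be sketched as alternating optimization; this is a conceptual illustration of the LoftQ method as described in the paper, not PEFT's implementation, and quantize stands in for a 4-bit quantize-dequantize routine.

```python
import torch

def loftq_style_init(W: torch.Tensor, r: int, quantize, num_iters: int = 5):
    """Conceptual sketch of LoftQ's alternating optimization (not PEFT's implementation)."""
    d, k = W.shape
    B = torch.zeros(d, r)
    A = torch.zeros(r, k)
    for _ in range(num_iters):
        Q = quantize(W - B @ A)                       # quantize the part the adapter can't carry
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        B = U[:, :r] * S[:r]                          # refit BA to the quantization error
        A = Vh[:r, :]
    return Q, B, A                                    # Q + BA approximates W at initialization
```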
Watch Out: LoftQ and QLoRA are related but distinct. QLoRA is a training workflow that loads the base model in 4-bit (using BitsAndBytes) and fine-tunes with LoRA on top. LoftQ is an initialization method that reduces the quantization-LoRA mismatch at the start of training. You can use both together, but PEFT's documentation is explicit: "LoftQ initialization does not require quantizing the base model with the load_in_4bits parameter in the from_pretrained method!" — the initialization path does not depend on BitsAndBytes' load_in_4bits. Confusing the two leads to incorrect setup assumptions and potential double-quantization errors.
LoftQ initialization also requires GPU execution — it cannot be treated as a preprocessing step on CPU, per PEFT's quantization documentation.
Which modules to target and why adapter placement matters
Adapter placement directly controls which computations LoRA can modify. In a standard transformer, the main linear projections are the attention query/key/value and output projections ($W_q, W_k, W_v, W_o$) and the MLP feed-forward layers ($W_{\text{up}}, W_{\text{down}}, W_{\text{gate}}$ in Llama-style architectures). Targeting only a subset restricts which aspects of the model's learned representations the adapter can shift.
PEFT's documentation recommends: "To apply LoRA to all the linear layers, like in QLoRA, set target_modules=\"all-linear\" (easier than specifying individual modules by name which can vary depending on the architecture)." The convenience is real — architecture-specific module names vary across Llama 3.1, Mistral, Qwen, and Phi families, making "all-linear" a durable default.
| Target strategy | Modules adapted | Trainable params | Task fit |
|---|---|---|---|
| Attention-only | $W_q, W_k, W_v, W_o$ | Lowest | Instruction following, tone shifts |
| MLP-only | $W_{\text{up}}, W_{\text{down}}, W_{\text{gate}}$ | Moderate | Knowledge injection, factual recall |
| All-linear | All of the above | Highest | Broad domain adaptation, 4-bit QLoRA setups |
Attention-only targeting is sufficient for tasks that primarily change how the model routes and weighs information — style transfer, following a new instruction format. MLP targeting is more relevant when you need the model to recall or apply new factual associations. All-linear targeting maximizes coverage and is the default recommendation when quantization is involved.
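The three strategies in the table map onto LoraConfig as follows; the module names assume a Llama-style architecture and differ on other families.

```python
from peft import LoraConfig

# Attention-only: routing/weighting changes (Llama-style module names)
attn_only = LoraConfig(r=16, lora_alpha=32,
                       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

# MLP-only: knowledge injection
mlp_only = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["gate_proj", "up_proj", "down_proj"])

# All linear layers: the durable default, especially for 4-bit QLoRA setups
all_linear = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
```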
Why all-linear targeting matters for 4-bit fine-tunes
When the base model is quantized to 4-bit with BitsAndBytes, every linear layer carries quantization error. Restricting LoRA to attention layers only means the MLP layers' quantization errors remain permanently uncompensated during fine-tuning. For tasks where MLP pathways are critical — which covers most knowledge-intensive domains — this leaves a systematic performance deficit.
Production Note: PEFT's explicit recommendation of target_modules="all-linear" for QLoRA-style setups reflects this coverage argument. The added trainable parameters from including MLP layers are modest relative to the total adapter count (adapter parameters remain a small fraction of the base model), but the coverage benefit is substantial for tasks with significant domain shift. On a Llama 3.1 8B Instruct model with rank 16, all-linear targeting adds roughly 40 million trainable parameters — manageable on a 24GB GPU with a 4-bit base model.
When LoRA loses to full fine-tuning or other PEFT methods
LoRA's low-rank constraint is a feature in data-scarce, VRAM-constrained settings and a genuine limitation when the task requires broad plasticity across the model's weight space. The DoRA paper frames this gap directly: DoRA exists to "bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits" — an implicit acknowledgment that the gap is real and matters on certain tasks.
| Condition | LoRA | rsLoRA | DoRA | Full fine-tuning |
|---|---|---|---|---|
| Small dataset (< 5K examples), narrow domain | Strong | Marginal gain | Moderate gain | Overfits without careful regularization |
| Moderate dataset, significant domain shift | Adequate at rank ≥ 32 | Better at rank ≥ 32 | Best PEFT option | Strong if data supports it |
| Large dataset, broad capability shift | Underperforms | Underperforms | Partially closes gap | Preferred |
| 4-bit quantized base model | Workable with all-linear | Workable | Workable | Not feasible at 24 GB |
| Task requires new token vocabulary or embeddings | Weak (embeddings not adapted by default) | Weak | Weak | Required |
The rank-sensitivity regime deserves particular attention. On tasks that require the model to change a wide range of its internal representations — adapting from general-purpose text to a specialized schema language, or from English to a low-resource language — even high-rank LoRA may plateau below the full fine-tuning ceiling. The rsLoRA paper suggests that higher ranks can trade training compute for performance, but there is a practical ceiling at which the low-rank manifold simply cannot represent the necessary weight-space movement.
Common failure modes that look like a rank problem but are not
A flat or slowly improving loss curve is the most common symptom attributed to insufficient rank, but rank is usually the last thing to blame.
- Wrong target modules: If the task requires MLP-layer changes but only attention layers are targeted, increasing rank in attention projections will not help.
- Quantization mismatch: In 4-bit workflows without LoftQ, the adapter starts correcting quantization error before task-specific learning begins. The loss may appear to improve slowly because the early steps are spent recovering from initialization error.
- Learning rate mismatch: LoRA adapter parameters often train well at higher learning rates than the base model would tolerate (since the base is frozen). A learning rate tuned for full fine-tuning may be too conservative for the adapter path.
- Data quality: A noisy or mislabeled dataset produces the same slow-convergence signature as insufficient rank. rsLoRA and DoRA both amplify gradient signal — which amplifies noise as readily as signal.
Watch Out: Before increasing rank or switching to rsLoRA/DoRA, verify that your target modules actually cover the relevant computation paths, check that your effective scale ($\alpha/r$ or $\alpha/\sqrt{r}$) is in a reasonable range, and run a small data-quality audit. Changing rank without ruling out these causes wastes compute and delays diagnosis.
What an engineer should check before choosing rank and alpha
Rank and alpha selection is model-, dataset-, and task-specific. There is no universally optimal pair. The table below provides practical starting ranges as heuristics; treat them as informed starting points, not guarantees.
| Scenario | Starting rank | Starting alpha | rsLoRA | DoRA | Notes |
|---|---|---|---|---|---|
| Narrow domain, small dataset (< 10K) | 8–16 | 16–32 | No | Optional | Low rank sufficient; overfitting risk rises with rank |
| Moderate domain shift, medium dataset | 16–64 | 32–64 | Yes if rank ≥ 32 | If LoRA underperforms | All-linear targeting recommended |
| 4-bit base model (QLoRA-style) | 16–64 | 16–32 | Yes if rank ≥ 32 | Optional | Use all-linear; consider LoftQ init |
| Strong domain shift, large dataset | 64–128 | 64–128 | Yes | Consider | rsLoRA required at this rank range |
| Style/tone only, large base model | 4–16 | 8–16 | No | No | Attention-only targeting may suffice |
When using Hugging Face PEFT directly, LoraConfig exposes r, lora_alpha, use_rslora, and use_dora as top-level parameters; target_modules="all-linear" is the recommended default for QLoRA-style setups. TRL's SFTTrainer accepts a LoraConfig directly. Axolotl exposes lora_r, lora_alpha, and lora_target_modules in its YAML config, with use_rslora and use_dora as boolean flags. Unsloth wraps these into its FastLanguageModel.get_peft_model call with equivalent parameters, adding kernel-level optimizations on top.
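A minimal end-to-end PEFT sketch tying these pieces together; the model ID and hyperparameter values are examples, not prescriptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,
    lora_alpha=32,               # effective scale = 2 under the default alpha/r rule
    use_rslora=False,
    use_dora=False,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # verify the adapter is a small fraction of the base model
```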
Pro Tip: Start with rank 16 and alpha 32 (effective scale = 2) with rsLoRA disabled. Establish a baseline loss curve. If training appears sluggish relative to a lower-rank pilot, enable rsLoRA before increasing rank further. If the task ceiling is clearly below full fine-tuning quality even at rank 64 with rsLoRA, try DoRA at the same rank before committing to higher ranks or full fine-tuning.
How to read the loss curve for adapter capacity
A well-functioning adapter shows a smooth, monotonically decreasing training loss with a validation curve that tracks within a reasonable gap. Three specific patterns indicate rank or scaling problems:
Early plateau: Loss drops quickly in the first 10–20% of training then stalls. This often indicates the adapter has exhausted its rank capacity early. Increasing rank or switching to DoRA typically addresses it.
Sluggish descent: Loss decreases but far more slowly than a lower-rank baseline at the same learning rate. Per the rsLoRA paper, this is the signature of high-rank adapters under the original $\alpha/r$ scaling — the effective gradient contribution has shrunk to the point where learning is artificially slow. Enabling rsLoRA typically restores the expected descent rate.
Divergent validation: Training loss continues to decrease while validation loss stops improving or increases. This is not a rank problem — it is overfitting, usually from too-large a rank relative to dataset size, or from a high learning rate compounded by large effective scale.
Pro Tip: A plateau-then-slow-recovery pattern — where loss stalls for several hundred steps and then resumes gradual descent — often indicates the adapter is working around a quantization error initialization gap rather than a rank ceiling. If you observe this in a 4-bit workflow, LoftQ initialization may collapse the plateau by providing a better-conditioned starting point. DoRA can also reduce this effect through its magnitude reparameterization, since the magnitude component provides an additional degree of freedom for early-step correction.
FAQ on LoRA internals, rsLoRA, DoRA, and LoftQ
How does LoRA work under the hood?
LoRA inserts two matrices $A$ and $B$ in parallel with a frozen base weight. The forward pass computes $W_0 x + \frac{\alpha}{r} BAx$; gradients update only $A$ and $B$. The base weight is never modified during training.
What is the difference between LoRA and rsLoRA?
Only the scaling denominator changes: standard LoRA scales $\Delta W$ by $\alpha/r$; rsLoRA scales by $\alpha/\sqrt{r}$. rsLoRA stabilizes gradient magnitude as rank increases, preventing the undertraining that the original rule causes at high ranks.
What is DoRA in PEFT?
DoRA decomposes the pretrained weight into magnitude and direction, then uses LoRA specifically for directional updates. As the DoRA paper states: "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters." The practical benefit is better performance at the same rank, particularly when plain LoRA's single low-rank path cannot approximate the required weight-space geometry.
Is LoRA better than full fine-tuning?
Not in general. LoRA is better when VRAM is constrained, datasets are small, or adapter reuse across deployments matters. Full fine-tuning is better when the task requires broad representational change and sufficient data and compute are available. DoRA partially closes the gap; it does not close it entirely.
What is LoftQ used for in LoRA?
LoftQ addresses quantization-aware initialization. When the base model is quantized to 4-bit, standard LoRA initializes with zero adapter contribution while the base model already carries quantization error. LoftQ jointly optimizes the quantized weight and the initial $BA$ product to approximate the original float precision weight, giving fine-tuning a better starting point. It is distinct from the QLoRA training workflow and from BitsAndBytes' load_in_4bits loading mechanism — using one does not imply or require the other.
Bottom Line: rsLoRA is a one-line config change (use_rslora=True) that pays off at rank ≥ 32. DoRA is worth testing when LoRA quality is consistently below target and full fine-tuning is not an option. LoftQ is the right initialization for any 4-bit fine-tuning workflow where starting-point quality matters.
Sources and references
- Hugging Face PEFT LoRA Developer Guide — Canonical implementation documentation for LoRA in PEFT, including rsLoRA and DoRA flags, target-module configuration, and get_peft_model usage
- Hugging Face PEFT LoRA Package Reference — API-level documentation specifying the lora_alpha/r default and the lora_alpha/math.sqrt(r) rsLoRA scaling
- Hugging Face PEFT v0.13.0 LoRA Developer Guide (mirrored) — Versioned snapshot confirming the target_modules="all-linear" recommendation for QLoRA-style setups
- Hugging Face PEFT Quantization Guide (v0.8.0) — LoftQ configuration documentation, including LoftQConfig, loftq_bits=4, and the GPU-only initialization requirement
- rsLoRA: A Rank Stabilization Scaling Factor for Fine-Tuning of Large Language Models — arXiv 2312.03732 — Original rsLoRA paper by Kalajdzievski; source for the claim that $\alpha/r$ scaling stunts learning at high ranks
- DoRA: Weight-Decomposed Low-Rank Adaptation — arXiv 2402.09353 — Original DoRA paper; source for magnitude/direction decomposition and the performance-gap-to-full-fine-tuning framing
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models — arXiv 2310.08659 — Original LoftQ paper; source for quantization-aware initialization methodology and the distinction from plain QLoRA
- NVIDIA Developer Blog: Introducing DoRA — NVIDIA's independent technical overview of DoRA and its performance characteristics relative to LoRA


