What LoRA changes in a frozen transformer layer
LoRA solves a specific problem: how to specialize a large pretrained model without touching most of its weights. Instead of updating every parameter in every linear layer, LoRA inserts a parallel adapter path alongside the frozen base weight and trains only that path. The base model's weights are fixed throughout training; gradients flow only through the adapter.
As the Hugging Face PEFT documentation states directly: "LoRA is low-rank decomposition method to reduce the number of trainable parameters which speeds up finetuning large models and uses less memory." That framing is precise — the mechanism is decomposition, and the payoff is reduced parameter and memory footprint.
Bottom Line: LoRA does not modify the pretrained weight matrix. It adds a learned low-rank update alongside it. During training, only the adapter parameters accumulate gradients, which is what makes it tractable on constrained hardware.
How the low-rank update decomposes the base weight matrix
For a frozen base weight $W_0 \in \mathbb{R}^{d \times k}$, the LoRA adapter introduces two matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, where $r \ll \min(d, k)$. The effective weight used during the forward pass is:
$$W = W_0 + \Delta W = W_0 + BA$$
$A$ is initialized with a random Gaussian draw; $B$ is initialized to zero, so $\Delta W = 0$ at the start of training and the adapter begins as a no-op. Only $A$ and $B$ receive gradient updates; $W_0$ never changes.
The scaling factor applied to $\Delta W$ is $\frac{\alpha}{r}$ by default in PEFT, making the effective forward contribution:
$$W = W_0 + \frac{\alpha}{r} \cdot BA$$
This scaling is not incidental — it determines the magnitude of the adapter's influence on every layer's output.
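To make the mechanics concrete, here is a minimal PyTorch sketch of the adapter path. The LoRALinear class name and its initialization constants are illustrative assumptions, not PEFT's internal implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative, not PEFT's internal class)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze W_0; gradients flow only through A and B
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init (scale is an illustrative choice)
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at step 0
        self.scale = alpha / r                           # default LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```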
Why freezing the base model changes the optimization problem
Freezing $W_0$ eliminates gradients for the bulk of the model. In a standard full fine-tune of a 7B-parameter model with Adam, the optimizer must store first and second moment estimates for each parameter. With LoRA, optimizer state exists only for $A$ and $B$ across the targeted layers.
Pro Tip: Optimizer state is often the binding VRAM constraint, not the model weights themselves. A rank-16 LoRA adapter on all linear layers of a 7B model typically adds on the order of 40 million trainable parameters — less than 1% of the base model — cutting Adam's moment tensors from tens of gigabytes to hundreds of megabytes.
The optimization problem also changes in character. You are no longer finding a point in the full weight space; you are searching a much lower-dimensional manifold defined by the product $BA$. This is easier to navigate with limited data — which is why LoRA generalizes well on small task-specific datasets — but it also means the adapter cannot represent arbitrary weight changes. The expressible update space is hard-constrained by rank.
What rank r actually limits in practice
Rank $r$ caps the dimensionality of the subspace that $\Delta W$ can span. Because $\Delta W = BA$ where $B$ and $A$ have inner dimension $r$, the matrix $\Delta W$ has rank at most $r$ — every column of the update lies in the $r$-dimensional subspace of $\mathbb{R}^d$ spanned by the columns of $B$, no matter how large the full $d \times k$ weight space is.
The trainable parameter count for a single LoRA-adapted linear layer is:
$$P_{\text{adapter}} = r \cdot k + d \cdot r = r(d + k)$$
For a layer with $d = k = 4096$ (typical in a 7B-class model) and $r = 16$, that is $16 \times 8192 = 131{,}072$ parameters versus the original $4096 \times 4096 = 16{,}777{,}216$ — a 128× reduction.
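As a quick sanity check, the arithmetic generalizes to a one-line helper; lora_param_count is a hypothetical name, not a library function.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted linear layer: r*k (for A) + d*r (for B)."""
    return r * (d + k)

base_params = 4096 * 4096                           # 16,777,216
adapter_params = lora_param_count(4096, 4096, 16)   # 131,072
print(base_params // adapter_params)                # 128
```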
Higher rank directly increases expressivity: a rank-64 adapter can represent more diverse updates than a rank-8 adapter. But as the rsLoRA paper demonstrates, increasing rank under the original scaling rule does not monotonically improve outcomes. The scaling factor $\alpha/r$ shrinks as $r$ grows, progressively damping the adapter's gradient signal.
Why alpha and rank interact instead of acting independently
The interaction between lora_alpha ($\alpha$) and rank $r$ is one of the most frequently misunderstood aspects of LoRA configuration. They are not independent knobs. Their ratio — $\alpha/r$ under the default rule, or $\alpha/\sqrt{r}$ under rsLoRA — sets the effective learning rate multiplier for the entire adapter path.
$$\text{effective scale} = \frac{\alpha}{r} \quad \text{(default LoRA)}$$
Doubling $r$ while keeping $\alpha$ fixed halves the effective scale. The adapter becomes less influential per gradient step, even though it has more capacity.
Watch Out: Setting $\alpha$ and $r$ independently without checking their ratio is a common source of silent training failure. A run with $r = 64$ and $\alpha = 16$ applies a scale of $0.25$ to every adapter update — four times smaller than a run with $r = 16, \alpha = 16$. The model may appear to train (loss decreases) but converges far more slowly, and the effect is easy to attribute wrongly to rank saturation or data problems.
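A small helper makes this check trivial before launching a run; effective_scale below is a hypothetical utility, not a PEFT API.

```python
import math

def effective_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """alpha/r under the default rule, alpha/sqrt(r) under rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

print(effective_scale(16, 16))                   # 1.0
print(effective_scale(16, 64))                   # 0.25 -- silently 4x weaker at the same alpha
print(effective_scale(16, 64, use_rslora=True))  # 2.0
```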
The original alpha-over-r scaling rule
The canonical LoRA scaling as implemented in PEFT is:
$$s = \frac{\alpha}{r}$$
PEFT's package reference states this explicitly: "Otherwise, it will use the original default value of lora_alpha/r." The original LoRA paper fixed $\alpha$ equal to the first rank tried and kept it constant thereafter, treating $\alpha/r$ as a way to avoid retuning the learning rate every time rank changes. In practice, this means the effective step magnitude the adapter applies to each layer is $(\alpha/r) \cdot \|BA\|$ — a quantity that shrinks whenever you increase rank.
This rule works well at low ranks (4–16) because the scale stays within a reasonable range and gradient magnitudes remain healthy. The problem surfaces when practitioners push rank higher hoping for more expressivity.
Why larger ranks can become harder to train
As rank increases under the original rule, the effective scale $\alpha/r$ falls, shrinking the contribution of each adapter gradient step. The rsLoRA paper characterizes this precisely: "Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters."
The failure mode is optimization dynamics, not representational capacity. A rank-128 adapter has far more expressive power than a rank-16 adapter, but if the effective scale is $\alpha/128$ versus $\alpha/16$, the higher-rank adapter receives much smaller gradients per step. You may observe a loss curve that decreases but plateaus early, with the gap between training and validation loss not improving despite more parameters — a symptom many practitioners incorrectly attribute to overfitting or insufficient data.
Pro Tip: When you increase rank and training slows down or plateaus earlier than the lower-rank baseline, check the effective scale before changing anything else. The problem is likely $\alpha/r$ shrinkage, not an expressivity ceiling. Either scale $\alpha$ proportionally or switch to rsLoRA.
How rsLoRA changes the scaling story
rsLoRA (rank-stabilized LoRA) replaces the $r$ in the scaling denominator with $\sqrt{r}$:
$$s_{\text{rsLoRA}} = \frac{\alpha}{\sqrt{r}}$$
PEFT exposes this through the use_rslora=True flag, which switches the internal scaling from lora_alpha/r to lora_alpha/math.sqrt(r). The effect is that as rank doubles, the effective scale decreases by a factor of $\sqrt{2} \approx 1.41$ rather than $2$.
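A minimal LoraConfig sketch with rsLoRA enabled; the rank and alpha values are examples, not recommendations.

```python
from peft import LoraConfig

config = LoraConfig(
    r=64,
    lora_alpha=16,
    use_rslora=True,              # scale becomes lora_alpha / math.sqrt(r) instead of lora_alpha / r
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```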
The rsLoRA paper frames the practical consequence clearly: "Modifying LoRA with the appropriate scaling factor, which is called the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off." The method allows larger ranks to be used meaningfully without the gradient signal collapsing.
Critically, rsLoRA changes only training dynamics. At inference, the merged weight $W_0 + \frac{\alpha}{\sqrt{r}} \cdot BA$ is identical in form to the standard merged weight — there is no inference-time cost increase.
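For inference, the adapter can be folded into the base weights with PEFT's standard merge call; a one-line sketch, assuming peft_model is a trained PeftModel.

```python
# Fold W_0 + s*BA into a single weight; works the same with or without rsLoRA
merged = peft_model.merge_and_unload()
```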
Bottom Line: rsLoRA is not a different architecture. It is a rescaled version of the same LoRA update that keeps gradient magnitude stable as rank grows. If you are using rank ≥ 32 and the original scaling is not giving you expected training progress, rsLoRA is the first lever to pull — before adding more data or changing target modules.
When rsLoRA is the right default
rsLoRA becomes the right default when the adapter rank is high enough that the original $\alpha/r$ denominator materially damps gradient flow. In practice, that means ranks of 32 and above are strong candidates. The rsLoRA paper explicitly supports using "larger ranks to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost" — a direct argument for rsLoRA in any scenario where you want to push expressivity without paying an inference penalty.
If your task involves a narrow but well-formed dataset and you are keeping rank below 16, the original scaling is typically fine. rsLoRA's benefit grows with rank, so enabling it at rank 8 provides minimal advantage.
Watch Out: rsLoRA stabilizes the scaling math, but it will not rescue a fundamentally flawed setup. If your training data is noisy, your target modules miss the relevant computation paths, or your learning rate is miscalibrated, rsLoRA will produce a faster-training but still-wrong adapter. Scaling fixes are orthogonal to data quality and module selection.
How DoRA reparameterizes updates into magnitude and direction
DoRA (Weight-Decomposed Low-Rank Adaptation) changes what the adapter learns, not just how it scales. Instead of learning a single low-rank correction $\Delta W$, DoRA first decomposes the pretrained weight $W_0$ into a magnitude component and a directional component, then applies LoRA exclusively to the directional update.
The decomposition is:
$$W = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$
where $m \in \mathbb{R}^{1 \times k}$ is a learnable per-column magnitude vector, $\|\cdot\|_c$ denotes the column-wise vector norm, and $BA$ is the standard LoRA update applied to the direction. The DoRA paper states this precisely: "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters."
The magnitude vector $m$ is a small number of additional scalar parameters (one per output column), so the parameter overhead above standard LoRA is minimal. The structural change is that magnitude and direction can move independently — the adapter can rotate the weight direction while holding its scale constant, or scale it while preserving direction.
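In PEFT, DoRA is a single flag on the same config object; a minimal sketch, with module names that assume a Llama-style architecture.

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,               # learn magnitude m and direction separately; LoRA on the direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names, vary by architecture
    task_type="CAUSAL_LM",
)
```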
Why DoRA can outperform plain LoRA at the same rank
Plain LoRA's low-rank update $\Delta W = BA$ modifies magnitude and direction simultaneously, with the low-rank structure constraining both. When the target task requires a rotation in weight space that the low-rank factorization cannot approximate well, plain LoRA hits an expressivity ceiling before its nominal rank capacity is exhausted.
DoRA sidesteps this by separating the two degrees of freedom. The directional component is handled through the LoRA path (still low-rank), while the magnitude component has its own unconstrained per-column parameters. The DoRA paper positions this as a method to "bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits." NVIDIA's developer blog independently presents DoRA as narrowing this gap in empirical evaluations.
The benefit is real but task-dependent. DoRA is not universally superior to LoRA at the same rank — on tasks where the required weight change is already well-approximated by a low-rank matrix (small distributional shift, narrow domain), plain LoRA at the same rank may converge equivalently while being slightly simpler to configure.
Pro Tip: DoRA tends to help most in two situations: when you need to match full fine-tuning quality on a task that plain LoRA consistently underperforms, and when you want to get more out of a fixed rank budget rather than increasing rank (which costs more memory). On creative or highly stylized text generation tasks where weight-space geometry shifts significantly from the base model, the magnitude/direction separation frequently produces measurable improvements.
Where LoftQ fits when the base model is quantized to 4-bit
LoftQ (LoRA-Fine-Tuning-aware Quantization) addresses a specific initialization problem that arises when you quantize the base model before fine-tuning. When a model is quantized to 4-bit, the quantized weights $\hat{W}$ differ from the full-precision weights $W_0$ by a quantization error $E = W_0 - \hat{W}$. Standard LoRA initialization (where $B = 0$ so $\Delta W = 0$) starts training from a point where the adapter contributes nothing, meaning the initial forward pass runs through a degraded weight matrix with no compensating correction. LoftQ solves this by jointly finding a quantized weight $\hat{W}$ and an initial $BA$ such that $\hat{W} + BA$ approximates $W_0$ as closely as possible.
The LoftQ paper describes the framework as "designed specifically for pre-trained models that require quantization and LoRA fine-tuning," framing quantization-aware initialization as the core contribution.
In PEFT, this is configured through LoftQConfig. PEFT's quantization documentation states: "To get started, create a LoftQConfig and set loftq_bits=4 for 4-bit quantization." For 4-bit workflows, PEFT also recommends target_modules="all-linear" — applying LoRA to all linear layers — to maximize the coverage of the quantization-error correction.
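A minimal configuration sketch following that guidance; base_model and the hyperparameter values are placeholders.

```python
from peft import LoftQConfig, LoraConfig, get_peft_model

loftq_config = LoftQConfig(loftq_bits=4)         # 4-bit quantization-aware initialization
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model: full-precision model, on GPU
```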
Why quantization error changes the initialization story
Without LoftQ, the adapter starts at zero contribution and the base model effectively operates with corrupted weights for the first steps of training. The optimizer must first push the adapter to compensate for quantization error before it can make task-specific progress — wasted capacity and wasted steps.
LoftQ's initialization pre-loads that compensation, giving the adapter a head start and allowing task-specific learning to begin from a better-conditioned starting point.
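The idea behind that initialization can be sketched as alternating optimization; this is a conceptual illustration of the LoftQ method as described in the paper, not PEFT's implementation, and quantize stands in for a 4-bit quantize-dequantize routine.

```python
import torch

def loftq_style_init(W: torch.Tensor, r: int, quantize, num_iters: int = 5):
    """Conceptual sketch of LoftQ's alternating optimization (not PEFT's implementation)."""
    d, k = W.shape
    B = torch.zeros(d, r)
    A = torch.zeros(r, k)
    for _ in range(num_iters):
        Q = quantize(W - B @ A)                       # quantize the part the adapter can't carry
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        B = U[:, :r] * S[:r]                          # refit BA to the quantization error
        A = Vh[:r, :]
    return Q, B, A                                    # Q + BA approximates W at initialization
```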
Watch Out: LoftQ and QLoRA are related but distinct. QLoRA is a training workflow that loads the base model in 4-bit (using BitsAndBytes) and fine-tunes with LoRA on top. LoftQ is an initialization method that reduces the quantization-LoRA mismatch at the start of training. You can use both together, but PEFT's documentation is explicit: "LoftQ initialization does not require quantizing the base model with the load_in_4bits parameter in the from_pretrained method!" — the initialization path does not depend on BitsAndBytes' load_in_4bits. Confusing the two leads to incorrect setup assumptions and potential double-quantization errors.
LoftQ initialization also requires GPU execution — it cannot be treated as a preprocessing step on CPU, per PEFT's quantization documentation.
Which modules to target and why adapter placement matters
Adapter placement directly controls which computations LoRA can modify. In a standard transformer, the main linear projections are the attention query/key/value and output projections ($W_q, W_k, W_v, W_o$) and the MLP feed-forward layers ($W_{\text{up}}, W_{\text{down}}, W_{\text{gate}}$ in Llama-style architectures). Targeting only a subset restricts which aspects of the model's learned representations the adapter can shift.
PEFT's documentation recommends: "To apply LoRA to all the linear layers, like in QLoRA, set target_modules=\"all-linear\" (easier than specifying individual modules by name which can vary depending on the architecture)." The convenience is real — architecture-specific module names vary across Llama 3.1, Mistral, Qwen, and Phi families, making "all-linear" a durable default.
| Target strategy | Modules adapted | Trainable params | Task fit |
|---|---|---|---|
| Attention-only | $W_q, W_k, W_v, W_o$ | Lowest | Instruction following, tone shifts |
| MLP-only | $W_{\text{up}}, W_{\text{down}}, W_{\text{gate}}$ | Moderate | Knowledge injection, factual recall |
| All-linear | All of the above | Highest | Broad domain adaptation, 4-bit QLoRA setups |
Attention-only targeting is sufficient for tasks that primarily change how the model routes and weighs information — style transfer, following a new instruction format. MLP targeting is more relevant when you need the model to recall or apply new factual associations. All-linear targeting maximizes coverage and is the default recommendation when quantization is involved.
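The three strategies in the table map onto LoraConfig as follows; the module names assume a Llama-style architecture and differ on other families.

```python
from peft import LoraConfig

# Attention-only: routing/weighting changes (Llama-style module names)
attn_only = LoraConfig(r=16, lora_alpha=32,
                       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

# MLP-only: knowledge injection
mlp_only = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["gate_proj", "up_proj", "down_proj"])

# All linear layers: the durable default, especially for 4-bit QLoRA setups
all_linear = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
```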
Why all-linear targeting matters for 4-bit fine-tunes
When the base model is quantized to 4-bit with BitsAndBytes, every linear layer carries quantization error. Restricting LoRA to attention layers only means the MLP layers' quantization errors remain permanently uncompensated during fine-tuning. For tasks where MLP pathways are critical — which covers most knowledge-intensive domains — this leaves a systematic performance deficit.
Production Note: PEFT's explicit recommendation of target_modules="all-linear" for QLoRA-style setups reflects this coverage argument. The added trainable parameters from including MLP layers are modest relative to the total adapter count (adapter parameters remain a small fraction of the base model), but the coverage benefit is substantial for tasks with significant domain shift. On a Llama 3.1 8B Instruct model with rank 16, all-linear targeting adds roughly 40 million trainable parameters — manageable on a 24GB GPU with a 4-bit base model.
When LoRA loses to full fine-tuning or other PEFT methods
LoRA's low-rank constraint is a feature in data-scarce, VRAM-constrained settings and a genuine limitation when the task requires broad plasticity across the model's weight space. The DoRA paper frames this gap directly: DoRA exists to "bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits" — an implicit acknowledgment that the gap is real and matters on certain tasks.
| Condition | LoRA | rsLoRA | DoRA | Full fine-tuning |
|---|---|---|---|---|
| Small dataset (< 5K examples), narrow domain | Strong | Marginal gain | Moderate gain | Overfits without careful regularization |
| Moderate dataset, significant domain shift | Adequate at rank ≥ 32 | Better at rank ≥ 32 | Best PEFT option | Strong if data supports it |
| Large dataset, broad capability shift | Underperforms | Underperforms | Partially closes gap | Preferred |
| 4-bit quantized base model | Workable with all-linear | Workable | Workable | Not feasible at 24 GB |
| Task requires new token vocabulary or embeddings | Weak (embeddings not adapted by default) | Weak | Weak | Required |
The rank-sensitivity regime deserves particular attention. On tasks that require the model to change a wide range of its internal representations — adapting from general-purpose text to a specialized schema language, or from English to a low-resource language — even high-rank LoRA may plateau below the full fine-tuning ceiling. The rsLoRA paper suggests that higher ranks can trade training compute for performance, but there is a practical ceiling at which the low-rank manifold simply cannot represent the necessary weight-space movement.
Common failure modes that look like a rank problem but are not
A flat or slowly improving loss curve is the most common symptom attributed to insufficient rank, but rank is usually the last thing to blame.
- Wrong target modules: If the task requires MLP-layer changes but only attention layers are targeted, increasing rank in attention projections will not help.
- Quantization mismatch: In 4-bit workflows without LoftQ, the adapter starts correcting quantization error before task-specific learning begins. The loss may appear to improve slowly because the early steps are spent recovering from initialization error.
- Learning rate mismatch: LoRA adapter parameters often train well at higher learning rates than the base model would tolerate (since the base is frozen). A learning rate tuned for full fine-tuning may be too conservative for the adapter path.
- Data quality: A noisy or mislabeled dataset produces the same slow-convergence signature as insufficient rank. rsLoRA and DoRA both amplify gradient signal — which amplifies noise as readily as signal.
Watch Out: Before increasing rank or switching to rsLoRA/DoRA, verify that your target modules actually cover the relevant computation paths, check that your effective scale ($\alpha/r$ or $\alpha/\sqrt{r}$) is in a reasonable range, and run a small data-quality audit. Changing rank without ruling out these causes wastes compute and delays diagnosis.
What an engineer should check before choosing rank and alpha
Rank and alpha selection is model-, dataset-, and task-specific. There is no universally optimal pair. The table below provides practical starting ranges as heuristics; treat them as informed starting points, not guarantees.
| Scenario | Starting rank | Starting alpha | rsLoRA | DoRA | Notes |
|---|---|---|---|---|---|
| Narrow domain, small dataset (< 10K) | 8–16 | 16–32 | No | Optional | Low rank sufficient; overfitting risk rises with rank |
| Moderate domain shift, medium dataset | 16–64 | 32–64 | Yes if rank ≥ 32 | If LoRA underperforms | All-linear targeting recommended |
| 4-bit base model (QLoRA-style) | 16–64 | 16–32 | Yes if rank ≥ 32 | Optional | Use all-linear; consider LoftQ init |
| Strong domain shift, large dataset | 64–128 | 64–128 | Yes | Consider | rsLoRA required at this rank range |
| Style/tone only, large base model | 4–16 | 8–16 | No | No | Attention-only targeting may suffice |
When using Hugging Face PEFT directly, LoraConfig exposes r, lora_alpha, use_rslora, and use_dora as top-level parameters; target_modules="all-linear" is the recommended default for QLoRA-style setups. TRL's SFTTrainer accepts a LoraConfig directly. Axolotl exposes lora_r, lora_alpha, and lora_target_modules in its YAML config, with use_rslora and use_dora as boolean flags. Unsloth wraps these into its FastLanguageModel.get_peft_model call with equivalent parameters, adding kernel-level optimizations on top.
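A minimal end-to-end PEFT sketch tying these pieces together; the model ID and hyperparameter values are examples, not prescriptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,
    lora_alpha=32,               # effective scale = 2 under the default alpha/r rule
    use_rslora=False,
    use_dora=False,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # verify the adapter is a small fraction of the base model
```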
Pro Tip: Start with rank 16 and alpha 32 (effective scale = 2) with rsLoRA disabled. Establish a baseline loss curve. If training appears sluggish relative to a lower-rank pilot, enable rsLoRA before increasing rank further. If the task ceiling is clearly below full fine-tuning quality even at rank 64 with rsLoRA, try DoRA at the same rank before committing to higher ranks or full fine-tuning.
How to read the loss curve for adapter capacity
A well-functioning adapter shows a smooth, monotonically decreasing training loss with a validation curve that tracks within a reasonable gap. Three specific patterns indicate rank or scaling problems:
Early plateau: Loss drops quickly in the first 10–20% of training then stalls. This often indicates the adapter has exhausted its rank capacity early. Increasing rank or switching to DoRA typically addresses it.
Sluggish descent: Loss decreases but far more slowly than a lower-rank baseline at the same learning rate. Per the rsLoRA paper, this is the signature of high-rank adapters under the original $\alpha/r$ scaling — the effective gradient contribution has shrunk to the point where learning is artificially slow. Enabling rsLoRA typically restores the expected descent rate.
Divergent validation: Training loss continues to decrease while validation loss stops improving or increases. This is not a rank problem — it is overfitting, usually from too-large a rank relative to dataset size, or from a high learning rate compounded by large effective scale.
Pro Tip: A plateau-then-slow-recovery pattern — where loss stalls for several hundred steps and then resumes gradual descent — often indicates the adapter is working around a quantization error initialization gap rather than a rank ceiling. If you observe this in a 4-bit workflow, LoftQ initialization may collapse the plateau by providing a better-conditioned starting point. DoRA can also reduce this effect through its magnitude reparameterization, since the magnitude component provides an additional degree of freedom for early-step correction.
FAQ on LoRA internals, rsLoRA, DoRA, and LoftQ
How does LoRA work under the hood?
LoRA inserts two matrices $A$ and $B$ in parallel with a frozen base weight. The forward pass computes $W_0 x + \frac{\alpha}{r} BAx$; gradients update only $A$ and $B$. The base weight is never modified during training.
What is the difference between LoRA and rsLoRA?
Only the scaling denominator changes: standard LoRA scales $\Delta W$ by $\alpha/r$; rsLoRA scales by $\alpha/\sqrt{r}$. rsLoRA stabilizes gradient magnitude as rank increases, preventing the undertraining that the original rule causes at high ranks.
What is DoRA in PEFT?
DoRA decomposes the pretrained weight into magnitude and direction, then uses LoRA specifically for directional updates. As the DoRA paper states: "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters." The practical benefit is better performance at the same rank, particularly when plain LoRA's single low-rank path cannot approximate the required weight-space geometry.
Is LoRA better than full fine-tuning?
Not in general. LoRA is better when VRAM is constrained, datasets are small, or adapter reuse across deployments matters. Full fine-tuning is better when the task requires broad representational change and sufficient data and compute are available. DoRA partially closes the gap; it does not close it entirely.
What is LoftQ used for in LoRA?
LoftQ addresses quantization-aware initialization. When the base model is quantized to 4-bit, standard LoRA initializes with zero adapter contribution while the base model already carries quantization error. LoftQ jointly optimizes the quantized weight and the initial $BA$ product to approximate the original float precision weight, giving fine-tuning a better starting point. It is distinct from the QLoRA training workflow and from BitsAndBytes' load_in_4bits loading mechanism — using one does not imply or require the other.
Bottom Line: rsLoRA is a one-line config change (use_rslora=True) that pays off at rank ≥ 32. DoRA is worth testing when LoRA quality is consistently below target and full fine-tuning is not an option. LoftQ is the right initialization for any 4-bit fine-tuning workflow where starting-point quality matters.
Sources and references
- Hugging Face PEFT LoRA Developer Guide — Canonical implementation documentation for LoRA in PEFT, including rsLoRA and DoRA flags, target-module configuration, and get_peft_model usage
- Hugging Face PEFT LoRA Package Reference — API-level documentation specifying the lora_alpha/r default and the lora_alpha/math.sqrt(r) rsLoRA scaling
- Hugging Face PEFT v0.13.0 LoRA Developer Guide (mirrored) — Versioned snapshot confirming the target_modules="all-linear" recommendation for QLoRA-style setups
- Hugging Face PEFT Quantization Guide (v0.8.0) — LoftQ configuration documentation, including LoftQConfig, loftq_bits=4, and the GPU-only initialization requirement
- rsLoRA: A Rank Stabilization Scaling Factor for Fine-Tuning of Large Language Models — arXiv 2312.03732 — Original rsLoRA paper by Kalajdzievski; source for the claim that $\alpha/r$ scaling stunts learning at high ranks
- DoRA: Weight-Decomposed Low-Rank Adaptation — arXiv 2402.09353 — Original DoRA paper; source for magnitude/direction decomposition and the performance-gap-to-full-fine-tuning framing
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models — arXiv 2310.08659 — Original LoftQ paper; source for quantization-aware initialization methodology and the distinction from plain QLoRA
- NVIDIA Developer Blog: Introducing DoRA — NVIDIA's independent technical overview of DoRA and its performance characteristics relative to LoRA


