At a Glance: What you need before you start
At a Glance: Time: 2–4 hours for setup and first run · Prereqs: Python 3.10+, familiarity with Hugging Face Transformers and basic training loops · Hardware: single NVIDIA GPU with 24GB VRAM (RTX 3090, RTX 4090, A10G, or equivalent); 12GB cards can work only with stricter QLoRA settings and smaller models · RAM: 32GB system RAM minimum · Storage: 50GB free SSD space for base weights, tokenizer, dataset, checkpoints, and exported adapters · OS: Linux strongly preferred; Windows is supported by most packages but is significantly less stable for bitsandbytes and CUDA combinations · Cost: $0 if you own the hardware; cloud spot instances with 24GB cards (A10G) run roughly $0.50–$1.50/hr
A 24GB GPU is enough to fine-tune Llama 3 8B, Qwen2.5 7B, and Mistral 7B using LoRA or QLoRA — but only if you respect the constraints on sequence length, batch size, and quantization configuration. The Hugging Face PEFT library documents LoRA as a low-rank decomposition method that "reduces the number of trainable parameters, speeds up finetuning, and uses less memory" — the mechanism that makes 7B/8B-class fine-tuning tractable on a single consumer card. The QLoRA paper (Dettmers et al.) demonstrated this scaling principle at the extreme: memory for fine-tuning a 65B model dropped from over 780GB to under 48GB, which contextualizes exactly how powerful the NF4 + LoRA combination is at the sizes you'll actually run on 24GB.
What a 24GB budget can and cannot fit
A 24GB card comfortably handles QLoRA on 7B–8B-class models: Llama 3.1 8B Instruct, Qwen2.5 7B Instruct, and Mistral 7B Instruct v0.3 all fit with room for reasonable sequence lengths and gradient accumulation. The practical ceiling for unquantized BF16 LoRA sits around the same model class — a 7B model in BF16 consumes roughly 14GB of VRAM for weights alone, leaving headroom for optimizer states, activations, and the adapter itself.
What does not fit: full fine-tuning of any model at this scale (optimizer states alone for a 7B model in AdamW exceed 24GB), and any parameter class above ~13B in a standard QLoRA configuration on 24GB. The QLoRA paper's 65B benchmark required a 48GB card — still double your budget even with NF4, double quantization, and paged optimizers active. Do not attempt 70B-class Llama 3 on a single 24GB card.
Production Note: Model class is only one VRAM dimension. Max sequence length, micro-batch size, and the number of target modules you adapt all directly affect peak allocation. A 7B model with max_seq_length=4096 and per_device_train_batch_size=4 will OOM on 24GB. The same model with max_seq_length=1024 and per_device_train_batch_size=2 will not. Treat sequence length as the primary knob, not model size.
QLoRA trades training speed for memory. Expect slower per-step throughput compared to plain BF16 LoRA because the quantized base weights require dequantization at each forward pass. If your model already fits in 24GB with BF16 LoRA, plain LoRA is often the faster and simpler choice.
Install the single-GPU training stack
The core stack — Hugging Face Transformers, PEFT, bitsandbytes, Accelerate, and TRL — is well integrated. As the PEFT repository states, "PEFT is integrated with Transformers for easy model training and inference ... and Accelerate for distributed training and inference for really big models." On a single GPU, Accelerate handles launch configuration and mixed-precision bookkeeping; TRL provides the SFTTrainer abstraction that respects chat templates and applies loss masking correctly.
Choose Linux, CUDA, and the right driver version
Linux with a validated CUDA 12.x driver is the correct operating environment for this stack. bitsandbytes links against CUDA at install time, and a driver/CUDA version mismatch — for example, a PyTorch build compiled against CUDA 12.1 loaded on a driver that only supports CUDA 11.8 — will either fail at import or produce silent numerical errors during the 4-bit quantization pass.
Watch Out: Windows builds of bitsandbytes lag behind Linux releases and have documented CUDA kernel instability on consumer GPUs. If you are on Windows and see RuntimeError: CUDA error: device-side assert triggered during the first forward pass, the driver/library mismatch is the likely cause. Switch to Linux or use WSL2 with a validated CUDA toolkit before investing time in hyperparameter tuning.
Use nvidia-smi to confirm your installed driver version, then cross-reference with the PyTorch CUDA compatibility table before installing. Pin all packages to a tested combination rather than installing latest-of-everything.
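A quick way to confirm from Python that the installed PyTorch build and the driver agree, as a minimal sketch (printed versions will vary with your environment):
import torch
# CUDA version this PyTorch wheel was compiled against (e.g. "12.1")
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
# False here usually means a driver/runtime mismatch or a CPU-only wheel
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")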
Install Transformers, PEFT, bitsandbytes, Accelerate, and TRL
$ pip install torch==2.3.0 torchvision --index-url https://download.pytorch.org/whl/cu121
$ pip install transformers==4.44.0 \
peft==0.13.0 \
bitsandbytes==0.43.3 \
accelerate==0.34.0 \
trl==0.10.1 \
datasets==2.21.0
Version-pin every package. bitsandbytes and PEFT minor versions frequently change behavior around quantization initialization, adapter merging, and target_modules resolution. PEFT v0.13.0 is the version referenced by the official LoRA documentation used throughout this guide.
TRL's SFTTrainer is mandatory for supervised fine-tuning workflows that use chat templates — it applies DataCollatorForCompletionOnlyLM correctly and handles the apply_chat_template call on your tokenizer. Doing this manually is error-prone and the most common source of silent label leakage.
Optional Unsloth path for faster iteration
Unsloth advertises "Finetune 2X faster on a single NVIDIA GPU" through custom CUDA kernels for the LoRA forward and backward passes. The throughput claim is roughly accurate for attention-heavy models like Llama 3 and Mistral on modern Ampere/Ada cards when sequence lengths are short.
Pro Tip: Use Unsloth when iteration speed is the constraint and you have already validated your QLoRA configuration on the standard HF stack. Unsloth's custom kernels can obscure VRAM allocation behavior — peak memory reported by torch.cuda.max_memory_allocated() may not match what you'd see in a standard run. If you are debugging an OOM or trying to establish reliable memory budgets for a production training job, run one reference epoch with the plain HF stack first before switching.
Unsloth does not support every adapter variant (DoRA support in particular has lagged the PEFT reference implementation), so confirm compatibility before committing to it for your architecture.
Pick the base model and training mode
The choice between Llama 3, Qwen, and Mistral is primarily a function of your dataset's language distribution, the instruction format your use case requires, and whether you need post-training instruction fidelity or raw generation quality. All three fit on 24GB with QLoRA; the decision table below covers when plain BF16 LoRA is the better path.
| Model | Params | BF16 LoRA fits 24GB? | QLoRA fits 24GB? | Best use case |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | Yes, at short seq lengths | Yes, comfortably | English instruction tuning, code adaptation |
| Qwen2.5 7B Instruct | 7B | Yes | Yes | Multilingual tasks, instruction fidelity |
| Mistral 7B Instruct v0.3 | 7B | Yes | Yes | Throughput-sensitive fine-tuning, simple baselines |
Use plain BF16 LoRA when model weights plus optimizer state fit within 24GB at your target sequence length — usually true for 7B/8B models at max_seq_length ≤ 1024 with batch size 1–2 and gradient accumulation. Use QLoRA when pushing to longer sequences, larger effective batch sizes, or when your card has less headroom due to background processes.
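For the plain BF16 LoRA path, the base model loads unquantized. A minimal sketch, using the Llama 3.1 8B checkpoint referenced throughout this guide:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# No quantization_config: weights alone occupy roughly 16GB for an 8B model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)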
Llama 3: when 8B is the safe default
Llama 3.1 8B Instruct is the practical default for English-language domain adaptation on a 24GB card. At 8B parameters in BF16, weights alone consume approximately 16GB, which leaves 8GB for activations, optimizer states, and the LoRA adapter modules — sufficient for batch size 1–2 with max_seq_length=1024 and gradient accumulation. With QLoRA, that weight footprint drops to roughly 5–6GB, giving you significantly more room.
Pro Tip: Llama 3's 70B variant is not a realistic target on a single 24GB card in any configuration — even with 4-bit NF4, a 70B model requires approximately 40GB for weights alone. If you need a larger Llama 3 variant, move to a multi-GPU setup or use a cloud instance with 80GB+ VRAM. For single-card work, 8B is the ceiling you can push with confidence.
Qwen: when multilingual or instruction fidelity matters
Qwen2.5 7B Instruct is the strongest option when your fine-tuning dataset contains non-English text or when your use case demands tight instruction-following behavior (legal clause extraction, structured output generation, or domain-specific QA with precise output schemas).
Pro Tip: Qwen-series models are sensitive to prompt format. Qwen2.5 uses its own chat template ({% for message in messages %}...) that differs from Llama 3's. If your dataset's system/user/assistant role formatting does not match the tokenizer's built-in chat template, you will see degraded instruction following from the first epoch — the loss curve may look healthy while task accuracy drops. Always call tokenizer.apply_chat_template() on your dataset with tokenize=False and inspect five examples before training.
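To catch a template mismatch before training, render a few rows through the tokenizer's built-in template and read them. A sketch, assuming the messages-formatted dataset described in the data-loading step later in this guide:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Render five examples with the model's own template and inspect the role markers
for row in dataset.select(range(5)):  # `dataset` is your messages-formatted training set
    print(tokenizer.apply_chat_template(row["messages"], tokenize=False))
    print("-" * 40)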
Mistral: when throughput and simplicity matter
Mistral 7B Instruct v0.3 remains the simplest baseline for single-GPU fine-tuning: straightforward architecture, no tied embeddings, and well-understood behavior under LoRA. Per-step throughput is typically higher than Llama 3 8B at the same sequence length because the model is slightly smaller.
Watch Out: Mistral 7B used sliding window attention in its original configuration; v0.3 drops it in favor of standard full attention. The primary VRAM pressure on Mistral runs is still sequence length. Setting max_seq_length=2048 on a 24GB card with batch size > 1 will cause OOMs even under QLoRA. Start at 512–1024 and verify memory headroom before extending context.
Set up QLoRA for the tightest VRAM budget
QLoRA's memory reduction comes from three components working together: 4-bit NormalFloat (NF4) quantization of the frozen base model weights, double quantization (quantizing the quantization constants themselves), and paged optimizers that offload optimizer state to CPU RAM under memory pressure. Together, as Dettmers et al. showed, these techniques can "reduce memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance". At the 7B/8B scale on 24GB, the same techniques provide comfortable headroom instead of a tight squeeze.
The configuration block that enables the full QLoRA setup:
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 is the correct type for QLoRA
bnb_4bit_use_double_quant=True, # double quantization saves ~0.4 bits/param
bnb_4bit_compute_dtype=torch.bfloat16, # compute in BF16 for numerical stability
)
Pass this config to AutoModelForCausalLM.from_pretrained() via quantization_config=bnb_config. Do not set load_in_4bit=True without specifying bnb_4bit_quant_type="nf4" — the default FP4 path uses a different quantization grid and produces lower-quality fine-tuned adapters.
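Wiring the config into the model load looks roughly like this sketch (prepare_model_for_kbit_training is revisited in the training-step checks below):
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)

# Required before attaching adapters to a 4-bit model: casts norms to full precision
# and enables the input gradients needed for gradient checkpointing
model = prepare_model_for_kbit_training(model)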
Why NF4 beats generic 4-bit for this workflow
NF4 is a quantization data type specifically designed for normally distributed neural network weights. Weights in pre-trained LLMs follow approximately zero-mean normal distributions, and NF4's quantization grid is information-theoretically optimal for that distribution — meaning each 4-bit bin captures an equal probability mass of the weight distribution rather than equal value intervals. Generic 4-bit (integer quantization) uses equal-width intervals, which wastes precision in the high-density central region of the distribution.
The PEFT LoRA documentation explicitly recommends NF4 for quantized training, and the QLoRA paper's benchmarks were obtained using NF4 — not a generic 4-bit alternative. The PEFT guide also documents additional adapter initialization options including rsLoRA (which scales adapter contributions by 1/√r for more stable training at higher ranks) and DoRA (which decomposes weight updates into magnitude and direction components).
Watch Out: Aggressive quantization on domain-sensitive instruction tuning tasks can degrade output quality even when training loss looks acceptable. If your fine-tuning dataset requires precise structured output — JSON schemas, legal clause formatting, code syntax — run a qualitative eval on 20–30 held-out examples after the first epoch. NF4 with double quantization is the least aggressive 4-bit option; do not downgrade to 2-bit or INT4 without domain-specific quality validation.
Use target_modules="all-linear" when you want QLoRA-style coverage
The target_modules parameter controls which weight matrices receive LoRA adapters. Targeting a narrow set (e.g., only q_proj and v_proj) reduces trainable parameters and VRAM but limits the adapter's capacity to adjust model behavior. For most domain adaptation tasks on 24GB, broader coverage produces better results.
The PEFT documentation states: "To apply LoRA to all the linear layers, like in QLoRA, set target_modules=\"all-linear\"." This is the recommended default for QLoRA-style training and maximizes adapter coverage on a quantized base. (LoftQ, documented in the same PEFT guide, is a separate quantization-aware initialization option rather than a targeting strategy.)
from peft import LoraConfig, TaskType
lora_config = LoraConfig(
r=16, # rank; 8–32 is typical for 24GB
lora_alpha=32, # scaling factor; set to 2x rank as a starting point
target_modules="all-linear", # cover all linear layers per PEFT/QLoRA recommendation
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
# use_rslora=True, # uncomment for rsLoRA scaling at higher ranks
)
target_modules="all-linear" is also architecture-agnostic — it works across Llama 3, Qwen2.5, and Mistral without you needing to enumerate model-specific layer names like gate_proj, up_proj, or k_proj separately. This is the safer default when moving between model families.
If you want DoRA instead of standard LoRA, add use_dora=True to the config. Be aware that PEFT documentation notes "DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference."
Tune sequence length, batch size, and gradient accumulation
Sequence length is the dominant VRAM variable — more so than rank, batch size, or the number of target modules. In a standard (non-Flash) attention implementation, attention activation memory scales quadratically with sequence length, so doubling max_seq_length from 512 to 1024 roughly quadruples that component of activation memory before any other changes.
Production Note: On a 24GB card with a 7B/8B QLoRA setup, max_seq_length=512 is the safe starting point. At that length, batch size 2–4 with gradient_accumulation_steps=8 gives an effective batch of 16–32 without requiring multi-GPU. Moving to max_seq_length=1024 typically requires dropping to per_device_train_batch_size=1. Moving to max_seq_length=2048 will require verifying headroom with torch.cuda.max_memory_allocated() before committing to a long run. Sequence length is the first knob to reduce when an OOM occurs — not rank, not batch size.
Use gradient accumulation to compensate for small micro-batches. An effective batch size of 16–32 produces stable gradients for instruction tuning without requiring VRAM for large per-step batches.
Load the dataset in the format your trainer expects
TRL's SFTTrainer expects either a text column containing pre-formatted strings or a messages column containing lists of role/content dictionaries that apply_chat_template can process. The second format is cleaner and less error-prone.
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the tokenizer for the base model you plan to fine-tune
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Dataset should have a "messages" column with role/content dicts
# Example row: {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
dataset = load_dataset("json", data_files={"train": "data/train.jsonl"}, split="train")

# Verify a sample before training
print(tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False))
Keep dataset files on SSD, not a network mount. Slow data loading stalls GPU utilization on a single-card setup where there is no other GPU to prefetch for.
Use chat-style messages for instruction tuning
Format your dataset with explicit system, user, and assistant roles aligned to the target model's chat template. This is not stylistic preference — it is mechanically required. The tokenizer's apply_chat_template() method inserts special tokens (BOS, EOS, role delimiters) based on these role labels. An incorrect role name or a missing system turn where the model expects one will silently alter which tokens receive loss weight.
Pro Tip: Lock in your chat template before building the full dataset. Print tokenizer.apply_chat_template([{"role": "user", "content": "test"}, {"role": "assistant", "content": "response"}], tokenize=False, add_generation_prompt=False) and confirm the output matches the model card's expected format. Llama 3, Qwen2.5, and Mistral v0.3 all use different delimiter tokens — mixing templates is a silent failure mode that produces a flat loss curve rather than an obvious error.
Map labels, padding, and packing correctly
Mask out prompt tokens from the loss computation so the model only trains on assistant responses. TRL's SFTTrainer handles this automatically when you provide a messages column and set dataset_text_field=None with a chat template, but verify it is doing so by checking that the DataCollatorForCompletionOnlyLM is active.
Watch Out: Accidental label leakage — training on the prompt tokens as well as the completion — can make loss curves look better than they are while producing a model that echoes prompts at inference time. Also, overly long examples (any example exceeding max_seq_length) will be truncated, and if truncation cuts the assistant response, the model trains on a partial completion with no EOS token. Filter your dataset to remove examples where the full formatted length exceeds your max_seq_length before training.
Sequence packing (concatenating multiple short examples into one training sample) improves GPU utilization but requires correct attention masking between packed examples. If using packing, verify that cross-contamination between examples is blocked.
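If you format examples as plain text yourself rather than relying on the messages column, TRL's completion-only collator is the piece that enforces the masking. A sketch, assuming Llama 3's assistant header as the response delimiter (check your model's template before reusing it):
from trl import DataCollatorForCompletionOnlyLM

# Tokens before this template string are excluded from the loss (label = -100)
response_template = "<|start_header_id|>assistant<|end_header_id|>"  # Llama 3 style; Qwen/Mistral differ
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)  # tokenizer loaded earlier

# Note: TRL's completion-only masking does not work with packing; pick one or the other.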
Run the training step and verify that it is actually learning
Launch training through Accelerate or directly via Python. The standard single-GPU invocation:
$ accelerate launch --num_processes 1 --mixed_precision bf16 train.py
After the first forward pass completes, verify that LoRA adapter parameters are the only trainable weights:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
# Expected: ~0.1%–2% trainable for typical LoRA configs on 7B/8B models
If trainable equals total, prepare_model_for_kbit_training() was not called or get_peft_model() was not applied. This will immediately OOM on a 24GB card.
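For reference, the expected ordering looks like this sketch (assuming the 4-bit model load and the lora_config from the earlier steps):
from peft import prepare_model_for_kbit_training, get_peft_model

# 1) Load the 4-bit base with quantization_config=bnb_config, then:
model = prepare_model_for_kbit_training(model)  # fixes gradient checkpointing and input grads on k-bit models
model = get_peft_model(model, lora_config)      # attaches LoRA adapters; only these are trainable
model.print_trainable_parameters()              # should report well under 2% trainable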
Start with a conservative run configuration
# starter_config.yaml — safe defaults for 7B/8B QLoRA on 24GB
model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # or Qwen/Qwen2.5-7B-Instruct / mistralai/Mistral-7B-Instruct-v0.3
max_seq_length: 512
per_device_train_batch_size: 2
gradient_accumulation_steps: 8 # effective batch = 16
num_train_epochs: 3
learning_rate: 2.0e-4
lr_scheduler_type: "cosine"
warmup_ratio: 0.05
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: "all-linear"
load_in_4bit: true
bnb_4bit_quant_type: "nf4"
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: "bfloat16"
fp16: false
bf16: true
output_dir: "./checkpoints"
save_strategy: "epoch"
logging_steps: 10
These settings are conservative by design. Learning rate 2e-4 is a standard LoRA starting point; too high (above 5e-4) frequently causes loss spikes in the first 100 steps on instruction-tuning datasets. Rank 16 gives adequate adapter capacity for most domain adaptation tasks without excessive memory overhead.
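The train.py that consumes these settings can be fairly small. A sketch assuming TRL 0.10's SFTConfig and the model, tokenizer, dataset, and lora_config objects built in the earlier steps:
from trl import SFTTrainer, SFTConfig

args = SFTConfig(
    output_dir="./checkpoints",
    max_seq_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,              # 4-bit base prepared with prepare_model_for_kbit_training
    tokenizer=tokenizer,
    train_dataset=dataset,    # messages-formatted dataset from the loading step
    peft_config=lora_config,  # SFTTrainer attaches the adapters internally
    args=args,
)
trainer.train()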
Watch the loss curve and memory telemetry
Pro Tip: In the first 50–100 steps, loss should be noisy but trend downward from its initial value (typically 2.0–3.5 for a well-formatted instruction dataset on a 7B model). A flat loss after 200 steps indicates a configuration error — usually a learning rate that is too low, label leakage masking real gradient signal, or an adapter that is not attached to the correct modules. A loss that explodes past 5.0 within the first 50 steps indicates a learning rate that is too high or a dataset with malformed chat template tokens. Monitor torch.cuda.max_memory_allocated() and log it every 50 steps — if peak memory is growing across epochs, you have a memory leak, often caused by not clearing gradient state or by accumulating tokenized examples in memory.
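One way to get that telemetry without editing the training loop is a small callback. A sketch using transformers' TrainerCallback interface:
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    """Logs peak allocated VRAM every `every` steps; growth across epochs suggests a leak."""
    def __init__(self, every=50):
        self.every = every

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every == 0:
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak VRAM {peak_gb:.2f} GB")

# trainer.add_callback(PeakMemoryCallback())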
Troubleshoot the failures that usually burn a 24GB run
Four failure modes account for the majority of broken single-GPU fine-tuning runs: OOM errors, slow steps, poor adapter quality, and Windows-specific instability.
Watch Out: The most time-consuming failure mode is not an OOM — it is a run that completes normally but produces an adapter that does not improve over the base model. This happens when: (1) the dataset uses the wrong chat template, (2) label masking is not applied and the model is training on prompt tokens, or (3) get_peft_model() was called but prepare_model_for_kbit_training() was not, leaving the base model's gradient checkpointing misconfigured. Confirm adapter attachment and label masking before running for more than one epoch.
Diagnose the environment before a long run:
$ python -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"
$ python -c "import bitsandbytes as bnb; print(bnb.__version__)"
$ nvidia-smi --query-gpu=memory.total,memory.free --format=csv
Fix out-of-memory errors without shrinking quality too much
Production Note: When an OOM occurs, apply reductions in this priority order: (1) halve max_seq_length first — this recovers the most memory for the least quality cost; (2) reduce per_device_train_batch_size to 1; (3) enable or verify gradient_checkpointing=True; (4) reduce LoRA rank from 16 to 8 only as a last resort. Rank reduction directly limits adapter expressivity; sequence length reduction is usually a much smaller quality penalty unless your task genuinely requires long context. If the run still does not fit after all of these, switch from plain BF16 LoRA to QLoRA with NF4.
If gradient_checkpointing=True is not already active, enabling it trades recomputation for memory and typically recovers 3–5GB on 7B/8B models at moderate sequence lengths.
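If you enable it on the model object directly rather than through the trainer arguments, also disable the KV cache, as in this minimal sketch:
# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the generation cache is incompatible with gradient checkpointing during training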
Know when plain LoRA beats QLoRA
| Dimension | Plain BF16 LoRA | QLoRA (NF4 + double quant) |
|---|---|---|
| Training speed | Faster — no dequantization overhead | Slower — base weights dequantized each forward pass |
| VRAM at 7B/8B | ~16–18GB (weights + optimizer + adapter) | ~8–10GB (weights + optimizer + adapter) |
| Adapter quality | Slightly higher — base model in full precision | Marginally lower — base in 4-bit; practically negligible on most tasks |
| Best fit | Model + states fit within 24GB at your seq length | Tight budget, longer sequences, or 12GB card |
| Recommended when | max_seq_length ≤ 1024, batch size ≥ 2 | max_seq_length > 1024 or VRAM is the binding constraint |
"LoRA reduces the number of trainable parameters, speeds up finetuning, and uses less memory" — this is accurate for both variants, but plain LoRA delivers faster iteration when memory is not the bottleneck. Use QLoRA specifically when VRAM is the binding constraint.
Export, merge, and test the adapter safely
Save the adapter and, optionally, the tokenizer:
$ python -c "
from peft import AutoPeftModelForCausalLM
import torch
model = AutoPeftModelForCausalLM.from_pretrained(
'./checkpoints/final',
torch_dtype=torch.bfloat16,
device_map='auto',
)
model.save_pretrained('./adapter_only') # saves adapter weights only (~50–200MB)
"
To merge adapter weights into the base model for simpler deployment:
merged = model.merge_and_unload() # merges LoRA weights into base, removes adapter overhead
merged.save_pretrained("./merged_model", safe_serialization=True)
Merging produces a standard HuggingFace model directory that any inference framework (vLLM, llama.cpp, TGI) can load without PEFT installed.
Test the tuned model on your domain prompts
Run at least 20–30 held-out prompts that represent real deployment traffic before declaring the fine-tune successful. Check three properties independently: tone consistency (does the model respond in the intended register?), instruction following (does it complete the task as specified?), and factual accuracy within your domain (does it hallucinate domain-specific terms or values that were not in training data?).
Pro Tip: A model can overfit stylistically — matching the tone of your training data — while performing worse than the base model on task accuracy. Use a held-out evaluation set that was never in the training split, and score it both qualitatively (manual inspection of 10 examples) and quantitatively (exact-match or ROUGE-L on structured tasks). If task accuracy regresses, your dataset is too small, your rank is too high relative to dataset size, or your learning rate ran too many epochs.
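A minimal quantitative pass over a held-out set might look like the following sketch, using exact-match scoring; eval_rows is a hypothetical list of message/reference pairs you supply:
import torch

def generate(model, tokenizer, messages, max_new_tokens=256):
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# eval_rows: [{"messages": [...user turns...], "reference": "expected output"}, ...]
exact = sum(generate(model, tokenizer, r["messages"]).strip() == r["reference"].strip()
            for r in eval_rows)
print(f"exact match: {exact}/{len(eval_rows)}")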
Decide whether to keep the adapter separate or merge weights
Keep the adapter separate if you need fast rollback, plan to A/B test multiple adapters against the same base, or want minimal deployment storage. Adapter files for a rank-16, all-linear config on a 7B model are typically under 200MB — trivial to version and swap.
Watch Out: Merging weights too early in the development cycle locks you into a single checkpoint and makes A/B testing expensive. Wait until the adapter has been validated on held-out data before merging. Also note that if you used DoRA during training, PEFT recommends merging before inference specifically because DoRA's magnitude-direction decomposition adds inference overhead that disappears post-merge. For standard LoRA and QLoRA adapters, both deployment modes are equally valid.
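For the separate-adapter path, inference loads the base model and attaches the saved adapter on top. A sketch assuming the ./adapter_only directory saved earlier:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_name)

# Attach the trained adapter; swap "./adapter_only" for any A/B variant you want to test
model = PeftModel.from_pretrained(base, "./adapter_only")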
Questions readers usually ask after the first run
At a Glance — FAQ:
Can you fine-tune Llama 3 on a 24GB GPU?
Yes. Llama 3.1 8B Instruct fine-tunes on a single 24GB card with either plain BF16 LoRA (at max_seq_length ≤ 1024) or QLoRA NF4 (at longer sequences). The 70B variant is not viable on 24GB.
Is QLoRA better than LoRA for 24GB VRAM?
Depends on your bottleneck. QLoRA uses roughly half the VRAM of plain BF16 LoRA on 7B/8B models, but trains slower due to dequantization overhead. If BF16 LoRA fits at your target sequence length, plain LoRA is faster. If you are pushing longer sequences or working on a 12GB card, QLoRA is the correct choice.
What GPU do I need to fine-tune Mistral or Qwen?
24GB is the practical sweet spot for Mistral 7B Instruct v0.3 and Qwen2.5 7B Instruct at reasonable sequence lengths. A 12GB card (RTX 3080/4070) can work under strict QLoRA settings with max_seq_length ≤ 512 and batch size 1, but provides no room for error.
How much RAM do I need for LoRA fine-tuning?
32GB system RAM is the minimum for loading model weights, dataset, and the training process simultaneously. 64GB is more comfortable if your dataset exceeds a few hundred thousand examples.
What is the best target_modules for QLoRA?
target_modules="all-linear" is the recommended default. It covers all linear projection layers across all three model families (Llama 3, Qwen2.5, Mistral) without requiring architecture-specific layer enumeration, and matches the QLoRA paper's experimental setup. Per the PEFT LoRA docs: "To apply LoRA to all the linear layers, like in QLoRA, set target_modules=\"all-linear\"."
Sources & References
Pro Tip: Use the official PEFT LoRA developer guide and the QLoRA paper as the primary implementation references for LoRA, NF4, double quantization, rsLoRA, DoRA, and target_modules="all-linear".
- Hugging Face PEFT LoRA Developer Guide (v0.13.0) — canonical reference for target_modules, NF4 recommendations, rsLoRA, DoRA, and LoftQ initialization options
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. (arXiv:2305.14314) — source of NF4, double quantization, paged optimizer claims, and the 65B/48GB benchmark
- Hugging Face PEFT Repository — integration documentation for Transformers and Accelerate
- PEFT Package Reference: LoRA (v0.10.0) — DoRA merge recommendation and adapter variant API surface
- Unsloth Official Site — source of the 2× faster training claim and optional acceleration path
- Meta AI — Llama 3 — Llama 3 model family and checkpoint documentation
- Qwen2.5 Documentation — Qwen2.5 model family and chat template specifications
- Mistral AI — Mistral 7B Instruct v0.3 model documentation
Keywords: Llama 3 8B Instruct, Llama 3.1 8B Instruct, Qwen2.5 7B Instruct, Mistral 7B Instruct v0.3, Hugging Face Transformers, PEFT v0.13.0, bitsandbytes, Accelerate, TRL, LoRA, QLoRA, NF4, double quantization, Unsloth, Axolotl



