Why Full Fine-Tuning Breaks

A modern LLM like LLaMA-3 70B contains ~70 billion parameters. Full supervised fine-tuning requires storing the model weights, gradients, and optimizer states (Adam stores two moment vectors per parameter). In fp32 that means roughly 1.1 TB of memory, about 280 GB each for the weights, the gradients, and the two Adam moment tensors combined with the weights' copy, far beyond even the largest single-GPU setups.
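As a sanity check, here is that arithmetic as a short Python sketch (the 70B parameter count and fp32 assumption come from the paragraph above; activation memory is ignored):

# Back-of-envelope fp32 memory for full fine-tuning with Adam
params = 70e9        # ~70B parameters
bytes_per_value = 4  # fp32

weights    = params * bytes_per_value      # model weights
gradients  = params * bytes_per_value      # one gradient per weight
opt_states = 2 * params * bytes_per_value  # Adam: first + second moments

total_gb = (weights + gradients + opt_states) / 1e9
print(f"~{total_gb:,.0f} GB")  # ~1,120 GB, before counting activations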

Full Fine-Tuning

  • All ~70B parameters updated each step
  • Optimizer states: 2× model size (Adam)
  • Multi-hundred GB VRAM requirement
  • Risk of catastrophic forgetting
  • Separate checkpoint per task (huge storage)
  • Impractical on consumer hardware

LoRA Fine-Tuning

  • Only adapter matrices trained (~0.1–1% of params)
  • Base model weights frozen — no optimizer states for them
  • Fits on a single 24 GB GPU for 7B models
  • Base knowledge preserved; adapters are modular
  • Tiny adapter files (~10–50 MB) per task
  • Runs efficiently on consumer GPUs

“The intrinsic dimensionality of the task-specific update is surprisingly low. We can decompose the weight change into a product of two small matrices — and lose almost nothing.” — Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021)

The Core Mathematics

In a standard transformer layer, a weight matrix W ∈ ℝ^(d×k) is updated during fine-tuning as W′ = W + ΔW. Full fine-tuning learns this entire ΔW. LoRA constrains it:

// Full fine-tuning update (expensive)
ΔW ∈ ℝ^(d×k) → up to d×k trainable params

// LoRA decomposes ΔW into two low-rank matrices
ΔW = B × A

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k)

// Trainable params: r×d + r×k vs d×k
// For d=4096, k=4096, r=4: ~32K vs ~16.7M

The key insight is rank decomposition: if the true task-adaptation signal lives in a low-dimensional subspace, then ΔW can be well-approximated by a low-rank matrix. Empirically, this holds for a wide range of NLP tasks.
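To make that intuition concrete, here is a small numpy sketch (synthetic data, purely illustrative): it builds an update whose signal is concentrated in a few directions and checks how well the best rank-r approximation, the truncated SVD, recovers it.

import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 4

# Synthetic "task update": a rank-4 signal plus small full-rank noise
signal = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
delta_w = signal + 0.01 * rng.normal(size=(d, k))

# Truncated SVD gives the best rank-r approximation (Eckart–Young)
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
approx = (U[:, :r] * S[:r]) @ Vt[:r, :]

rel_err = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
print(f"relative error of rank-{r} approximation: {rel_err:.4f}")  # well under 1%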

Initialization & Scaling

During training, A is initialized with random Gaussian noise, and B is initialized to all zeros — so ΔW = BA = 0 at the start, meaning the model begins exactly as the pretrained base. The output is scaled by α/r, where α is a tunable hyperparameter:

// Forward pass with LoRA
h = W₀x + ΔWx
h = W₀x + BAx × (α / r)

// W₀ is frozen (no gradient flows through it)
// Only A and B accumulate gradients
// α controls the magnitude of the adaptation
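Putting the initialization and scaling together, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative only, not the PEFT implementation; the init standard deviation is one reasonable choice, not prescribed by the paper):

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base linear layer plus a trainable low-rank update
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))  # Gaussian init, per the paper
        self.B = nn.Parameter(torch.zeros(d, r))  # zeros, so BA = 0 at step 0
        nn.init.normal_(self.A, std=1.0 / math.sqrt(r))
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0·x + (α/r) · B·A·x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling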

Architecture: What Gets Adapted

[Figure: a transformer layer with LoRA adapters. The frozen base path runs Token Embedding → LayerNorm → attention projections W_Q, W_K, W_V, W_O (each d×d) → FFN/MLP → LayerNorm. Each projection carries a trainable adapter pair A (r×d) and B (d×r), and the outputs are merged as h = (W + BA)x, so inference cost is the same as the base model.]

LoRA adapters are typically applied to the query and value projection matrices in self-attention (W_Q, W_V). The original paper showed this is often sufficient; many practitioners now also adapt W_K, W_O, and sometimes feed-forward layers.
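The module names to target vary by architecture. A quick way to inspect them, assuming the Hugging Face transformers library (from_config builds the architecture with random weights, so nothing is downloaded beyond the config):

from transformers import AutoConfig, AutoModelForCausalLM

# Instantiate the architecture from its config alone to list module names
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_config(config)

proj_names = {name.split(".")[-1] for name, _ in model.named_modules()
              if name.endswith("_proj")}
print(sorted(proj_names))
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']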

Choosing the Rank r

The rank r is the single most important hyperparameter. It controls the expressivity–efficiency tradeoff:

Rank vs trainable parameters per weight matrix (d = k = 4096, LLaMA-style):

| Rank | Trainable params per matrix |
|---|---|
| r = 1 | ~8K |
| r = 4 | ~32K |
| r = 8 | ~65K |
| r = 16 | ~131K |
| r = 64 | ~524K |
| Full FT | ~16.7M |

A 7B model has ~128 such matrices in its attention layers (32 layers × 4 projections); counting the feed-forward weights as well brings the total to ~224.
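These counts follow directly from r × (d + k) trainable parameters per adapted matrix, so a few lines of Python reproduce the table:

d = k = 4096
for r in (1, 4, 8, 16, 64):
    print(f"r={r:>2}: {r * (d + k):>9,} trainable params per matrix")
print(f"full fine-tuning: {d * k:,} params per matrix")  # 16,777,216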

Practical guidance

For most instruction-following and domain-adaptation tasks, r = 4 to r = 16 works well. Complex reasoning tasks or tasks requiring significant behavioral shifts may benefit from r = 32 or r = 64. Beyond that, you approach full fine-tuning and lose the efficiency benefits without significant quality gain.

PEFT Family: LoRA vs Alternatives

| Method | Trainable Params | Inference Overhead | Key Idea | Best For |
|---|---|---|---|---|
| LoRA | ~0.1–1% | None (merge) | Low-rank ΔW decomposition | General purpose, production |
| QLoRA | ~0.1–1% | + dequant cost | LoRA + 4-bit NormalFloat quantization | Consumer GPUs; 65B models on 48 GB |
| Prefix Tuning | ~0.1% | + prefix tokens | Prepend trainable prefix tokens | Generation tasks, NLG |
| Prompt Tuning | <0.01% | + soft tokens | Learnable soft prompt embeddings | Very large models (>10B) |
| Adapter Layers | ~0.5–3% | + sequential bottleneck | Small FFN inserted after each layer | Multi-task learning |
| IA³ | <0.1% | Minimal | Scale activations with learned vectors | Few-shot settings |
| DoRA | ~0.1–1% | None (merge) | Decomposes ΔW into magnitude + direction | Improved over LoRA on some tasks |

Implementing LoRA in Practice

The Hugging Face PEFT library makes LoRA nearly plug-and-play. Below is a minimal working example for fine-tuning a Mistral-7B model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load the base model (get_peft_model below freezes its weights)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Define LoRA config
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # rank
    lora_alpha=32,   # scaling: alpha/r = 2.0
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# Wrap model — only adapter params require grad
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints e.g.: trainable params: ~13.6M || all params: ~7.26B (≈0.19%)

# After training, merge and unload for zero-latency inference
model = model.merge_and_unload()
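If you want the tiny per-task checkpoint described earlier rather than a merged model, save the adapter before merging; it can later be reattached to a freshly loaded base model (the output path here is illustrative):

# Before merge_and_unload(): writes only adapter weights + config (tens of MB)
model.save_pretrained("mistral-7b-lora-mytask")

# Later: attach the saved adapter to a fresh copy of the base model
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "mistral-7b-lora-mytask")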

QLoRA: Pushing Further

QLoRA (Dettmers et al., 2023) adds 4-bit quantization of the frozen base model using a new data type called NormalFloat4 (NF4), which is information-theoretically optimal for normally distributed weights. The adapters themselves remain in bf16/fp32. This enables fine-tuning a 65B parameter model on a single 48 GB GPU.

// QLoRA memory breakdown for LLaMA-65B
Base model weights (NF4): ~32.5 GB
LoRA adapter weights (bf16): ~0.7 GB
Optimizer states (fp32): ~2.1 GB (adapters only)
Activations: ~6.0 GB
Total: ~41.3 GB ← fits in a 48 GB GPU

// vs Full FT in bf16: ~780 GB (≈10× 80 GB A100s)
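In code, QLoRA amounts to loading the base model through the transformers 4-bit quantization config and then applying the same LoraConfig as before (a sketch assuming the bitsandbytes package is installed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 for the frozen weights
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # adapters/activations run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
# Then wrap with get_peft_model(model, lora_config) exactly as above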

Merging & Deployment

One of LoRA’s most elegant properties is that, at inference time, the adapter can be merged back into the base weights: W′ = W + BA. The merged model is mathematically identical to a fully fine-tuned model with the same ΔW, but it was far cheaper to produce. No extra inference latency, no architectural changes.

Alternatively, adapters can be kept separate and hot-swapped at runtime — enabling a single GPU server to host one base model and switch between dozens of specialized task adapters, each ~10–50 MB, with minimal overhead.

Deployment patterns

  • Merged (single model): W′ = W + BA stored as one checkpoint. No overhead. Best for single-task production. The checkpoint is about the same size as the base model.
  • Adapter serving: the base model is shared in memory and adapters are loaded per request. Enables multi-tenant model serving with small per-adapter storage (~10–50 MB).
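PEFT supports the adapter-serving pattern directly: load the base model once, register several adapters under distinct names, and switch between them per request (adapter paths and names below are hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto"
)

# One shared base model, multiple named adapters
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")
model.load_adapter("adapters/sql", adapter_name="sql")

model.set_adapter("summarize")  # route a request to the summarization adapter
model.set_adapter("sql")        # ...or swap to the SQL adapter in place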

Hyperparameter Guide

| Parameter | Typical Range | Effect |
|---|---|---|
| r (rank) | 4, 8, 16, 32, 64 | Controls expressivity. Start with 8 or 16. Lower saves memory; higher may improve complex tasks. |
| lora_alpha | 16–64 (often 2×r) | Scales adapter outputs by α/r. Higher α amplifies adapter influence. Common heuristic: α = 2r. |
| lora_dropout | 0.0–0.1 | Regularization. Use 0.05 for small datasets; 0.0 for large datasets. |
| target_modules | q_proj, v_proj (minimum); all attention projections (more) | Which weight matrices to adapt. More modules means more parameters and often better quality at higher cost. |
| Learning rate | 1e-4 to 3e-4 | Higher than full FT is usually safe since adapter weights are small. Use a cosine schedule. |

Key papers — Hu et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models” · Dettmers et al. (2023) “QLoRA: Efficient Finetuning of Quantized LLMs” · Ding et al. (2023) “Delta Tuning: A Comprehensive Study of Parameter Efficient Methods”