Full fine-tuning a 7B model means storing gradients and optimizer state for all seven billion weights: on the order of 100 GB of GPU memory at ~16 bytes per parameter, plus a fresh multi-gigabyte checkpoint per task. LoRA freezes the original weights and learns a tiny low-rank correction instead, often under 1% of the model's parameters, for roughly the same quality on many tasks.
The big frozen matrix W₀ never changes. Bolted beside it, two skinny matrices B and A form a low-rank update. Only those skinny matrices are trained — and afterwards you can fold them right back into the weights.
Picture the pretrained model as a finished sculpture. Full fine-tuning hands you a chisel and lets you re-carve the whole statue — powerful, but you need a fresh block of marble for every variation. LoRA leaves the statue untouched and crafts a small custom attachment.
You store optimizer state and gradients only for the tiny attachment. The LoRA paper reports a roughly 3× reduction in GPU memory vs. full fine-tuning of GPT-3 175B, and ~10,000× fewer trainable parameters.
A LoRA adapter for a 7B model is often just a few MB to a few tens of MB, depending on the rank and how many layers it targets. Keep one frozen base and a whole library of swappable adapters: one per customer, persona, or task.
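To see where a figure like that comes from, here is a rough size estimate. The shapes are assumptions chosen for illustration (32 layers, hidden size 4096, adapters on two square projections per layer, 16-bit storage); real adapters vary with rank and target modules.

```python
def lora_adapter_size(n_layers, d_model, r, matrices_per_layer=2, bytes_per_param=2):
    """Estimate parameter count and byte size of a LoRA adapter.

    Assumes every adapted matrix is square (d_model x d_model), so each one
    adds A (r x d_model) and B (d_model x r).
    """
    params_per_matrix = r * (d_model + d_model)
    total_params = n_layers * matrices_per_layer * params_per_matrix
    return total_params, total_params * bytes_per_param


# Hypothetical 7B-class shape: 32 layers, hidden size 4096, rank 8.
adapter_params, adapter_bytes = lora_adapter_size(n_layers=32, d_model=4096, r=8)
print(f"{adapter_params:,} params, {adapter_bytes / 2**20:.1f} MiB")
```

Under these assumptions the adapter holds about 4.2M parameters and takes 8 MiB; raising the rank or adapting more matrices scales the size linearly.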
Unlike older adapter layers that add latency, a trained LoRA adapter merges back into the base weight (W' = W₀ + BA, with any α/r scaling folded into the product). The merged model has the same shape as the original, so deployed inference runs at the same speed.
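The merge is plain matrix arithmetic, which a toy example can confirm. This is a minimal sketch with made-up 3×3 numbers and rank 1, using lists of rows instead of a tensor library; it checks that the two-path forward pass and the merged weight give the same output.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [3.0, 0.0, 1.0]]  # frozen base weight
B = [[1.0], [0.0], [2.0]]   # 3 x r with r = 1
A = [[0.5, -1.0, 2.0]]      # r x 3
x = [[1.0], [2.0], [3.0]]   # input as a column vector

# Unmerged: base path plus adapter path, computed separately.
y_unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))

# Merged: fold B·A into the weight once, then do a single matmul.
W_merged = add(W0, matmul(B, A))
y_merged = matmul(W_merged, x)

print(y_unmerged, y_merged)
```

Both paths produce the same vector, which is why the merged model needs no extra work at inference time.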
PEFT — Parameter-Efficient Fine-Tuning — is the umbrella for this whole family. LoRA is its most popular member, sitting alongside prompt tuning, prefix tuning, classic adapters, and IA³. LoRA won the popularity contest because it is simple, mergeable, and competitive in quality.
A full fine-tune learns a full-size update ΔW the same shape as the weight. LoRA bets that update is low-rank, so it factors it: ΔW = B·A, where B is d×r and A is r×k for a d×k weight. As the rank r shrinks, the trainable parameter count collapses.
For a 4096×4096 layer at r=8 that's 8·(4096+4096) = 65,536 trainable params (the entries of A and B) versus 16,777,216 in the full matrix, a 256× reduction per matrix. The whole bet: small r suffices, because the fine-tuning update lives in a small subspace.
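The same arithmetic for a range of ranks, as a small sketch (the function name is just for illustration):

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full update params, LoRA params, reduction factor) for one matrix."""
    full = d_out * d_in          # every entry of the update is trainable
    lora = r * (d_out + d_in)    # B is d_out x r, A is r x d_in
    return full, lora, full / lora


for r in (1, 4, 8, 16, 64):
    full, lora, ratio = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:>9,} vs {full:,} ({ratio:.0f}x fewer)")
```

The reduction factor falls linearly as r grows, so doubling the rank doubles the adapter's parameter count.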
LoRA initializes A with random Gaussian noise and B with zeros. So at step 0, ΔW = B·A = 0·A = 0: the adapter contributes nothing, and the model behaves exactly like the pretrained base. Training then moves B, and with it ΔW, off zero.
If both A and B were random, the adapter would inject noise on top of a carefully pretrained model from the very first step, a destabilising kick. Starting from exactly the base model means training begins at a known-good point.
This is why peft's default init_lora_weights=True zero-initializes B. The α/r scaling factor (set through lora_alpha) controls how strongly the adapter's update is weighted relative to the base, and the merged update can be rescaled afterwards without retraining. A common heuristic is α = 2r.
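Here is a minimal sketch of the no-op init and the α/r scaling in plain Python. The helper names and the 0.02 standard deviation are assumptions for illustration, and this is not how peft implements it internally.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def init_lora(d_out, d_in, r, std=0.02, seed=0):
    """A gets small Gaussian noise, B gets zeros (std is an assumed value)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(W0, A, B, x, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) for a column vector x."""
    base = matmul(W0, x)
    delta = matmul(B, matmul(A, x))
    scale = alpha / r
    return [[b[0] + scale * d[0]] for b, d in zip(base, delta)]


W0 = [[1.0, 2.0], [3.0, 4.0]]
x = [[1.0], [-1.0]]
r = 1
A, B = init_lora(d_out=2, d_in=2, r=r)

# With B = 0 the adapter is a no-op, whatever A and alpha are.
y_init = lora_forward(W0, A, B, x, alpha=2 * r, r=r)
print(y_init == matmul(W0, x))  # True

# Pretend training moved B off zero (made-up values): the output now shifts.
B_trained = [[0.5], [-0.5]]
print(lora_forward(W0, A, B_trained, x, alpha=2 * r, r=r))
```

Because the scale multiplies the whole B·A term, changing alpha rescales the adapter's contribution without touching A or B.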
Almost all the savings come from not storing gradients and optimizer state for the frozen weights. The GPU memory budgets for the three approaches depend on model size and rank; treat the figures here as illustrative back-of-envelope estimates, not vendor benchmarks.
Full FT carries gradients + Adam moments for every weight (~16 bytes/param). LoRA carries them only for the thin adapter. QLoRA additionally squeezes the frozen base to 4 bits.
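A back-of-envelope calculator makes the comparison concrete. This is a sketch under simplifying assumptions: ~16 bytes per trained parameter as above, 2 bytes per frozen 16-bit weight, 0.5 bytes per frozen 4-bit weight, and no accounting for activations or framework overhead. The adapter size is a hypothetical value.

```python
def finetune_memory_gb(n_params, adapter_params=0, method="full"):
    """Rough GPU memory (GB) for weights, gradients and Adam state only."""
    trainable_bytes = 16      # weights + gradients + Adam moments, per the text
    frozen_16bit_bytes = 2
    frozen_4bit_bytes = 0.5
    if method == "full":
        total = n_params * trainable_bytes
    elif method == "lora":
        total = n_params * frozen_16bit_bytes + adapter_params * trainable_bytes
    elif method == "qlora":
        total = n_params * frozen_4bit_bytes + adapter_params * trainable_bytes
    else:
        raise ValueError(f"unknown method: {method}")
    return total / 1e9


n_params = 7_000_000_000      # a 7B model
adapter_params = 4_194_304    # hypothetical rank-8 adapter

for method in ("full", "lora", "qlora"):
    print(f"{method:>5}: {finetune_memory_gb(n_params, adapter_params, method):6.1f} GB")
```

With these inputs the estimate is about 112 GB for full fine-tuning, 14 GB for LoRA and 3.6 GB for QLoRA; activation memory comes on top in all three cases.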
LoRA freezes W₀, but W₀ still has to live in GPU memory: in 16-bit, that is ~130 GB for a 65B model. QLoRA shrinks the frozen base to 4 bits while keeping the trainable adapters in 16-bit precision, letting you fine-tune a 65B model on a single 48 GB GPU.
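Weights are roughly normally distributed. NF4 places its 16 quantization levels so that each bin holds equal probability mass under a normal distribution, which the QLoRA paper describes as information-theoretically optimal for normally distributed data, so it loses less than naive 4-bit integer quantization.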
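The equal-mass idea can be sketched with the standard library. This is a simplified illustration of quantile-based levels with absmax block scaling, not the exact NF4 table from the QLoRA paper (which, among other differences, reserves an exact zero). The function names and sample weights are made up.

```python
from statistics import NormalDist


def equal_mass_levels(bits=4):
    """Return 2**bits levels, one per equal-probability bin of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    # Bin i covers probability [i/n, (i+1)/n); take the quantile at its midpoint.
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def quantize_block(weights, levels):
    """Absmax-scale one block, then store the index of the nearest level."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Map each 4-bit code back to an approximate weight."""
    return [levels[c] * scale for c in codes]


levels = equal_mass_levels()
block = [0.10, -0.40, 0.25, 0.80]  # made-up weights
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
print(codes, [round(v, 3) for v in restored])
```

The levels cluster near zero, where most normally distributed weights sit, and spread out toward the tails.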
Quantization needs per-block scaling constants. QLoRA quantizes the quantization constants too (double quantization), shaving roughly another 0.37 bits per parameter on average.
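The saving follows from simple overhead arithmetic. The sketch below uses the settings reported in the QLoRA paper: 64 weights per block with a 32-bit constant each, then those constants quantized to 8 bits in groups of 256 with one 32-bit second-level constant per group.

```python
BLOCK = 64       # weights sharing one first-level scaling constant
DQ_BLOCK = 256   # first-level constants sharing one second-level constant

# Without double quantization: one 32-bit constant per block of weights.
plain_bits = 32 / BLOCK

# With double quantization: 8-bit constants plus a small second-level overhead.
double_bits = 8 / BLOCK + 32 / (BLOCK * DQ_BLOCK)

saved_bits = plain_bits - double_bits
print(f"{plain_bits:.3f} -> {double_bits:.3f} bits/param (saves {saved_bits:.3f})")
```

On a 65B model, 0.373 bits per parameter works out to roughly 3 GB.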
Using NVIDIA unified memory, optimizer state is paged between GPU and CPU to absorb memory spikes (long sequences) that would otherwise OOM-crash the run.
The forward pass dequantizes the 4-bit base on the fly to compute W₀·x, but gradients flow only into A and B, which stay in bf16. The subtle, beautiful part: the base is lossy 4-bit, yet the trainable adapters are kept at higher precision, so the model can learn to compensate for the quantization error during training. QLoRA's Guanaco models trained in 24 hours on one GPU and reached 99.3% of ChatGPT's level on the Vicuna benchmark (Dettmers et al., 2023).
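In the Hugging Face stack, the pieces described above map onto two config objects. This is a hedged configuration sketch, not a complete training script: the rank, alpha, dropout and especially the target module names are illustrative assumptions that depend on the model being adapted, and it requires `torch`, `transformers`, `peft` and `bitsandbytes` to be installed.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 base with double quantization; matmuls run in bf16 after dequantizing.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Trainable low-rank adapters; values below are assumptions, not recommendations.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                        # alpha = 2r heuristic
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # names vary by model architecture
    task_type="CAUSAL_LM",
)
```

The quantization config is passed when loading the base model and the LoRA config when wrapping it; the frozen base stays 4-bit while only the adapter weights receive gradients.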
A frozen base plus N tiny adapters is the architecture behind "an adapter per customer / per skill / per persona." Merging or swapping an adapter changes the model's behavior, with no second copy of the base needed.
Adapters even compose — peft supports linear, cat, ties, dare, and svd combinations — opening the door to model arithmetic: blend two skills, or subtract an unwanted behavior.
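The arithmetic behind the two simplest combinations can be shown on toy matrices. This sketch illustrates the underlying math only, with made-up numbers; it is not peft's implementation. A weighted sum of two adapter updates equals a single adapter built by concatenating the factors along the rank dimension, and a negative weight subtracts a behavior.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def scale(X, s):
    """Multiply every entry of X by the scalar s."""
    return [[s * v for v in row] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Two rank-1 adapters for a 2x2 weight (made-up values).
B1, A1 = [[1.0], [2.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[2.0, 4.0]]
w1, w2 = 0.5, -1.0   # blend half of skill 1, subtract skill 2

# Linear combination of the two full-size updates.
delta_linear = add(scale(matmul(B1, A1), w1), scale(matmul(B2, A2), w2))

# Concatenation: stack factors along the rank dimension (rank 1 + 1 = 2).
B_cat = [[w1 * v for v in r1] + [w2 * v for v in r2] for r1, r2 in zip(B1, B2)]
A_cat = A1 + A2
delta_cat = matmul(B_cat, A_cat)

print(delta_linear == delta_cat)  # True
```

The concatenated form keeps the result as a (higher-rank) adapter instead of a dense matrix, which is why it stays cheap to store.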
LoRA is one point in a larger design space. The methods differ in what gets trained and whether you can merge the result away.
| Method | What gets trained | Core idea | Mergeable? |
|---|---|---|---|
| Full fine-tune | all weights | re-train everything | n/a |
| LoRA | low-rank A,B per layer | low-rank update W₀+BA | ✅ yes |
| QLoRA | LoRA A,B; base is 4-bit | LoRA on a quantized base | ✅ after dequant |
| DoRA | magnitude m + LoRA on direction | split weight into size + direction | ✅ yes |
| Prefix / Prompt | a few "virtual token" vectors | steer via learned soft prompts | ❌ changes input |
| Adapters (Houlsby) | small bottleneck MLPs | insert trainable bottleneck modules | ❌ adds layers |
| IA³ | per-feature scaling vectors | rescale activations with learned vectors | ✅ cheap rescale |
Two refinements worth knowing: rsLoRA shows that the stable scaling is α/√r (not α/r), which keeps learning from stalling at high rank. DoRA splits each weight matrix into a magnitude vector (one scalar per column) and a unit-norm direction, applying LoRA only to the direction. It reports better accuracy, especially at low rank, with no extra inference cost once merged.
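Both ideas are small enough to sketch in a few lines. The function names are hypothetical, and the DoRA part shows only the magnitude/direction split using column-wise norms as in the DoRA paper, not the full training procedure. It assumes no column of W is all zeros.

```python
import math


def lora_scale(alpha, r, rank_stabilized=False):
    """Classic LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r


def dora_decompose(W):
    """Split W (a list of rows) into per-column magnitudes and unit-norm columns."""
    columns = list(zip(*W))
    magnitudes = [math.sqrt(sum(v * v for v in col)) for col in columns]
    unit_columns = [[v / m for v in col] for col, m in zip(columns, magnitudes)]
    direction = [list(row) for row in zip(*unit_columns)]
    return magnitudes, direction


# With alpha fixed, alpha / r shrinks much faster than alpha / sqrt(r) as r grows.
for r in (8, 64, 256):
    print(r, lora_scale(16, r), lora_scale(16, r, rank_stabilized=True))

W = [[3.0, 0.0], [4.0, 2.0]]
magnitudes, direction = dora_decompose(W)
print(magnitudes)  # [5.0, 2.0]
```

Multiplying each direction column by its magnitude recovers W; DoRA trains the magnitudes directly and applies the low-rank update to the direction part.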
LoRA has a capacity ceiling — the update is constrained to rank r. For tasks needing broad new knowledge, low rank can underfit and you'll see a gap. The rank-deficiency hypothesis is empirical, observed on certain models/tasks — not proven in general.
Research even argues the equivalence can be an illusion: LoRA can introduce "intruder dimensions" — singular directions absent from full fine-tuning — that hurt out-of-distribution generalization and worsen forgetting, even when in-task accuracy matches. Matching accuracy ≠ matching the underlying solution.
Choose LoRA / QLoRA when you're budget-constrained, need many specializations off one base, want a small shippable artifact, or are adapting style/skill. Choose full fine-tuning when you're injecting a large amount of new knowledge, have the compute, and every last quality point matters.
Hyperparameter sensitivity is real: quality depends on r, α, target_modules, learning rate, and the init scheme (PiSSA / LoftQ / EVA / LoRA-GA). There's no universal recipe — expect to sweep. QLoRA's 4-bit base is also genuinely lossy; for the highest-fidelity work, 16-bit LoRA or full FT can edge it out.
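A sweep usually starts by enumerating a grid. The sketch below only generates candidate configurations; every value in the search space is a hypothetical starting point, not a recommendation, and the training and evaluation loop is left out.

```python
from itertools import product

# Hypothetical search space; adjust names and ranges to your model and task.
search_space = {
    "r": [4, 8, 16, 64],
    "alpha_over_r": [1, 2],
    "learning_rate": [1e-4, 2e-4, 5e-4],
    "target_modules": [
        ("q_proj", "v_proj"),
        ("q_proj", "k_proj", "v_proj", "o_proj"),
    ],
}


def sweep_configs(space):
    """Yield one config dict per point in the grid, deriving lora_alpha from r."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cfg["lora_alpha"] = cfg.pop("alpha_over_r") * cfg["r"]
        yield cfg


configs = list(sweep_configs(search_space))
print(len(configs), configs[0])
```

Tying alpha to a multiple of r keeps the effective scale comparable across ranks, which makes the grid easier to read than sweeping alpha independently.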
The low-rank correction. The no-op init. The memory math. QLoRA's 4-bit base, the swappable adapter library, and the whole PEFT family — plus the honest places it bites. A frozen base and a thin trainable correction is how the entire open-weights ecosystem ships task-specific models. You now hold the cheap lever.
LoRA shrinks training cost. MoE shrinks inference cost — route each token to a few experts instead of the whole model.