openalicelabs / academy
COURSE: TRAIN-02 · LESSON: 02 · 02 · TOPIC: LoRA & PEFT · EST. READ: ~13 MIN
OPENALICE LABORATORIES · EDUCATION PATH · TRAINING 02 · 02

Fine-tune a giant on a tiny budget.

Fully fine-tuning a 7B model means storing gradients and optimizer state for all seven billion weights: on the order of 100 GB of GPU memory at roughly 16 bytes per parameter, plus a fresh multi-gigabyte checkpoint per task. LoRA freezes the original weights and learns a tiny, low-rank correction instead, often under 1% of the model's parameters, for roughly the same quality on many tasks.

FIG.00 — ΔW = B·A
FIG.0A — THE ONE IDEA · freeze W₀ · learn a thin correction ΔW = B·A

The big frozen matrix W₀ never changes. Bolted beside it, two skinny matrices B and A form a low-rank update. Only those skinny matrices are trained — and afterwards you can fold them right back into the weights.

FROZEN: W₀, the entire pretrained model, no gradients
TRAINED: A & B, a thin low-rank pair per layer
RANK r: the one big knob · typically 4–16
ADAPTER SIZE: often a few to a few tens of megabytes for a 7B model
INFERENCE TAX: zero · merge W' = W₀ + (α/r)·B·A after training
01 / 07
Intuition first · the sculpture

Don't re-carve the statue. Bolt on an attachment.

Picture the pretrained model as a finished sculpture. Full fine-tuning hands you a chisel and lets you re-carve the whole statue — powerful, but you need a fresh block of marble for every variation. LoRA leaves the statue untouched and crafts a small custom attachment.

MEMORY

~3× less GPU

You store optimizer state and gradients only for the tiny attachment. The LoRA paper reports a roughly 3× reduction in GPU memory vs. full fine-tuning of GPT-3 175B, and ~10,000× fewer trainable parameters.

SHIPPING ★

megabytes, not gigs

A LoRA adapter for a 7B model is often just a few to a few tens of MB, depending on the rank and on how many matrices you adapt. Keep one frozen base and a whole library of swappable adapters: one per customer, persona, or task.

SPEED

no inference tax

Unlike older adapter layers, which add latency, a trained LoRA adapter merges back into the weight: W' = W₀ + (α/r)·B·A. Merged inference runs at the same speed as the original model, because the architecture and the parameter count are unchanged.
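The merge claim can be checked on a toy layer. The sketch below is dependency-free Python; the matrix sizes, the values, and the helper names `matmul` and `add_scaled` are made up for illustration. It compares the unmerged path, W₀x + (α/r)·B·A·x, against a single merged matrix.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, s):
    """Return W + s * D, element-wise."""
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, D)]

# Toy sizes (assumed for illustration): d = 3 outputs, k = 2 inputs, rank r = 1.
W0 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # frozen weight, d x k
B = [[0.5], [-1.0], [2.0]]                   # d x r
A = [[1.0, 4.0]]                             # r x k
alpha, r = 2.0, 1
x = [[1.0], [2.0]]                           # one input column, k x 1

scale = alpha / r

# Unmerged: the base path plus the scaled low-rank path (two extra small matmuls).
h_adapter = add_scaled(matmul(W0, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the update into the weight once, then run a single matmul.
W_merged = add_scaled(W0, matmul(B, A), scale)
h_merged = matmul(W_merged, x)

print(h_adapter)  # [[14.0], [-16.0], [37.0]]
print(h_merged)   # [[14.0], [-16.0], [37.0]]
```

The merged matrix has the same d×k shape as W₀, which is why the deployed layer costs the same as the original one.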

PEFT — Parameter-Efficient Fine-Tuning — is the umbrella for this whole family. LoRA is its most popular member, sitting alongside prompt tuning, prefix tuning, classic adapters, and IA³. LoRA won the popularity contest because it is simple, mergeable, and competitive in quality.

02 / 07
The core equation · drag the rank

A wide matrix from two thin ones.

A full fine-tune learns a full-size update ΔW the same shape as the weight. LoRA bets that update is low-rank — so it factors it: ΔW = B·A. Drag the rank r and watch the trainable parameter count collapse.

FIG.02 — ΔW (d×k) ≈ B (d×r) · A (r×k) · drag r
W₀ frozen // d×k, no gradients
ΔW = B·A // B is d×r, A is r×k, r ≪ d,k
h = W₀x + (α/r)·B·A·x
RANK r: 8 (slider range r = 1 to r = 128)
READOUTS: full ΔW params (d·k) · LoRA params r·(d+k) · reduction per matrix

For a 4096×4096 layer at r=8 that's 65,536 trainable params versus 16,777,216 — a 256× reduction per matrix. The whole bet: small r suffices, because the fine-tuning update lives in a small subspace.
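The arithmetic above is easy to reproduce. This is a small sketch; the function name `lora_param_counts` is made up for illustration, and it counts a single d×k matrix only.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (full update params, LoRA params, reduction factor) for one d x k matrix."""
    full = d * k          # a dense update has the same shape as the weight
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora, full / lora

for r in (1, 8, 64, 128):
    full, lora, reduction = lora_param_counts(4096, 4096, r)
    print(f"r={r:>3}  full={full:,}  lora={lora:,}  reduction={reduction:.0f}x")
```

At r = 8 this gives 65,536 LoRA parameters against 16,777,216, the 256× figure quoted above. The reduction shrinks linearly as r grows.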

03 / 07
Initialization · the safe start

It starts as a perfect no-op.

LoRA initializes A with random Gaussian noise and B with zeros. So at step 0, ΔW = B·A = 0·A = 0 — the adapter contributes nothing, and the model behaves exactly like the pretrained base. Press play and watch training move it off zero.

FIG.03 — TRAINING STEPS · B leaves zero, ΔW comes alive
WHY ZEROS MATTER

If both A and B were random, the adapter would inject noise on top of a carefully pretrained model from the very first step — a destabilising kick. Starting from exactly the base model means training can only improve from a known-good point.

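This is why peft's default init_lora_weights=True zero-initializes B. (Under that default, A is drawn from a Kaiming-uniform distribution; init_lora_weights="gaussian" selects a Gaussian A as in the paper.) The α/r scaling factor (lora_alpha) sets the adapter's strength and keeps the size of the update roughly comparable when you change r, so you retune less. A common heuristic is α = 2r.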
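The no-op start can be seen on a toy layer. The sketch below uses only the standard library; the sizes, the seed, the standard deviations, and the helper names are arbitrary choices for illustration, not values from the LoRA paper or from peft.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W0, B, A, x, scale=1.0):
    """h = W0 x + scale * B (A x), with x given as a column (list of 1-element rows)."""
    base = matmul(W0, x)
    update = matmul(B, matmul(A, x))
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, rank = 4, 3, 2   # toy sizes, assumed for illustration
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
x = [[random.gauss(0, 1)] for _ in range(d_in)]

A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]     # random init
B_zero = [[0.0] * rank for _ in range(d_out)]                               # zero init
B_rand = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(d_out)]  # the risky alternative

base_out = matmul(W0, x)
safe_out = lora_forward(W0, B_zero, A, x)    # B = 0, so the update term vanishes
noisy_out = lora_forward(W0, B_rand, A, x)   # random B perturbs the base before any training

print("zero-init B matches base:", safe_out == base_out)
print("random B matches base:   ", noisy_out == base_out)
```

With B at zero the output equals the base model's output exactly, while a random B shifts it before a single gradient step has been taken.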

04 / 07
Where the savings live · play with it

Full fine-tune vs. LoRA vs. QLoRA.

Almost all the savings come from not storing gradients + optimizer state for the frozen weights. Pick a model size and a rank and watch the GPU memory budget for the three approaches — these are illustrative back-of-envelope estimates, not vendor benchmarks.

MODEL SIZE: 7B · 13B · 33B · 65B
RANK: slider from r = 4 to r = 64
MODEL: 13B params
READOUTS: base weights (16-bit) · base weights (4-bit · QLoRA) · trainable (LoRA, r=16)

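Full FT carries gradients + Adam moments for every weight: ~16 bytes/param in a common mixed-precision accounting (2 for bf16 weights, 2 for gradients, 4 for an fp32 master copy, 8 for the two Adam moments). LoRA carries that training state only for the thin adapter. QLoRA additionally squeezes the frozen base to 4 bits.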
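A back-of-envelope version of this comparison might look like the sketch below. The byte counts are assumptions: ~16 bytes per trained parameter, 2 bytes per frozen 16-bit weight, 0.5 bytes per frozen 4-bit weight. Activations, quantization constants, and framework overhead are ignored. The model shape (32 layers, width 4096, two adapted square matrices per layer) is a hypothetical 7B-like configuration, and the function names are made up for illustration.

```python
GB = 1e9  # decimal gigabytes

def adapter_params(n_layers, d_model, r, matrices_per_layer=2):
    """Trainable LoRA params, assuming square d_model x d_model target matrices."""
    return n_layers * matrices_per_layer * r * (d_model + d_model)

def training_memory_gb(n_params, n_trainable):
    """Rough weight + training-state memory for the three approaches, in GB."""
    state = n_trainable * 16                   # weights, grads, Adam state for the adapter
    return {
        "full_ft": n_params * 16 / GB,         # everything is trained
        "lora": (n_params * 2 + state) / GB,   # frozen 16-bit base + adapter state
        "qlora": (n_params * 0.5 + state) / GB,  # frozen 4-bit base + adapter state
    }

trainable = adapter_params(n_layers=32, d_model=4096, r=16)
estimate = training_memory_gb(n_params=7e9, n_trainable=trainable)

print(f"trainable params: {trainable:,}")
print(f"adapter file at 2 bytes/param: {trainable * 2 / 1e6:.1f} MB")
for name, gb in estimate.items():
    print(f"{name:>8}: {gb:6.1f} GB")
```

Under these assumptions the adapter's training state is a rounding error next to the base weights, which is why quantizing the frozen base is the next place to look for savings.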

FIG.04 — PEAK TRAINING MEMORY · est. GB
BARS: full fine-tune · LoRA (16-bit base) · QLoRA (4-bit base)
LEGEND: frozen base weights · trainable + optimizer state

05 / 07
QLoRA · the 4-bit base trick

Squeeze the frozen weights to 4 bits.

LoRA freezes W₀, but W₀ still has to live in GPU memory: at 16 bits that is ~130 GB for a 65B model. QLoRA shrinks the frozen base to 4 bits while keeping the trainable adapters in 16-bit precision, letting you fine-tune a 65B model on a single 48 GB GPU.

① NF4

4-bit NormalFloat

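Weights are roughly normally distributed. NF4 places its 16 quantization levels so that each bin holds roughly equal probability mass under a normal distribution. The QLoRA paper describes this as information-theoretically optimal for normally distributed data, so it loses less than naive 4-bit integer quantization.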
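The equal-mass idea can be sketched with the standard library. This is a simplified illustration, not the actual NF4 code table: the real data type is built asymmetrically so that it contains an exact zero, which this version does not. The function names and the sample block are made up.

```python
from statistics import NormalDist

def equal_mass_levels(n_levels=16):
    """One level per equal-probability bin of N(0, 1) (the bin median), rescaled to [-1, 1]."""
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]

def quantize_block(weights, levels):
    """Absmax-scale a non-zero block, then store the index of the nearest level per weight."""
    absmax = max(abs(w) for w in weights)
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]

levels = equal_mass_levels()
block = [0.3, -1.2, 0.05, 0.9]                 # hypothetical weights
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

print([round(v, 3) for v in levels])
print(codes, [round(v, 3) for v in restored])
```

The levels crowd together near zero, where most weights sit, and spread out in the tails. Each weight is stored as a 4-bit index plus one shared scale per block.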

② DOUBLE-QUANT

quantize the constants

Quantization needs per-block scaling constants. QLoRA quantizes those constants too, saving roughly 0.37 bits per parameter on average (Dettmers et al., 2023).
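The saving follows from simple bookkeeping. The sketch below uses the block sizes described in the QLoRA paper (64 weights per block, 8-bit first-level constants, 256 constants per second-level block with one 32-bit constant each); the function names are made up for illustration.

```python
def single_quant_overhead(block_size=64, const_bits=32):
    """Bits per weight spent on one full-precision scaling constant per block."""
    return const_bits / block_size

def double_quant_overhead(block_size=64, first_bits=8, second_block=256, second_bits=32):
    """Bits per weight when the per-block constants are themselves quantized."""
    first_level = first_bits / block_size                     # 8-bit constant per 64 weights
    second_level = second_bits / (block_size * second_block)  # 32-bit constant per 256 constants
    return first_level + second_level

before = single_quant_overhead()
after = double_quant_overhead()
print(f"overhead before: {before:.3f} bits/param")
print(f"overhead after:  {after:.3f} bits/param")
print(f"saved:           {before - after:.3f} bits/param")
```

For a 65B model, 0.37 bits per parameter works out to roughly 3 GB of GPU memory.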

③ PAGED OPT

spill to CPU RAM

Using NVIDIA unified memory, optimizer state is paged between GPU and CPU to absorb memory spikes (long sequences) that would otherwise OOM-crash the run.

FIG.05 — THE QLoRA FORWARD PASS · dequantize-on-the-fly
h = dequantize(W₀^NF4)·x + (α/r)·B·A·x

The forward pass dequantizes the 4-bit base on the fly to compute W₀·x, but gradients flow only into A and B, which stay in bf16. The subtle, beautiful part: the base is lossy 4-bit, yet the trainable adapters are kept at higher precision, so the model can learn to compensate for the quantization error during training. QLoRA's best Guanaco model needed 24 hours of fine-tuning on a single GPU and reached 99.3% of ChatGPT's performance level on the Vicuna benchmark (Dettmers et al., 2023).

06 / 07
One base · many specializations · live

Hot-swap behavior. Same base.

A frozen base plus N tiny adapters is the architecture behind "an adapter per customer / per skill / per persona." Click an adapter to merge it into the base and watch the behavior change — no second copy of the model needed.

FIG.06 — FROZEN BASE + SWAPPABLE ADAPTER · W' = W₀ + (α/r)·B·A
ADAPTER LIBRARY — pick one to mount
SAME PROMPT → DIFFERENT BEHAVIOR
MODELS IN MEMORY: 1 base
MOUNTED ADAPTER: none
EXTRA DISK: 0 MB
INFERENCE LATENCY: baseline (merged)

Adapters even compose. peft supports linear, cat, ties, dare, and svd combinations, opening the door to model arithmetic: blend two skills, or subtract an unwanted behavior.
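The algebra behind concatenation-style combining is short: stacking two adapters side by side gives a rank-2r adapter whose update is the sum of the two originals, since [B₁ B₂]·[A₁; A₂] = B₁A₁ + B₂A₂. The sketch below checks that identity on toy matrices. It illustrates the math only and is not peft's implementation (in peft, combining is exposed through `add_weighted_adapter`); the helper names and values are made up.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(x_row, y_row)] for x_row, y_row in zip(X, Y)]

def hconcat(X, Y):
    """Place Y to the right of X (more columns)."""
    return [x_row + y_row for x_row, y_row in zip(X, Y)]

def vconcat(X, Y):
    """Place Y below X (more rows)."""
    return X + Y

# Two hypothetical rank-1 adapters for the same 2 x 3 weight.
B1, A1 = [[1.0], [0.0]], [[2.0, 0.0, 1.0]]
B2, A2 = [[0.0], [3.0]], [[1.0, 1.0, 0.0]]

sum_of_updates = add(matmul(B1, A1), matmul(B2, A2))

B_cat = hconcat(B1, B2)   # 2 x 2: the combined adapter has rank up to 2
A_cat = vconcat(A1, A2)   # 2 x 3
cat_update = matmul(B_cat, A_cat)

print(sum_of_updates)  # [[2.0, 0.0, 1.0], [3.0, 3.0, 0.0]]
print(cat_update)      # [[2.0, 0.0, 1.0], [3.0, 3.0, 0.0]]
```

Scaling one adapter's B by a negative weight before concatenating subtracts its update instead, which is the "subtract an unwanted behavior" case.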

07 / 07
The PEFT family · and the honest caveats

The family, and where it bites.

LoRA is one point in a larger design space. They differ in what gets trained and whether you can merge it away.

| Method | What gets trained | Core idea | Mergeable? |
|---|---|---|---|
| Full fine-tune | all weights | re-train everything | n/a |
| LoRA | low-rank A, B per layer | low-rank update W₀ + BA | ✅ yes |
| QLoRA | LoRA A, B; base is 4-bit | LoRA on a quantized base | ✅ after dequant |
| DoRA | magnitude m + LoRA on direction | split weight into size + direction | ✅ yes |
| Prefix / Prompt | a few "virtual token" vectors | steer via learned soft prompts | ❌ changes input |
| Adapters (Houlsby) | small bottleneck MLPs | insert trainable bottleneck modules | ❌ adds layers |
| IA³ | per-feature scaling vectors | rescale activations with learned vectors | ✅ cheap rescale |

Two refinements worth knowing: rsLoRA argues that the stable scaling is α/√r (not α/r), which keeps learning from stalling at high rank. DoRA splits each weight matrix into a magnitude vector (one scalar per column) and a direction matrix, applying LoRA only to the direction. It reports better accuracy, especially at low rank, with no extra inference cost once merged.
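The difference between the two scaling rules is easiest to see numerically. A small sketch, with α fixed at an arbitrary 16 and made-up function names:

```python
import math

def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor, alpha / r."""
    return alpha / r

def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized scaling factor, alpha / sqrt(r)."""
    return alpha / math.sqrt(r)

alpha = 16
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  alpha/r={lora_scale(alpha, r):.4f}  alpha/sqrt(r)={rslora_scale(alpha, r):.4f}")
```

With α held fixed, α/r shrinks the update by 64× going from r = 4 to r = 256, while α/√r shrinks it by only 8×. That gap is what the rank-stabilized rule is meant to close.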

"ON PAR WITH FULL FT" IS NOT A LAW

LoRA has a capacity ceiling — the update is constrained to rank r. For tasks needing broad new knowledge, low rank can underfit and you'll see a gap. The rank-deficiency hypothesis is empirical, observed on certain models/tasks — not proven in general.

Research even argues the equivalence can be an illusion: LoRA can introduce "intruder dimensions" — singular directions absent from full fine-tuning — that hurt out-of-distribution generalization and worsen forgetting, even when in-task accuracy matches. Matching accuracy ≠ matching the underlying solution.

WHEN TO REACH FOR EACH

LoRA / QLoRA when you're budget-constrained, need many specializations off one base, want a small shippable artifact, or are adapting style/skill. Full fine-tuning when you're injecting a large amount of new knowledge, have the compute, and every last quality point matters.

Hyperparameter sensitivity is real: quality depends on r, α, target_modules, learning rate, and the init scheme (PiSSA / LoftQ / EVA / LoRA-GA). There's no universal recipe — expect to sweep. QLoRA's 4-bit base is also genuinely lossy; for the highest-fidelity work, 16-bit LoRA or full FT can edge it out.
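One way to organize such a sweep is a plain grid of trial configurations. The sketch below is an assumption-heavy illustration: the value ranges are arbitrary, the module names `q_proj`, `k_proj`, `v_proj`, and `o_proj` assume a Llama-style model, and `expand_grid` is a hypothetical helper. The keys `r`, `lora_alpha`, and `target_modules` mirror the peft `LoraConfig` fields of the same names.

```python
from itertools import product

# Hypothetical search space; tune the ranges to your own model and budget.
search_space = {
    "r": [4, 8, 16],
    "alpha_over_r": [1, 2],
    "learning_rate": [1e-4, 3e-4],
    "target_modules": [
        ("q_proj", "v_proj"),
        ("q_proj", "k_proj", "v_proj", "o_proj"),
    ],
}

def expand_grid(space):
    """Yield one config dict per combination, deriving lora_alpha from r."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        trial = dict(zip(keys, values))
        # Express alpha relative to r so the ratio is swept, not the raw value.
        trial["lora_alpha"] = trial["r"] * trial.pop("alpha_over_r")
        yield trial

trials = list(expand_grid(search_space))
print(len(trials), "trials")
print(trials[0])
```

Sweeping α as a multiple of r keeps the α/r ratio under direct control, so a change in rank does not silently change the adapter's effective strength.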

02 · 02 — you made it

You can fine-tune a giant.

The low-rank correction. The no-op init. The memory math. QLoRA's 4-bit base, the swappable adapter library, and the whole PEFT family — plus the honest places it bites. A frozen base and a thin trainable correction is how the entire open-weights ecosystem ships task-specific models. You now hold the cheap lever.
