OpenAlice Academy — 02 · 02 / LoRA & PEFT

01 / 07

Intuition first · the sculpture

Don't re-carve the statue. Bolt on an attachment.

Picture the pretrained model as a finished sculpture. Full fine-tuning hands you a chisel and lets you re-carve the whole statue — powerful, but you need a fresh block of marble for every variation. LoRA leaves the statue untouched and crafts a small custom attachment.

MEMORY

~3× less GPU

You store optimizer state and gradients only for the tiny attachment. The LoRA paper reports a roughly 3× reduction in GPU memory vs. full fine-tuning of GPT-3 175B, and ~10,000× fewer trainable parameters.

SHIPPING ★

megabytes, not gigs

A LoRA adapter for a 7B model is often just a few MB. Keep one frozen base and a whole library of swappable adapters — one per customer, persona, or task.

SPEED

no inference tax

Unlike older adapter layers that add latency, a trained LoRA adapter merges back into W (W' = W₀ + BA). Deployed inference runs at the exact speed of the original model.

PEFT — Parameter-Efficient Fine-Tuning — is the umbrella for this whole family. LoRA is its most popular member, sitting alongside prompt tuning, prefix tuning, classic adapters, and IA³. LoRA won the popularity contest because it is simple, mergeable, and competitive in quality.

02 / 07

The core equation · drag the rank

A wide matrix from two thin ones.

A full fine-tune learns a full-size update ΔW the same shape as the weight. LoRA bets that update is low-rank — so it factors it: ΔW = B·A. Drag the rank r and watch the trainable parameter count collapse.

FIG.02 — ΔW (d×k) ≈ B (d×r) · A (r×k) · drag r

W₀ frozen // d×k, no gradients ΔW = B·A // B is d×r, A is r×k, r ≪ d,k h = W₀x + (α/r)·B·A·x

r = 1 r = 128

RANK r8

FULL ΔW PARAMS (d·k)—

LoRA PARAMS r·(d+k)—

REDUCTION PER MATRIX—

For a 4096×4096 layer at r=8 that's 65,536 trainable params versus 16,777,216 — a 256× reduction per matrix. The whole bet: small r suffices, because the fine-tuning update lives in a small subspace.

03 / 07

Initialization · the safe start

It starts as a perfect no-op.

LoRA initializes A with random Gaussian noise and B with zeros. So at step 0, ΔW = B·A = 0·A = 0 — the adapter contributes nothing, and the model behaves exactly like the pretrained base. Press play and watch training move it off zero.

FIG.03 — TRAINING STEPS · B leaves zero, ΔW comes alive

step 0

WHY ZEROS MATTER

If both A and B were random, the adapter would inject noise on top of a carefully pretrained model from the very first step — a destabilising kick. Starting from exactly the base model means training can only improve from a known-good point.

This is why peft's default init_lora_weights=True means "B = 0". The α/r scaling factor (lora_alpha) then lets you dial the adapter's strength after training without redoing it — a common default is α = 2r.

04 / 07

Where the savings live · play with it

Full fine-tune vs. LoRA vs. QLoRA.

Almost all the savings come from not storing gradients + optimizer state for the frozen weights. Pick a model size and a rank and watch the GPU memory budget for the three approaches — these are illustrative back-of-envelope estimates, not vendor benchmarks.

MODEL SIZE

7B 13B 33B 65B

r = 4 r = 64

MODEL13B params

BASE WEIGHTS (16-bit)—

BASE WEIGHTS (4-bit · QLoRA)—

TRAINABLE (LoRA, r=16)—

Full FT carries gradients + Adam moments for every weight (~16 bytes/param). LoRA carries them only for the thin adapter. QLoRA additionally squeezes the frozen base to 4 bits.

FIG.04 — PEAK TRAINING MEMORY · est. GB

FULL FINE-TUNE—

LoRA · 16-bit base—

QLoRA · 4-bit base—

frozen base weights trainable + optimizer state

05 / 07

QLoRA · the 4-bit base trick

Squeeze the frozen weights to 4 bits.

LoRA freezes W₀, but W₀ still has to live in GPU memory — 16-bit, ~130 GB for a 65B model. QLoRA shrinks the frozen base to 4 bits while keeping the trainable adapters full-precision, letting you fine-tune a 65B model on a single 48 GB GPU.

① NF4

4-bit NormalFloat

Weights are roughly normally distributed. NF4 places its 16 quantization levels for equal probability mass per bin — information-theoretically optimal for normal data, so it loses less than naive 4-bit int.

② DOUBLE-QUANT

quantize the constants

Quantization needs per-block scaling constants. QLoRA quantizes the quantization constants too, shaving another fraction of a bit per parameter on average.

③ PAGED OPT

spill to CPU RAM

Using NVIDIA unified memory, optimizer state is paged between GPU and CPU to absorb memory spikes (long sequences) that would otherwise OOM-crash the run.

FIG.05 — THE QLoRA FORWARD PASS · dequantize-on-the-fly

h = dequantize(W₀^NF4)·x + (α/r)·B·A·x

The forward pass dequantizes the 4-bit base on the fly to compute W₀·x, but gradients flow only into A and B, which stay in bf16. The subtle, beautiful part: the base is lossy-4-bit, yet the trainable adapters are full-precision — so the model can learn to compensate for the quantization error during training. QLoRA's Guanaco models trained in 24 hours on one GPU and reached 99.3% of ChatGPT's level on the Vicuna benchmark (Dettmers et al., 2023).

06 / 07

One base · many specializations · live

Hot-swap behavior. Same base.

A frozen base plus N tiny adapters is the architecture behind "an adapter per customer / per skill / per persona." Click an adapter to merge it into the base and watch the behavior change — no second copy of the model needed.

FIG.06 — FROZEN BASE + SWAPPABLE ADAPTER · W' = W₀ + (α/r)·B·A

ADAPTER LIBRARY — pick one to mount

base only0 MB legal-assistant9 MB poet7 MB code-helper12 MB russian-tutor8 MB

SAME PROMPT → DIFFERENT BEHAVIOR

MODELS IN MEMORY1 base

MOUNTED ADAPTERnone

EXTRA DISK0 MB

INFERENCE LATENCYbaseline (merged)

Adapters even compose — peft supports linear, cat, ties, dare, and svd combinations — opening the door to model arithmetic: blend two skills, or subtract an unwanted behavior.

07 / 07

The PEFT family · and the honest caveats

The family, and where it bites.

LoRA is one point in a larger design space. They differ in what gets trained and whether you can merge it away.

Method	What gets trained	Core idea	Mergeable?
Full fine-tune	all weights	re-train everything	n/a
LoRA	low-rank A,B per layer	low-rank update W₀+BA	✅ yes
QLoRA	LoRA A,B; base is 4-bit	LoRA on a quantized base	✅ after dequant
DoRA	magnitude m + LoRA on direction	split weight into size + direction	✅ yes
Prefix / Prompt	a few "virtual token" vectors	steer via learned soft prompts	❌ changes input
Adapters (Houlsby)	small bottleneck MLPs	insert trainable bottleneck modules	❌ adds layers
IA³	per-feature scaling vectors	rescale activations with learned vectors	✅ cheap rescale

Two refinements worth knowing: rsLoRA proves the stable scaling is α/√r (not α/r), which keeps learning from stalling at high rank. DoRA splits each weight into a magnitude scalar and a unit-norm direction, applying LoRA only to the direction — better accuracy especially at low rank, with no extra inference cost once merged.

"ON PAR WITH FULL FT" IS NOT A LAW

LoRA has a capacity ceiling — the update is constrained to rank r. For tasks needing broad new knowledge, low rank can underfit and you'll see a gap. The rank-deficiency hypothesis is empirical, observed on certain models/tasks — not proven in general.

Research even argues the equivalence can be an illusion: LoRA can introduce "intruder dimensions" — singular directions absent from full fine-tuning — that hurt out-of-distribution generalization and worsen forgetting, even when in-task accuracy matches. Matching accuracy ≠ matching the underlying solution.

WHEN TO REACH FOR EACH

LoRA / QLoRA when you're budget-constrained, need many specializations off one base, want a small shippable artifact, or are adapting style/skill. Full fine-tuning when you're injecting a large amount of new knowledge, have the compute, and every last quality point matters.

Hyperparameter sensitivity is real: quality depends on r, α, target_modules, learning rate, and the init scheme (PiSSA / LoftQ / EVA / LoRA-GA). There's no universal recipe — expect to sweep. QLoRA's 4-bit base is also genuinely lossy; for the highest-fidelity work, 16-bit LoRA or full FT can edge it out.

02 · 02 — you made it

You can fine-tune
a giant.

The low-rank correction. The no-op init. The memory math. QLoRA's 4-bit base, the swappable adapter library, and the whole PEFT family — plus the honest places it bites. A frozen base and a thin trainable correction is how the entire open-weights ecosystem ships task-specific models. You now hold the cheap lever.

02·01 RLHF & alignment · teaching a model what humans prefer ✓ prev

02·02 LoRA & PEFT · cheap fine-tuning · low-rank, QLoRA, the family ✓ complete

02·06 Mixture-of-Experts · don't activate the whole model — route to a few experts next

01·06 Scaling laws · how much bigger, how much better, for how much compute related

Next · 02 · 06