openalicelabs / academy
COURSE ARCH-02 LESSON 02 · 03 TOPIC QUANTIZATION EST. READ ~12 MIN
OPENALICE LABORATORIES · EDUCATION PATH · EFFICIENCY 02 · 03

140 GB,
onto a
laptop.

A 70-billion-parameter model in native 16-bit weights needs ~140 GB just to hold the numbers — more than any consumer GPU has. Quantization stores each weight in 4 bits instead of 16, turning that 140 GB into ~40 GB, with surprisingly small quality loss. It is the single technique that put usable LLMs on hardware you actually own.

FIG.00 — 16 BITS → 4 BITS
loading…
FIG.0A — THE WHOLE PRIMITIVE · w ≈ scale · (q − zero) · q is a tiny integer

Continuous 16-bit weights (top dots) get snapped to the nearest of just 16 allowed levels — a 4-bit grid. Store only the level index (4 bits) plus one scale per block. The network barely notices, because it never depended on the 11th decimal of any single weight.

INPUTa trained model — billions of FP16 / BF16 weights
OUTPUTthe same model in 8 / 4 / 3 / 2 bits per weight
FP16 → INT4~4× smaller, ~4× less memory traffic
RULE OF THUMBQ4_K_M ≈ 4× smaller for ~0.5% benchmark loss
KINDpost-training — no gradient descent, minutes-to-hours
01 / 07
The one-sentence idea · intuition first

Fewer bits — but which rounding?

Store each weight with fewer bits, and choose those bits so the model's outputs barely change — not so the weights themselves barely change. Almost every clever method is a different answer to "fewer bits, but which rounding minimizes the output error?"

WHY IT WORKS

networks are robust

Each weight is one tiny vote in a sum over thousands. Nudge any one to its nearest allowed value and the noise partly cancels — like an 8-bit photo looking identical to a 24-bit one.

THE PRIMITIVE ★

round-to-nearest

Find a block's range, carve it into N bins (16 for 4-bit), snap every weight to its nearest bin. Keep the bin index + a scale & zero-point to reconstruct.

THE CATCH

naive RTN sags

Plain round-to-nearest at 4 bits noticeably degrades large LLMs. GPTQ, AWQ and GGUF's k-quants are all ways to be smarter than nearest so 4-bit (even 3-bit) stays usable.

// store q = round( (w − zero) / scale ) // tiny int, 4 bits // reconstruct w_approx = scale · q + zero // an approximation

That is the entire low-level mechanism. One scale per block of weights (often 128 of them), plus a 4-bit index each. Everything else in this lesson is about choosing q more cleverly than "nearest".

THE THREE PAYOFFS

Memory — the binding constraint; FP16→int4 is a ~4× cut, the difference between "needs 2× A100" and "fits one card". Speed — inference is memory-bandwidth-bound, so quarter the bytes ≈ quarter the traffic. Reach — GGUF + llama.cpp run an 8B on a Raspberry Pi, a ThinkPad, a phone.

02 / 07
Round-to-nearest · drag the precision

Watch weights snap to a grid.

A block of continuous weights, and a grid of allowed levels. Drag the bit-width and watch the grid coarsen, every weight jump to its nearest level, and the rounding error grow. This is round-to-nearest, the primitive every method improves on.

FIG.02 — RTN ON ONE BLOCK · TRUE WEIGHT vs SNAPPED VALUE
2-bit 8-bit

BITS PER WEIGHT
ALLOWED LEVELS
STORAGE vs FP16
MEAN ROUNDING ERROR

Halve the bits and you halve the storage but double the grid spacing — so the worst-case rounding error doubles too. The accent stems show how far each weight was nudged. The whole field is a fight to keep that error small at 4 bits.

03 / 07
The shared objective · weight error vs output error

Minimize the output, not the weights.

Every serious method quantizes one linear layer at a time and asks: keep W·X close, not W close. Some weights matter far more — because the activations that multiply them are large. Drag the activation on the big weight and watch which rounding choice the smart method makes.

FIG.03 — SAME ROUNDING, DIFFERENT ACTIVATION · OUTPUT ERROR FLIPS
w·x for two weights · |w·x − ŵ·x| is what matters
small x large x
// RTN minimizes weight error argmin ‖ W − Ŵ// GPTQ / AWQ minimize OUTPUT error argmin ‖ W·XŴ·X ‖²

The same 0.1 rounding error on a weight is harmless if its activation is tiny — and catastrophic if its activation is huge. Spend precision where it changes the output most.

GPTQ goes column-by-column: each time it quantizes a weight it nudges the remaining weights to compensate, using the activation curvature (the Hessian H = 2·X·Xᵀ). By the last column, every earlier mistake has been absorbed — error-correcting rounding. A 175B model quantizes to 3–4 bits in ~4 GPU-hours.

04 / 07
AWQ · protect salient weights by scaling

Scale it up, then round it.

AWQ's insight: only ~1% of weights matter, and you spot them by the activations, not the weights. Its trick avoids ugly mixed precision — scale a salient weight up by s before quantizing and scale its activation down by 1/s. The product is unchanged, but the larger weight uses more of the grid, so its relative rounding error shrinks. Sweep the scale.

FIG.04 — A SALIENT WEIGHT ON THE 4-BIT GRID · BIGGER = SHARPER
α=0 · plain RTN α=1 · full scale
Q(w)·x → Q(w·s)·(x / s) s > 1 s = s_X α* = argmin ‖ Q(W·s)·(X/s) − W·X ‖
SCALE s = s_X^α
GRID SPACING (effective)
RELATIVE ROUNDING ERROR

No backprop, no Hessian — just activation statistics and a 1-D search over the single exponent α. Faster than GPTQ, and it generalizes better to instruction-tuned and multimodal models because it doesn't overfit the calibration set. It's the common default in vLLM / TensorRT-LLM serving.

05 / 07
GGUF + llama.cpp · k-quants for CPU / edge

Decode Q4_K_M.

GGUF is a file format; k-quants are its scheme — a different lineage, built for CPU and mixed CPU/GPU inference. It groups weights into super-blocks of 256, gives each sub-block its own tiny scale, then quantizes those scales too. Hover a row below — real Llama-3.1-8B numbers.

READING THE NAME

Q4_K_M = 4-bit · K-quant (two-level scaling) · Medium mix. The _S / _M / _L suffix says how aggressively the sensitive tensors (embeddings, lm_head) are kept at higher bit-width while the bulk feed-forward goes low.

imatrix ("importance matrix") is an optional calibration pass that — like AWQ/GPTQ — weights the quantization by which weights matter to real text. It rescues the low-bit quants (Q2/Q3) and is now standard for good GGUF releases.

Down to ~4–5 bits the loss is fractions of a percent. Below ~3 bits it falls off a cliff — reasoning and math degrade first.

FIG.05 — Llama-3.1-8B · llama.cpp · the accuracy ⇄ size tradeoff
SchemebpwSizeΔsizePerplexityBench loss
06 / 07
Make it concrete · will it fit?

A real memory calculator.

Memory is the binding constraint. Pick a model size and a quantization, and watch the weights shrink against the hardware you might actually have — including the blal.de box (62 GB RAM, no GPU).

FIG.06 — WEIGHT MEMORY vs HARDWARE · DOES IT FIT?
MODEL PARAMETERS
8B
QUANTIZATION
PARAMETERS
BITS / WEIGHT
WEIGHT MEMORY
vs FP16

Weights only — real serving also needs the KV-cache + activations, so leave headroom. On a CPU box GGUF is runnable, not fast: "runs on my laptop" and "serves production traffic" are different bars.

07 / 07
The three methods · and the honest caveats

The family, and what to actually do.

Three lineages, one goal. They differ in how they decide the rounding.

MethodIdeaNeedsBest atUsed by
GPTQerror-correcting rounding via the Hessian; nudge later columns to absorb earlier errorsGPU + calibration dataraw perplexity, big-GPU servingmany 4-bit HF checkpoints
AWQscale up salient weights (found via activations) before rounding; no backpropcalibration data onlyinstruction-tuned & multimodalvLLM, TensorRT-LLM defaults
GGUF k-quantstwo-level block scales + per-tensor mixed precision; optional imatrixllama.cpp, CPU-friendlyCPU / edge / "runs everywhere"llama.cpp, the local-LLM world
"NEAR-LOSSLESS" IS BENCHMARK-NEAR-LOSSLESS

A 0.5% average drop hides that quantization degrades unevenly: long-context retrieval, multi-step math, and rare-token handling suffer first. Q3_K_S collapsed GSM8K reasoning by 9.3 points in one study. Always eval on your task — perplexity is a weak proxy.

Same theme: the outlier problem is the central villain — a handful of feature dimensions have huge magnitudes that blow up the quantization range. AWQ's scaling, GGUF's per-sub-block scales, and GPTQ's error feedback are all ways of taming outliers.

WHAT TO REACH FOR

Q4_K_M is the default — ~4× smaller, ~0.5% loss, use unless you have a reason not to. Q5_K_M / Q6_K when you have the RAM and want near-lossless. Q8_0 when memory isn't the constraint (quality basically free). Q3 / Q2 only when desperate, and only with an imatrix.

Two more axes: this is all weight-only PTQ — activations stay FP16. Quantizing activations too (W8A8, SmoothQuant, FP8) is harder but unlocks compute, not just memory. And quantize-then-LoRA = QLoRA: fine-tune on frozen 4-bit weights.

02 · 03 — you made it

You shrank
a model.

The grid and round-to-nearest. Output error, not weight error. GPTQ's error correction, AWQ's scaling, GGUF's k-quants, and a memory calculator that says whether it fits. You now know how a 140 GB model lands on hardware you own. You hold the lever that democratized local LLMs.

02·01 RLHF & Alignment · reward models, PPO, DPO ✓ done
02·02 LoRA & PEFT · fine-tune billions by training a few soon
02·03 Quantization · FP16 → 4-bit · GPTQ, AWQ, GGUF ✓ complete
02·04 Scaling Laws · the math that predicts performance from compute next
Next · 02 · 04

Scaling Laws →

You can now run big models small. Next: the math that predicts exactly how good a model will be from its compute, data and parameter budget.

openalicelabs