OpenAlice Academy — 02 · 03 / Quantization

01 / 07

The one-sentence idea · intuition first

Fewer bits — but which rounding?

Store each weight with fewer bits, and choose those bits so the model's outputs barely change — not so the weights themselves barely change. Almost every clever method is a different answer to "fewer bits, but which rounding minimizes the output error?"

WHY IT WORKS

networks are robust

Each weight is one tiny vote in a sum over thousands. Nudge every weight to its nearest allowed value and the rounding errors partly cancel in that sum, much as a photo can look nearly identical at a lower colour depth.

THE PRIMITIVE ★

round-to-nearest

Find a block's range, carve it into N bins (16 for 4-bit), snap every weight to its nearest bin. Keep the bin index + a scale & zero-point to reconstruct.

THE CATCH

naive RTN sags

Plain round-to-nearest at 4 bits noticeably degrades large LLMs. GPTQ, AWQ and GGUF's k-quants are all ways to be smarter than nearest so 4-bit (even 3-bit) stays usable.

// store
q = round( (w − zero) / scale )    // tiny int, 4 bits

// reconstruct
w_approx = scale · q + zero        // an approximation

That is the entire low-level mechanism. One scale per block of weights (often 128 of them), plus a 4-bit index each. Everything else in this lesson is about choosing q more cleverly than "nearest".
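The store/reconstruct pair above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function names, the sample block, and the choice of the block minimum as the zero-point are assumptions made for the example.

```python
def quantize_block(weights, bits=4):
    """Round-to-nearest on one block: returns (indices, scale, zero)."""
    levels = 2 ** bits                        # 16 levels for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (levels - 1) or 1.0   # guard against a constant block
    zero = lo                                 # assumption: zero-point = block minimum
    q = [round((w - zero) / scale) for w in weights]
    return q, scale, zero


def dequantize_block(q, scale, zero):
    """Reconstruct approximate weights from indices, scale and zero-point."""
    return [scale * i + zero for i in q]


# A made-up block of eight weights.
block = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.18, 0.49]
q, scale, zero = quantize_block(block, bits=4)
approx = dequantize_block(q, scale, zero)

# No weight moves further than half a grid step.
max_err = max(abs(w - a) for w, a in zip(block, approx))
```

Only `q` (4 bits per weight) plus one `scale` and one `zero` per block need to be stored, which is where the saving comes from.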

THE THREE PAYOFFS

Memory: the binding constraint. FP16→int4 is a ~4× cut, the difference between "needs 2× A100" and "fits one card".

Speed: token-by-token decoding is usually memory-bandwidth-bound, so a quarter of the bytes means roughly a quarter of the weight traffic.

Reach: GGUF + llama.cpp run an 8B model on a Raspberry Pi, a ThinkPad, a phone.

02 / 07

Round-to-nearest · drag the precision

Watch weights snap to a grid.

A block of continuous weights, and a grid of allowed levels. Drag the bit-width and watch the grid coarsen, every weight jump to its nearest level, and the rounding error grow. This is round-to-nearest, the primitive every method improves on.

FIG.02 — RTN ON ONE BLOCK · TRUE WEIGHT vs SNAPPED VALUE

[Interactive figure: bit-width slider from 2-bit to 8-bit. Readouts: bits per weight, allowed levels, storage vs FP16, mean rounding error.]

Each bit you drop halves the number of levels and roughly doubles the grid spacing, so the worst-case rounding error roughly doubles too; going from 8 bits to 4 halves the storage but makes the grid about 17× coarser. The accent stems show how far each weight was nudged. The whole field is a fight to keep that error small at 4 bits.
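The arithmetic behind the slider can be sketched as follows. The helper name `grid_stats` and the unit-width block range are assumptions for illustration; the formulas are just levels = 2^bits, spacing = range / (levels − 1), worst-case error = spacing / 2.

```python
def grid_stats(bits, value_range=1.0):
    """Return (levels, grid spacing, worst-case rounding error) for one block."""
    levels = 2 ** bits
    step = value_range / (levels - 1)   # distance between neighbouring levels
    return levels, step, step / 2       # RTN never moves a weight more than step/2


# Sweep the bit-width the way the slider does, from fine to coarse.
for bits in range(8, 1, -1):
    levels, step, worst = grid_stats(bits)
    print(f"{bits}-bit: {levels:3d} levels, step {step:.4f}, "
          f"worst-case error {worst:.4f}, storage {bits / 16:.3f}x FP16")

# Ratios of grid spacing between bit-widths.
one_bit_drop = grid_stats(3)[1] / grid_stats(4)[1]    # 4-bit -> 3-bit
eight_to_four = grid_stats(4)[1] / grid_stats(8)[1]   # 8-bit -> 4-bit
```

Storage falls linearly with the bit-width while the grid spacing grows exponentially, which is why the last few bits are the expensive ones to remove.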

03 / 07

The shared objective · weight error vs output error

Minimize the output, not the weights.

Every serious method quantizes one linear layer at a time and asks: keep W·X close, not W close. Some weights matter far more — because the activations that multiply them are large. Drag the activation on the big weight and watch which rounding choice the smart method makes.

FIG.03 — SAME ROUNDING, DIFFERENT ACTIVATION · OUTPUT ERROR FLIPS

w·x for two weights · |w·x − ŵ·x| is what matters

[Interactive figure: activation slider from small x to large x.]

// RTN minimizes weight error
argmin ‖ W − Ŵ ‖

// GPTQ / AWQ minimize OUTPUT error
argmin ‖ W·X − Ŵ·X ‖²

The same 0.1 rounding error on a weight is harmless if its activation is tiny — and catastrophic if its activation is huge. Spend precision where it changes the output most.
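A tiny brute-force sketch makes the difference between the two objectives concrete. Everything here is made up for illustration (two weights, one activation vector, a grid spacing of 0.25), and exhaustive search only works at toy size; it is not how GPTQ or AWQ choose their roundings.

```python
import math
from itertools import product

STEP = 0.25  # hypothetical grid spacing


def candidates(w):
    """The two grid levels on either side of w."""
    lo = math.floor(w / STEP) * STEP
    return (lo, lo + STEP)


def rtn(w):
    """Round-to-nearest: minimizes the error on the weight itself."""
    return round(w / STEP) * STEP


weights = [0.37, 0.62]
acts = [4.0, 1.0]   # the first weight sees a much larger activation


def output_error(w_hat):
    """|w.x - w_hat.x|: the quantity that actually reaches the next layer."""
    true_out = sum(w * x for w, x in zip(weights, acts))
    return abs(true_out - sum(w * x for w, x in zip(w_hat, acts)))


rtn_choice = [rtn(w) for w in weights]
rtn_err = output_error(rtn_choice)

# Try every floor/ceil combination and keep the one with the smallest output error.
best_choice = min(product(*(candidates(w) for w in weights)), key=output_error)
best_err = output_error(best_choice)
```

In this example the output-aware choice rounds the second weight up to 0.75 instead of to its nearest level 0.5, because that partly offsets the error the large activation amplifies on the first weight.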

GPTQ goes column-by-column: each time it quantizes a weight it nudges the remaining weights to compensate, using the activation curvature (the Hessian H = 2·X·Xᵀ). By the last column, much of each earlier rounding error has been absorbed by the columns after it: error-correcting rounding. The GPTQ paper reports quantizing a 175B model to 3–4 bits in about 4 GPU-hours.
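The error-feedback idea can be shown on a two-weight toy. This is a sketch under strong assumptions, not the GPTQ algorithm itself: the numbers are made up, there are only two input channels, and the compensation step is written out by hand as the least-squares shift `(w1 - q1) * h12 / h22`, which is what the Hessian-based update reduces to in the two-weight case. The real method works on whole weight matrices through the inverse Hessian.

```python
STEP = 0.25  # hypothetical grid spacing


def rtn(w):
    return round(w / STEP) * STEP


# Two input channels, three calibration samples each (made-up numbers).
x1 = [1.0, 2.0, 3.0]
x2 = [1.1, 1.9, 3.2]
w1, w2 = 0.37, -0.22


def output_error(a, b):
    """Squared output error of quantized weights (a, b) against (w1, w2)."""
    return sum(((w1 - a) * p + (w2 - b) * q) ** 2 for p, q in zip(x1, x2))


# The entries of H = 2 * X * X^T that the two-weight case needs.
h12 = 2 * sum(p * q for p, q in zip(x1, x2))
h22 = 2 * sum(q * q for q in x2)

# Plain RTN: round each weight independently.
rtn_err = output_error(rtn(w1), rtn(w2))

# Error feedback: quantize w1, shift w2 to absorb that error, then quantize w2.
q1 = rtn(w1)
w2_compensated = w2 + (w1 - q1) * h12 / h22
q2 = rtn(w2_compensated)
feedback_err = output_error(q1, q2)
```

With these numbers the compensated second weight lands on a different grid level than plain RTN would pick, and the output error is lower. The last weight in a row has nothing after it to absorb its own error, which is why the correction is partial rather than complete.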

04 / 07

AWQ · protect salient weights by scaling

Scale it up, then round it.

AWQ's insight: protecting only ~1% of the weights recovers most of the lost accuracy, and you spot that 1% by the activations, not the weights. Its trick avoids ugly mixed precision: scale a salient weight up by s before quantizing and scale its activation down by 1/s. The product is unchanged, but as long as the block's range barely moves, the grid spacing stays the same while the weight is s times larger, so its relative rounding error shrinks by about 1/s. Sweep the scale.

FIG.04 — A SALIENT WEIGHT ON THE 4-BIT GRID · BIGGER = SHARPER

[Interactive figure: slider for α from 0 (plain RTN) to 1 (full scale).]

Q(w)·x → Q(w·s)·(x / s),   s > 1
s = s_X^α
α* = argmin ‖ Q(W·s)·(X/s) − W·X ‖

[Readouts: scale s = s_X^α, effective grid spacing, relative rounding error.]

No backprop, no Hessian: just activation statistics and a 1-D search over the single exponent α. Faster than GPTQ, and it tends to generalize better to instruction-tuned and multimodal models because it is less prone to overfitting the calibration set. It is widely used in vLLM / TensorRT-LLM serving.
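The 1-D search over α can be sketched on a single output row. This is a simplified illustration and not the AWQ reference code: the helper names, the random data, the 20× "salient" channel, the single quantization group, and the 21-point α grid are all assumptions made for the example.

```python
import random


def quantize_absmax(ws, bits=4):
    """Symmetric round-to-nearest over one group sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in ws) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in ws]


def awq_error(weights, samples, s):
    """Sum of squared errors of Q(W*s) . (X/s) against W . X."""
    w_hat = quantize_absmax([w * si for w, si in zip(weights, s)])
    err = 0.0
    for x in samples:
        ref = sum(w * xi for w, xi in zip(weights, x))
        out = sum(wq * xi / si for wq, xi, si in zip(w_hat, x, s))
        err += (ref - out) ** 2
    return err


random.seed(0)
n = 8
weights = [random.uniform(-1, 1) for _ in range(n)]
# Channel 0 is the made-up salient channel: its activations are 20x larger.
channel_gain = [20.0] + [1.0] * (n - 1)
samples = [[g * random.gauss(0, 1) for g in channel_gain] for _ in range(64)]

# s_X: mean absolute activation per input channel (the only statistic needed).
s_x = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n)]

results = {}
for k in range(21):                                 # alpha = 0.00, 0.05, ..., 1.00
    alpha = k / 20
    s = [sx ** alpha for sx in s_x]
    results[alpha] = awq_error(weights, samples, s)

best_alpha = min(results, key=results.get)
```

`results[0.0]` is plain round-to-nearest, so the search can only match or improve on it for the calibration samples. In a real deployment the 1/s factor is folded into the preceding operation rather than applied to the activations at run time.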

05 / 07

GGUF + llama.cpp · k-quants for CPU / edge

Decode Q4_K_M.

GGUF is a file format; k-quants are its scheme — a different lineage, built for CPU and mixed CPU/GPU inference. It groups weights into super-blocks of 256, gives each sub-block its own tiny scale, then quantizes those scales too. Hover a row below — real Llama-3.1-8B numbers.
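The cost of two-level scaling shows up as a fractional bits-per-weight figure. The sketch below does that arithmetic for one super-block. The helper name is hypothetical, and the layout numbers (sub-block counts, 6-bit or 8-bit scales, FP16 super-block fields) follow llama.cpp's published description of the Q4_K and Q6_K types as best understood here, so treat them as assumptions to check against the source.

```python
def bits_per_weight(weight_bits, sub_blocks, scale_bits, min_bits, fp16_fields,
                    super_block=256):
    """Average storage per weight for one super-block, metadata included."""
    total_bits = (super_block * weight_bits               # the quantized weights
                  + sub_blocks * (scale_bits + min_bits)  # quantized per-sub-block scales/mins
                  + fp16_fields * 16)                     # FP16 scale(s) for the super-block
    return total_bits / super_block


# Q4_K: 8 sub-blocks of 32 weights, 6-bit scale and 6-bit min each, two FP16 fields.
q4_k = bits_per_weight(4, sub_blocks=8, scale_bits=6, min_bits=6, fp16_fields=2)

# Q6_K: 16 sub-blocks of 16 weights, 8-bit scale each, one FP16 field.
q6_k = bits_per_weight(6, sub_blocks=16, scale_bits=8, min_bits=0, fp16_fields=1)

print(f"Q4_K: {q4_k} bpw, Q6_K: {q6_k} bpw")
```

A mix such as Q4_K_M stores some tensors in a higher-bit type, so the file-level average comes out above the per-block figure for Q4_K alone.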

READING THE NAME

Q4_K_M = 4-bit · K-quant (two-level scaling) · Medium mix. The _S / _M / _L suffix says how aggressively the sensitive tensors (embeddings, lm_head) are kept at higher bit-width while the bulk feed-forward goes low.

imatrix ("importance matrix") is an optional calibration pass that — like AWQ/GPTQ — weights the quantization by which weights matter to real text. It rescues the low-bit quants (Q2/Q3) and is now standard for good GGUF releases.

Down to ~4–5 bits the loss is fractions of a percent. Below ~3 bits it falls off a cliff — reasoning and math degrade first.

FIG.05 — Llama-3.1-8B · llama.cpp · the accuracy ⇄ size tradeoff

Scheme	bpw	Size	Δsize	Perplexity	Bench loss

06 / 07

Make it concrete · will it fit?

A real memory calculator.

Memory is the binding constraint. Pick a model size and a quantization, and watch the weights shrink against the hardware you might actually have — including the blal.de box (62 GB RAM, no GPU).

FIG.06 — WEIGHT MEMORY vs HARDWARE · DOES IT FIT?

[Interactive calculator. Inputs: model parameters, quantization. Readouts: parameters, bits per weight, weight memory, size vs FP16.]

Weights only — real serving also needs the KV-cache + activations, so leave headroom. On a CPU box GGUF is runnable, not fast: "runs on my laptop" and "serves production traffic" are different bars.
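The calculator's core is one line of arithmetic: parameters × bits per weight ÷ 8 gives bytes. A minimal sketch follows. The function names, the 80% headroom factor standing in for KV-cache and activations, and the example bit-widths are assumptions for illustration.

```python
def weight_memory_gb(params, bits_per_weight):
    """Weight storage in GB (10^9 bytes); excludes KV-cache and activations."""
    return params * bits_per_weight / 8 / 1e9


def fits(params, bits_per_weight, capacity_gb, headroom=0.8):
    """True if the weights fit in a fraction `headroom` of the available memory."""
    return weight_memory_gb(params, bits_per_weight) <= capacity_gb * headroom


# A 70B model against a 62 GB RAM box, at three example precisions.
for name, bpw in [("FP16", 16.0), ("8-bit", 8.0), ("4.5 bpw", 4.5)]:
    gb = weight_memory_gb(70e9, bpw)
    print(f"70B @ {name}: {gb:.1f} GB, fits in 62 GB RAM: {fits(70e9, bpw, 62)}")
```

The headroom factor is a crude placeholder: long contexts or large batches can make the KV-cache much larger than 20% of memory, so it is worth measuring for the actual workload.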

07 / 07

The three methods · and the honest caveats

The family, and what to actually do.

Three lineages, one goal. They differ in how they decide the rounding.

Method	Idea	Needs	Best at	Used by
GPTQ	error-correcting rounding via the Hessian; nudge later columns to absorb earlier errors	GPU + calibration data	raw perplexity, big-GPU serving	many 4-bit HF checkpoints
AWQ	scale up salient weights (found via activations) before rounding; no backprop	calibration data only	instruction-tuned & multimodal	vLLM, TensorRT-LLM defaults
GGUF k-quants	two-level block scales + per-tensor mixed precision; optional imatrix	llama.cpp, CPU-friendly	CPU / edge / "runs everywhere"	llama.cpp, the local-LLM world

"NEAR-LOSSLESS" IS BENCHMARK-NEAR-LOSSLESS

A 0.5% average drop hides that quantization degrades unevenly: long-context retrieval, multi-step math, and rare-token handling suffer first. Q3_K_S collapsed GSM8K reasoning by 9.3 points in one study. Always eval on your task — perplexity is a weak proxy.

Same theme: the outlier problem is the central villain — a handful of feature dimensions have huge magnitudes that blow up the quantization range. AWQ's scaling, GGUF's per-sub-block scales, and GPTQ's error feedback are all ways of taming outliers.

WHAT TO REACH FOR

Q4_K_M is the default: between 3× and 4× smaller than FP16 with ~0.5% loss, so use it unless you have a reason not to. Q5_K_M / Q6_K when you have the RAM and want near-lossless. Q8_0 when memory isn't the constraint (quality basically free). Q3 / Q2 only when desperate, and only with an imatrix.

Two more axes: this is all weight-only PTQ, so activations stay FP16. Quantizing activations too (W8A8, SmoothQuant, FP8) is harder but unlocks compute, not just memory. And quantize-then-LoRA = QLoRA: train LoRA adapters on top of frozen 4-bit weights.

02 · 03 — you made it

You shrank a model.

The grid and round-to-nearest. Output error, not weight error. GPTQ's error correction, AWQ's scaling, GGUF's k-quants, and a memory calculator that says whether it fits. You now know how a 140 GB model lands on hardware you own. You hold the lever that democratized local LLMs.
