A 70-billion-parameter model in native 16-bit weights needs ~140 GB just to hold the numbers — more than any consumer GPU has. Quantization stores each weight in 4 bits instead of 16, turning that 140 GB into ~40 GB, with surprisingly small quality loss. It is the single technique that put usable LLMs on hardware you actually own.
loading…
Continuous 16-bit weights (top dots) get snapped to the nearest of just 16 allowed levels — a 4-bit grid. Store only the level index (4 bits) plus one scale per block. The network barely notices, because it never depended on the 11th decimal of any single weight.
Store each weight with fewer bits, and choose those bits so the model's outputs barely change — not so the weights themselves barely change. Almost every clever method is a different answer to "fewer bits, but which rounding minimizes the output error?"
Each weight is one tiny vote in a sum over thousands. Nudge any one to its nearest allowed value and the noise partly cancels — like an 8-bit photo looking identical to a 24-bit one.
Find a block's range, carve it into N bins (16 for 4-bit), snap every weight to its nearest bin. Keep the bin index + a scale & zero-point to reconstruct.
Plain round-to-nearest at 4 bits noticeably degrades large LLMs. GPTQ, AWQ and GGUF's k-quants are all ways to be smarter than nearest so 4-bit (even 3-bit) stays usable.
That is the entire low-level mechanism. One scale per block of weights (often 128 of them), plus a 4-bit index each. Everything else in this lesson is about choosing q more cleverly than "nearest".
Memory — the binding constraint; FP16→int4 is a ~4× cut, the difference between "needs 2× A100" and "fits one card". Speed — inference is memory-bandwidth-bound, so quarter the bytes ≈ quarter the traffic. Reach — GGUF + llama.cpp run an 8B on a Raspberry Pi, a ThinkPad, a phone.
A block of continuous weights, and a grid of allowed levels. Drag the bit-width and watch the grid coarsen, every weight jump to its nearest level, and the rounding error grow. This is round-to-nearest, the primitive every method improves on.
Halve the bits and you halve the storage but double the grid spacing — so the worst-case rounding error doubles too. The accent stems show how far each weight was nudged. The whole field is a fight to keep that error small at 4 bits.
Every serious method quantizes one linear layer at a time and asks: keep W·X close, not W close. Some weights matter far more — because the activations that multiply them are large. Drag the activation on the big weight and watch which rounding choice the smart method makes.
The same 0.1 rounding error on a weight is harmless if its activation is tiny — and catastrophic if its activation is huge. Spend precision where it changes the output most.
GPTQ goes column-by-column: each time it quantizes a weight it nudges the remaining weights to compensate, using the activation curvature (the Hessian H = 2·X·Xᵀ). By the last column, every earlier mistake has been absorbed — error-correcting rounding. A 175B model quantizes to 3–4 bits in ~4 GPU-hours.
AWQ's insight: only ~1% of weights matter, and you spot them by the activations, not the weights. Its trick avoids ugly mixed precision — scale a salient weight up by s before quantizing and scale its activation down by 1/s. The product is unchanged, but the larger weight uses more of the grid, so its relative rounding error shrinks. Sweep the scale.
No backprop, no Hessian — just activation statistics and a 1-D search over the single exponent α. Faster than GPTQ, and it generalizes better to instruction-tuned and multimodal models because it doesn't overfit the calibration set. It's the common default in vLLM / TensorRT-LLM serving.
GGUF is a file format; k-quants are its scheme — a different lineage, built for CPU and mixed CPU/GPU inference. It groups weights into super-blocks of 256, gives each sub-block its own tiny scale, then quantizes those scales too. Hover a row below — real Llama-3.1-8B numbers.
Q4_K_M = 4-bit · K-quant (two-level scaling) · Medium mix. The _S / _M / _L suffix says how aggressively the sensitive tensors (embeddings, lm_head) are kept at higher bit-width while the bulk feed-forward goes low.
imatrix ("importance matrix") is an optional calibration pass that — like AWQ/GPTQ — weights the quantization by which weights matter to real text. It rescues the low-bit quants (Q2/Q3) and is now standard for good GGUF releases.
Down to ~4–5 bits the loss is fractions of a percent. Below ~3 bits it falls off a cliff — reasoning and math degrade first.
| Scheme | bpw | Size | Δsize | Perplexity | Bench loss |
|---|
Memory is the binding constraint. Pick a model size and a quantization, and watch the weights shrink against the hardware you might actually have — including the blal.de box (62 GB RAM, no GPU).
Weights only — real serving also needs the KV-cache + activations, so leave headroom. On a CPU box GGUF is runnable, not fast: "runs on my laptop" and "serves production traffic" are different bars.
Three lineages, one goal. They differ in how they decide the rounding.
| Method | Idea | Needs | Best at | Used by |
|---|---|---|---|---|
| GPTQ | error-correcting rounding via the Hessian; nudge later columns to absorb earlier errors | GPU + calibration data | raw perplexity, big-GPU serving | many 4-bit HF checkpoints |
| AWQ | scale up salient weights (found via activations) before rounding; no backprop | calibration data only | instruction-tuned & multimodal | vLLM, TensorRT-LLM defaults |
| GGUF k-quants | two-level block scales + per-tensor mixed precision; optional imatrix | llama.cpp, CPU-friendly | CPU / edge / "runs everywhere" | llama.cpp, the local-LLM world |
A 0.5% average drop hides that quantization degrades unevenly: long-context retrieval, multi-step math, and rare-token handling suffer first. Q3_K_S collapsed GSM8K reasoning by 9.3 points in one study. Always eval on your task — perplexity is a weak proxy.
Same theme: the outlier problem is the central villain — a handful of feature dimensions have huge magnitudes that blow up the quantization range. AWQ's scaling, GGUF's per-sub-block scales, and GPTQ's error feedback are all ways of taming outliers.
Q4_K_M is the default — ~4× smaller, ~0.5% loss, use unless you have a reason not to. Q5_K_M / Q6_K when you have the RAM and want near-lossless. Q8_0 when memory isn't the constraint (quality basically free). Q3 / Q2 only when desperate, and only with an imatrix.
Two more axes: this is all weight-only PTQ — activations stay FP16. Quantizing activations too (W8A8, SmoothQuant, FP8) is harder but unlocks compute, not just memory. And quantize-then-LoRA = QLoRA: fine-tune on frozen 4-bit weights.
The grid and round-to-nearest. Output error, not weight error. GPTQ's error correction, AWQ's scaling, GGUF's k-quants, and a memory calculator that says whether it fits. You now know how a 140 GB model lands on hardware you own. You hold the lever that democratized local LLMs.
You can now run big models small. Next: the math that predicts exactly how good a model will be from its compute, data and parameter budget.