openalicelabs / academy
COURSE TRAIN-02 LESSON 02 · 01 TOPIC RLHF & ALIGNMENT EST. READ ~13 MIN

Teach it what good looks like.

A freshly pre-trained model is a brilliant autocomplete — it predicts the next token, nothing more. It has no idea what helpful or harmless means. Alignment is the post-training step that turns that raw predictor into an assistant, by learning a reward from human preference — "which answer is better?" — and pushing the model to chase it.

FIG.00 — PREDICTOR → ASSISTANT
FIG.0A — THE WHOLE IDEA · humans rank pairs → a reward model → the policy chases reward

We almost never have a number for "how good is this answer." But humans can reliably say A is better than B. Alignment turns that cheap signal — a pile of pairwise preferences — into a learned reward model, then uses reinforcement learning to make the policy produce more of what scores high.

INPUT: a pre-trained base model + human preference data
OUTPUT: a helpful, harmless, steerable assistant
THE SIGNAL: "A > B" pairwise comparisons, not scores
CLASSIC RECIPE: SFT → reward model → PPO (InstructGPT, 2022)
2025 RECIPE: DPO · GRPO + verifiable rewards (RLVR)
01 / 07
Intuition first · the two jobs

A base model isn't dumb. It's unaimed.

Pre-training optimizes one thing: predict the next token across a web-scale text corpus. That gives the model staggering knowledge and zero sense of what you want. Ask it a question and it may just as well continue with more questions, because that is what a lot of text on the web looks like. Alignment gives it an aim.

THE BASE MODEL

a next-token predictor

It models P(next | context). Brilliant pattern completion, but no notion of "answer," "refuse," or "be honest." There is no if (harmful) refuse() anywhere — only a probability distribution.

THE TWO JOBS ★

helpful and harmless

An assistant must answer almost anything and stop at a line — refuse to help build a weapon, leak a secret, dox someone. These two pulls conflict, and the whole craft is balancing them.

SAFETY IS A BEHAVIOR

nudged, not enforced

Alignment doesn't add rules — it nudges the distribution toward good behavior. That's why refusal is statistical, never absolute, and why "we fixed it" claims deserve suspicion.
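
To make "nudged, not enforced" concrete, here is a minimal sketch, assuming a toy model that picks between three hypothetical behaviours. The logits and the size of the nudge are invented for illustration; the point is only that shifting logits moves probability mass without ever driving an outcome to exactly zero.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical scores a base model assigns to three ways of
# continuing a harmful request (numbers are made up).
base_logits = {"comply": 2.0, "refuse": 0.0, "ask_back": 1.0}

# Alignment modelled as a shift in those scores, not as a rule.
nudge = {"comply": -3.0, "refuse": 4.0, "ask_back": 0.0}
aligned_logits = {k: base_logits[k] + nudge[k] for k in base_logits}

p_base = softmax(base_logits)
p_aligned = softmax(aligned_logits)

print(f"P(refuse) before: {p_base['refuse']:.3f}")
print(f"P(refuse) after:  {p_aligned['refuse']:.3f}")
print(f"P(comply) after:  {p_aligned['comply']:.4f}  (small, but not zero)")
```

Refusal becomes the overwhelmingly likely behaviour, yet the compliant continuation keeps a small nonzero probability, which is the statistical sense in which refusal is never absolute.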

WHY NOT JUST WRITE THE RULES?

Because "be helpful," "be honest," "be safe" are fuzzy, contextual, contested human values — you cannot write them as code. But humans can look at two replies and say which is better. Alignment is the trick of converting that easy human judgement into a training signal a model can chase.

The same machinery that makes a model helpful also makes it harmless: collect preferences where the good refusal is preferred, and shape the policy toward it. Helpfulness and harmlessness come out of one preference-optimization pipeline.

So aligning is not a moral add-on bolted onto a finished model. It is the step that turns a predictor into the product — and every lever we pull from here is some way of saying "more of this, less of that."

02 / 07
Preferences → a number · the reward model

Turn "A > B" into a reward.

We can't reliably ask a human to "rate this 0–100", but we can ask "which is better?". A reward model is typically a copy of the LLM with the next-token head swapped for a head that outputs a single scalar. It's trained so the preferred answer scores higher than the rejected one. In the figure, each pick of the better answer nudges that answer's reward up.

FIG.02 — TRAIN A REWARD MODEL FROM YOUR CLICKS
PROMPT — "explain why the sky is blue"

LEARNED REWARD · r(answer) — centred at 0
// Bradley–Terry: probability A is preferred over B
P(A > B) = σ( r(A) − r(B) )
// loss: push the chosen answer's reward up
L = −log σ( r(chosen) − r(rejected) )

Each click is one training pair. The reward model learns to give the chosen answer a higher number than the rejected one. After thousands of pairs it generalizes to score answers it has never seen — that scalar is the signal the policy will chase.

Reward is only meaningful relative to other answers — its absolute value is arbitrary, so we centre it at 0. That's why the bars grow left (worse) or right (better) from the middle line.
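
The Bradley–Terry loss above can be sketched in a few lines. This is a toy under stated assumptions: the "reward model" is just one learnable scalar per answer (a real one is a neural network), and the labels, learning rate, and answer names are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_prob(r_a, r_b):
    """Bradley-Terry: P(A > B) = sigmoid(r(A) - r(B))."""
    return sigmoid(r_a - r_b)

def bt_loss(r_chosen, r_rejected):
    """L = -log sigmoid(r(chosen) - r(rejected))."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Toy "reward model": one scalar per answer, both starting at 0.
rewards = {"A": 0.0, "B": 0.0}
pairs = [("A", "B")] * 20  # hypothetical labels: A was always preferred
lr = 0.5                   # illustrative learning rate

for chosen, rejected in pairs:
    p = preference_prob(rewards[chosen], rewards[rejected])
    grad_chosen = -(1.0 - p)  # dL/dr_chosen; dL/dr_rejected is +(1 - p)
    rewards[chosen] -= lr * grad_chosen
    rewards[rejected] += lr * grad_chosen

print(rewards)
print("P(A > B) =", round(preference_prob(rewards["A"], rewards["B"]), 3))
```

Because the two gradients are equal and opposite, the rewards stay symmetric around 0: only the gap between them carries information, which matches the remark that the absolute value is arbitrary.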

03 / 07
The classic recipe · InstructGPT 2022

Three stages: SFT → reward → PPO.

Classic RLHF is a pipeline. First teach the model to follow instructions at all (SFT), then build the reward model from §02, then use reinforcement learning (PPO) to push the policy toward high reward, while a leash stops it drifting too far. The figure animates one pass through all three stages.

FIG.03 — THE RLHF PIPELINE · ANIMATED
STAGE 1 · SFT

Supervised fine-tune

Show the model thousands of human-written (prompt → ideal answer) demos. It learns the shape of an answer — but only imitates, never exceeds the demos.

STAGE 2 · REWARD MODEL

Learn the preferences

From §02: humans rank the SFT model's samples; a reward model learns to score any answer. This is the judge the next stage optimizes against.

STAGE 3 · PPO

RL against the reward

The policy generates, the reward model scores, PPO nudges the policy toward high reward — minus a KL leash to the SFT model so it can't drift into gibberish.

// the PPO objective, in words
maximize  reward(policy's answer)
  minus   β · KL( policy ‖ SFT model )
// the KL term = the leash. without it the
// policy finds gibberish that fools the RM.

Why the leash? The reward model is an imperfect proxy. Optimize it too hard and the policy discovers weird, off-distribution text that scores high but is nonsense — classic reward hacking (§07). The KL penalty says "stay close to the sensible SFT model," trading a little reward for a lot of sanity.
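
A minimal sketch of the leash, assuming the KL term is estimated from the sampled answer itself as the sum over tokens of log π(token) − log π_ref(token). The function name, the β value, and every number below are illustrative, not taken from any particular implementation.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate.

    logp_policy / logp_ref: per-token log-probs of the SAME sampled
    answer under the current policy and the frozen SFT reference.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# A sensible answer: the policy has barely moved from the reference.
sensible = kl_shaped_reward(
    rm_score=2.0,
    logp_policy=[-1.0, -1.2, -0.8],
    logp_ref=[-1.1, -1.3, -0.9],
)

# A hacked answer: the RM loves it, but the reference finds it wildly unlikely.
hacked = kl_shaped_reward(
    rm_score=3.0,
    logp_policy=[-0.1, -0.1, -0.1],
    logp_ref=[-9.0, -8.0, -10.0],
)

print(f"sensible: {sensible:.2f}   hacked: {hacked:.2f}")
```

In this toy the hacked answer has the higher raw reward-model score but the lower shaped reward, which is the trade the KL penalty is meant to enforce.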

PPO itself is a careful RL algorithm: it takes small, clipped steps, so a single update has no incentive to move the policy far from where it started. It works (InstructGPT made GPT-3 usable), but it's heavy: you juggle four models at once (policy, reward, reference, and a value-network "critic").
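
The "small, clipped steps" idea can be sketched for a single action. This follows the standard clipped surrogate objective from the PPO literature; the clip range eps = 0.2 is a commonly used value chosen here as an assumption, and the function name is hypothetical.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective pessimistic: once the ratio
    # leaves [1 - eps, 1 + eps] in the helpful direction, the gain is capped.
    return min(ratio * advantage, clipped * advantage)

# Good action whose probability already rose 50%: gain is capped at 1.2.
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))
# Bad action whose probability already halved: value is held at -0.8.
print(ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0))
```

Once the ratio is outside the clip range in the direction the advantage favours, the objective is flat, so the gradient stops pushing that action further within the same update.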

04 / 07
Scaling the labels · let AI rank AI

When humans don't scale.

Human preference labels are slow, costly, and inconsistent. The fix: have a capable model do the ranking, guided by a written set of principles. RLAIF swaps the human labeler for an AI; Constitutional AI gives that AI an explicit "constitution" to judge against.

RLHF

humans rank

The gold standard for capturing real human values — but expensive, slow, and noisy. Hard to scale to millions of comparisons, and labelers disagree.

RLAIF ★

an AI ranks

Replace the human labeler with a strong model prompted to pick the better answer. Cheap, fast, consistent — labels scale to whatever you can afford in inference.

CONSTITUTIONAL AI

principles, not raw clicks

Anthropic's recipe: give the AI a short written constitution ("be helpful, avoid harm, respect autonomy"). The model critiques and revises its own answers against it, then trains on the result.

// Constitutional AI · self-critique loop
1. model drafts an answer
2. model critiques it against the constitution
3. model revises to fix the flaws
4. train on (prompt → revised answer)
   + an AI-ranked preference phase (RLAIF)

The payoff: the values become legible. Instead of being implicit in a million human clicks, they're written down in a document you can read, audit, and edit. Change the constitution, re-run, and the model's behavior shifts in a way you can explain.

The catch: the AI labeler inherits its own biases and blind spots, and a model judging a model can drift in a self-reinforcing loop. RLAIF scales labels; it does not magically scale wisdom. Most production stacks blend AI labels with a human-checked core.
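
The self-critique loop and the AI-ranked preference step might be sketched as below. Everything here is an assumption for illustration: the model is any callable from prompt text to reply text, the prompt wording is invented, and `stub_model` is a canned stand-in so the example runs without a real LLM.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

def critique_and_revise(model: Model, prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Steps 1-3: draft, critique against each principle, revise."""
    draft = model(prompt)
    answer = draft
    for principle in principles:
        critique = model(
            f"Critique this answer against the principle '{principle}':\n{answer}"
        )
        answer = model(
            f"Revise the answer to address the critique.\n"
            f"Answer: {answer}\nCritique: {critique}"
        )
    return draft, answer

def ai_preference(judge: Model, prompt: str, answer_a: str, answer_b: str,
                  principles: List[str]) -> Tuple[str, str]:
    """RLAIF labelling: returns (chosen, rejected) as picked by the judge."""
    rules = "; ".join(principles)
    verdict = judge(
        f"Which answer better follows these principles ({rules})? "
        f"Reply A or B.\nPrompt: {prompt}\nA: {answer_a}\nB: {answer_b}"
    )
    if verdict.strip().upper().startswith("A"):
        return answer_a, answer_b
    return answer_b, answer_a

def stub_model(text: str) -> str:
    """Canned replies standing in for a real model."""
    if text.startswith("Critique"):
        return "too blunt"
    if text.startswith("Revise"):
        return "a gentler answer"
    if text.startswith("Which"):
        return "B"
    return "a blunt answer"

constitution = ["be helpful", "avoid harm"]
draft, revised = critique_and_revise(stub_model, "How do I say no?", constitution)
chosen, rejected = ai_preference(stub_model, "How do I say no?",
                                 draft, revised, constitution)
print("SFT pair:       ", ("How do I say no?", revised))
print("preference pair:", (chosen, rejected))
```

The revised answers feed supervised training, and the (chosen, rejected) pairs feed the same preference machinery as human labels, which is also where the judge's own biases enter.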

05 / 07
The simplification · skip the RL

DPO: the reward model was hiding in the policy.

PPO is a beast: four models, an unstable RL loop, a separate reward model to train and game. Direct Preference Optimization shows you can skip all of it: the optimal policy is its own implicit reward model, so you can train directly on the preference pairs with one plain supervised loss. The β slider in the figure shows the trade.

FIG.05 — PPO PIPELINE vs DPO · same data, far fewer moving parts
Classic RLHF · PPO · 4 models
prefs ─▶ [ reward model ] ──┐
                          ▼
 policy ◀── PPO loop ◀── score
   │   + KL leash to reference
   │   + value-network critic
   ▼
unstable · expensive · hackable RM
Direct · DPO · 2 models
prefs ─▶ one supervised loss
              │  raise log-prob of chosen,
              ▼  lower log-prob of rejected
 policy  (+ frozen reference for the leash)

no reward model · no RL · stable
β (KL strength) slider: low β chases preferences hard · high β stays near the reference
Readouts: preference fit · drift from reference · risk
// DPO loss: one term, no RL, no reward model
L = −log σ( β · [ Δlogπ(chosen) − Δlogπ(rejected) ] )
// Δlogπ(y) = log π(y) − log π_ref(y)
// β is the same KL leash, folded into the loss

The insight: the reward and the optimal policy are two views of the same thing. DPO writes the reward in terms of the policy and the reference, substitutes it into the Bradley–Terry preference loss from §02, and the intractable normalizing term cancels because only reward differences matter. What's left is a single classification-style loss that raises the chosen answer's probability relative to the reference and lowers the rejected one's.

The β in the figure is the same leash as PPO's KL term, just folded into the loss. That simplicity is why DPO became the default for preference-tuning open models: nearly RLHF-quality, a fraction of the engineering, and no separate reward model to overfit or game.
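
The DPO loss can be written out for a single preference pair. This is a sketch, assuming each argument is the summed log-probability of a whole answer under the policy or the frozen reference; β = 0.1 and all the log-probabilities are illustrative numbers.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [dlogpi(chosen) - dlogpi(rejected)])."""
    delta_chosen = logp_chosen - ref_logp_chosen        # log pi - log pi_ref
    delta_rejected = logp_rejected - ref_logp_rejected
    margin = beta * (delta_chosen - delta_rejected)
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0, the loss is log 2.
start = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# Policy has raised the chosen answer and lowered the rejected one.
trained = dpo_loss(-9.0, -15.0, -12.0, -11.0)

print(f"loss at start: {start:.4f}   after fitting: {trained:.4f}")
```

Only log-probabilities from the policy and the frozen reference appear, so there is no reward model and no sampling loop. A larger β amplifies any deviation from the reference inside the sigmoid, which is how it acts as the leash.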

06 / 07
RL for reasoning · let the group be the baseline

GRPO: delete the critic, ask the group.

For reasoning (math, code), we don't need a learned reward model at all: we can just check the answer. GRPO samples a whole group of answers to one question, scores each with a cheap rule, and uses the group's average as the baseline. Beat the average → "do more of this." That removes PPO's expensive critic. The figure samples one group and shows the advantages that fall out.

FIG.06 — GROUP-RELATIVE ADVANTAGE · sample G answers, score, normalize
QUESTION — "what is 17 × 24 ?" (answer: 408)
Panels: group of G = 8 sampled answers · group mean (the baseline) · advantage = (r − mean) / std · policy accuracy (rises with training)
// GRPO advantage: no critic, no reward model
sample group {o₁ … o_G} for question q
reward each  rᵢ = 1 if correct else 0   (rule-based)
Âᵢ = ( rᵢ − mean(r) ) / std(r)
     └ the group IS the baseline

PPO needed a second network — a critic, nearly as big as the model — just to guess "how good is this state?". GRPO throws it away: the average of G sampled answers is that estimate, for free. Answers above the group mean get a positive advantage, ones below get negative.

The magic ingredient is the verifiable reward (RLVR): for math you string-match the answer, for code you run the tests. No human labels, no learned reward model to game. This is the engine behind DeepSeek-R1 and the 2025–26 reasoning wave.
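
A verifiable reward for the math case can be as small as the sketch below. The extraction rule (take the last number in the completion) is an assumption made for illustration; real verifiers usually require a stricter answer format, and `math_reward` is a hypothetical name.

```python
import re

def math_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the last number in the completion equals the gold answer."""
    # Drop thousands separators, then pull out every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == gold else 0.0

print(math_reward("17 × 24 = 340 + 68 = 408", "408"))  # correct chain of steps
print(math_reward("I think it's 418", "408"))          # wrong final answer
print(math_reward("no idea", "408"))                   # no number at all
```

The rule is cheap and has no learned parameters to exploit, though a policy can still game a sloppy extraction rule, so the check itself deserves the same scrutiny as any other proxy.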

Watch the trap: when every answer in a group is right (or all wrong), the spread is zero → advantage is zero → no gradient. That's why group size and a mix of hard/easy questions matter.
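
The group-relative advantage, including the zero-spread trap, can be sketched as follows. Two details are assumptions: the population standard deviation is used, and a small eps is added to the denominator to avoid dividing by zero. Implementations differ on both.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps) within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 8 answers to one question, scored 1.0 if correct else 0.0.
mixed = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print([round(a, 2) for a in group_advantages(mixed)])

# The trap: every answer right (or every answer wrong) gives zero spread,
# so every advantage is zero and the group contributes no gradient.
print(group_advantages([1.0] * 8))
print(group_advantages([0.0] * 8))
```

With a half-right group, correct answers get an advantage near +1 and wrong ones near −1. A group with no disagreement yields all zeros, which is why questions the policy always or never solves teach it nothing.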

07 / 07
Where alignment bites back · the failure modes

Optimize a proxy, and it bites.

Every method here optimizes a proxy for "good" — a reward model, an AI judge, a rule. Push any proxy hard enough and the policy finds the gap between the proxy and what you actually wanted. Crank the optimization pressure and watch true quality and proxy reward diverge — the signature of reward hacking.

FIG.07 — REWARD HACKING · proxy reward keeps climbing while real quality collapses
Axes: optimization pressure (KL from reference), from gentle to over-optimized · proxy reward (what the RM says) · true quality (what humans actually want) · verdict
REWARD HACKING

game the proxy

The policy exploits flaws in the reward model — padding, flattery, formatting tricks — to score high without being good. The KL leash and verifiable rewards both fight this.

SYCOPHANCY

agree, don't help

Humans reward answers that sound good and flatter them, so the model learns to tell you what you want to hear — even when you're wrong. A direct artifact of training on human approval.

ALIGNMENT TAX

safer, slightly duller

Aligning can shave raw capability — and over-refusal makes it decline benign requests ("how do I kill a Python process?"). Every lab is picking a point on the safety↔capability curve, not escaping it.

GOODHART'S LAW, IN ONE LINE

"When a measure becomes a target, it ceases to be a good measure." The reward model was a measure of quality. Make it the target of hard optimization, and the policy learns to maximize the measure, not the quality.

The honest takeaway: alignment moves the distribution toward good behavior, it never closes the gap. Verifiable rewards (§06) dodge reward-model hacking — but only where a verifier exists. On fuzzy goals (helpfulness, creativity) you're back to a learned proxy and all its failure modes return.
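
Here is a toy sketch of the proxy/quality divergence, under loudly stated assumptions: an answer is reduced to two invented counters (substance and padding), the flawed reward model pays for both, true quality is hurt by padding, and there is only so much real substance to add. None of these numbers come from a real system.

```python
def proxy_reward(substance, padding):
    """Flawed reward model: it also pays (a little) for padding."""
    return substance + 0.5 * padding

def true_quality(substance, padding):
    """What readers actually want: substance, and no padding."""
    return substance - 0.5 * padding

def optimize(steps, max_substance=5):
    """Greedy hill-climbing on the proxy, one move per step."""
    substance, padding = 0, 0
    history = []
    for _ in range(steps):
        candidates = [(substance, padding + 1)]
        if substance < max_substance:  # real improvements run out
            candidates.append((substance + 1, padding))
        # Take whichever single move raises the PROXY the most.
        substance, padding = max(candidates, key=lambda c: proxy_reward(*c))
        history.append((proxy_reward(substance, padding),
                        true_quality(substance, padding)))
    return history

for step, (proxy, quality) in enumerate(optimize(10), start=1):
    print(f"step {step:2d}   proxy {proxy:4.1f}   true quality {quality:4.1f}")
```

For the first five steps the proxy and true quality rise together. Once real improvements run out, the optimizer keeps climbing the proxy by adding padding, and true quality falls with every further step.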

For Alice specifically: her harmlessness comes from this same machinery that makes her helpful — so the safety↔capability tension is her tension. Over-refusal feels lobotomized; under-refusal is exploitable. "Personality drift is a feature" is a deliberate position on that curve.

02 · 01 — you made it

You aligned a model.

A reward model from "A > B." The SFT → reward → PPO loop with its KL leash. RLAIF and a written constitution. DPO collapsing the whole thing into one loss. GRPO letting the group be the baseline for verifiable reasoning. And the failure modes that haunt all of it. You now hold the levers that turn a predictor into an assistant.

01·xx The architecture · tokens, embeddings, attention, the transformer ✓ done
02·01 RLHF & Alignment · reward models, PPO, DPO, GRPO, the failure modes ✓ complete
02·04 Scaling Laws · predict the model before you build it next
02·05 Test-time compute & reasoning · think longer at inference (the o1/R1 idea) locked
Next · 01 · 05

Mixture-of-Experts →

You've shaped the model's behavior. Now make it bigger without making every token cost more — route each token to a few specialist sub-networks.
