Machine Learning / 2022 / arXiv


Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.

A pretrained language model is not an assistant. It's a thing that completes text the way the internet completes text — including unhelpfully, evasively, or with whatever the most likely continuation happens to be. RLHF is what closes the gap.


Imagine you ask GPT-3 (the actual 2020 GPT-3, not anything you've used in a chat interface) the question: "Explain the moon landing to a 6 year old." A reasonable continuation, given that this string mostly appears on the internet inside lists of example questions, is: "Explain the theory of gravity to a 6 year old. Explain the theory of relativity to a 6 year old. Explain the big bang to a 6 year old." The model isn't broken. It's doing exactly what it was trained to do: predict the next token in a document where that question appeared. The document just happened to be a list, not an answer.

This is the gap the InstructGPT paper closes. A pretrained LM has, somewhere inside it, the ability to write a great explanation of the moon landing for a six-year-old. It also has the ability to continue the sentence as a list, write it in French, write it as a Reddit thread, write it as the opening of a fanfic. Pretraining doesn't pick. The model emits whatever the most probable continuation is given the prompt-shaped soup of the training data. To get an assistant, you have to teach the model that "the user wrote a question and wants an answer" is the relevant frame, then teach it which kind of answer is good.

Think of pretraining as having read every book in a library, including the indexes, the appendices, the marginalia, the back-cover blurbs, the catalogue cards, and a few thousand fanfics someone uploaded by mistake. The model does not know which of those it is being. It will happily continue your prompt in whatever genre the prompt most resembles. When you type a question that pattern-matches to a Stack Overflow header, you get Stack-Overflow-shaped text. When you type one that pattern-matches to a worksheet, you get worksheet-shaped text. The capability to answer is in there. The model just doesn't know that answer is what you're asking for.

The InstructGPT recipe — supervised fine-tuning, then a reward model, then PPO with a KL tether — is how OpenAI did it in 2022. It's the recipe that turned GPT-3 into the thing that became ChatGPT. The same three stages, with variations, run inside basically every chat model you've used since. It's worth understanding all three because nearly every alignment failure mode that's been discussed publicly — sycophancy, refusal weirdness, hallucinated confidence, length bias — sits on a knob inside one of them.

[Interactive figure: the InstructGPT recipe, three jobs stacked. RLHF is three training jobs, not one. Each stage takes a different kind of human input, minimises a different loss, and produces a different artifact. The captured panel describes the first stage (SFT). Human role: ~13k demonstrations, written by labelers. Scaling: expensive per token, because humans have to write whole answers. What gets optimized: cross-entropy on the demonstration tokens. Artifact handed to next stage: a fine-tuned LM that answers in assistant shape.]

The three stages are not three runs of the same algorithm with different data. They are three distinct training jobs with three distinct losses, three distinct kinds of human input, and three distinct artifacts.

Why pretraining alone doesn't get you ChatGPT

Pretraining optimises one objective: minimise the log-loss of the next token, averaged over a giant dump of internet text. That objective is breathtakingly powerful. It's enough to produce a model that knows physics, can write code, can argue both sides of a court case, can do bad poetry in the style of Wallace Stevens. But the objective has nothing to say about what kind of continuation is useful to a user typing into a text box.

Concretely: the training set contains lots of question-answer pairs, but it also contains lots of unanswered questions, sarcastic non-answers, stack traces under questions, lists of similar questions, and entire forum threads where the question is asked and the actual answer is buried thirty replies down. When you ask GPT-3 a question, you're not picking a behavior. You're sampling from the conditional distribution of "what comes next on the internet after this string," weighted by how much of the training data looked like each genre.

Few-shot was the original workaround

The original GPT-3 paper had a clever workaround: few-shot prompting. Give the model two or three examples of the format you want, and the conditional distribution shifts. After "Q: capital of France? A: Paris. Q: capital of Japan? A: Tokyo. Q: capital of Peru? A:", the most likely continuation is now "Lima". This works because the model has learned that documents are usually internally consistent: if the first three Q-A pairs follow a format, the fourth will too. Few-shot prompting makes the user do the work of nailing down the genre.
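As a minimal sketch of what "making the user nail down the genre" looks like in practice, the snippet below assembles a few-shot prompt from worked examples. The helper name `build_few_shot_prompt` and the exact "Q:/A:" layout are illustrative assumptions, not a format taken from the GPT-3 paper.

```python
def build_few_shot_prompt(examples, query):
    """Format worked Q/A pairs, then leave the final answer slot open."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    # The trailing "A:" is what invites the model to continue with an answer.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


examples = [("capital of France?", "Paris"), ("capital of Japan?", "Tokyo")]
prompt = build_few_shot_prompt(examples, "capital of Peru?")
print(prompt)
```

Every worked example costs context tokens, which is one concrete reason this approach scales poorly.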

Few-shot prompting is brittle. It eats context tokens. It fails when the user wants behaviors that don't show up cleanly as a worked example — be honest about uncertainty, don't help with this category of harmful request, prefer concise answers when the question is simple. You'd need a thousand-shot prompt and the model still wouldn't generalise the way you want. The path forward is not better prompts. The path forward is to change the model so its default behavior is to be a helpful assistant.

Stage 1: SFT, or teach it the format

The first stage is the most boring and the most necessary. You hire labelers, give them prompts, and have them write the response they would want to see if they were the user. Then you fine-tune the pretrained model on these (prompt, response) pairs with a regular cross-entropy loss. That's it. This step is called supervised fine-tuning, or SFT.
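A rough sketch of the SFT loss on a toy example may help here. It assumes the loss is averaged only over the response (demonstration) tokens, with prompt positions masked out; the function names, the three-token vocabulary, and the logits are all made up for illustration.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the vocabulary for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count


# Toy setup: vocabulary of 3 tokens, sequence of 4 positions.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
targets = [0, 1, 1, 2]
mask = [0, 0, 1, 1]  # assumed: first two positions are prompt, last two are response
loss = sft_loss(logits, targets, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

A real implementation would compute the same quantity with a deep-learning framework over batches of tokenized (prompt, response) pairs.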

The InstructGPT paper used about 13,000 demonstrations from a small team of contracted labelers. Each labeler was given a prompt — drawn from API users (with consent), from the labelers themselves, and from a few seed-prompt datasets — and asked to write the ideal answer. Average length: a couple of paragraphs. Average labeler effort: real, but not crazy. A labeler can do maybe forty or fifty good demonstrations a day before quality drops.

SFT does two things. It teaches the model the shape of an assistant turn: the response goes here, addressed to the user, in roughly this length and tone. And it pulls the model's distribution sharply toward "answer the question" instead of "continue the document." After SFT, "Explain the moon landing to a 6 year old" gets a moon landing explanation, not a list. The post-SFT model is not yet ChatGPT-good. It is, however, recognisably an assistant.

What SFT can and can't convey

The geometry to picture: pretraining puts the model on a giant manifold where every point is some plausible continuation of internet text. SFT does not move the model very far on that manifold — the gradients are small, the data is small — but it picks a region of the manifold (the assistant-turn region) and gives the model a strong default to land there when prompted with the chat template. After SFT, sampling from the model means sampling from a much narrower distribution: roughly, the kinds of responses our labelers would write.

What SFT can't do is convey nuance the labeler didn't write down. If you want the model to refuse harmful requests, hedge under uncertainty, prefer one style of answer over another, and avoid sycophancy, you'd need to write demonstrations of every shade of every behavior. Demonstrations are expensive and don't transfer well between cases. A labeler writing the perfect answer to one prompt is producing one data point that locks in their personal taste, their reading of the situation, their phrasing tics. Two labelers writing answers to the same prompt produce different data, and the model has to average across both. So we move to a cheaper signal: comparison.

Stage 2: the reward model

Here's the trick that makes this whole recipe scale. It is much harder for a labeler to write a perfect response than to judge which of two candidate responses is better. Writing a paragraph from scratch under labeling time pressure is hard. Reading two paragraphs and clicking the better one is easy. Cheap human signal beats expensive human signal when you can convert it into the right shape.

So you sample two (or four, or k) responses from your SFT model for the same prompt, show them to the labeler, and ask: which is better? You get back a ranking for that prompt, such as A > B > D > C. The InstructGPT paper used k = 4 to k = 9, and turned each ranked list into all the pairs it implied: k(k−1)/2 of them, so six pairs from a 4-way ranking and 36 from a 9-way one. The total comparison budget was about 33,000 prompts, which at 6 to 36 pairs per prompt expands to roughly 200,000 pairwise judgments or more.
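The ranking-to-pairs expansion is easy to sketch. The snippet below assumes a strict best-to-worst ranking with no ties; `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb


def ranking_to_pairs(ranking):
    """ranking is ordered best to worst; returns (winner, loser) tuples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher.
    return list(combinations(ranking, 2))


pairs = ranking_to_pairs(["A", "B", "D", "C"])
print(pairs)
print("pairs from k=4:", comb(4, 2), "| pairs from k=9:", comb(9, 2))
```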

Bradley-Terry as the fitting objective

Now you train a separate model — the reward model, or RM — to predict that ranking. The RM is initialized from the SFT model with the language-modeling head replaced by a single scalar output. You train it with a pairwise loss: for any pair where the labeler said A > B, the RM should give A a higher scalar score than B. The exact loss is the Bradley-Terry objective, which is the standard way to fit a latent score from pairwise comparisons:

L(θ) = − E_{(x, y_w, y_l)} [ log σ(r_θ(x, y_w) − r_θ(x, y_l)) ]

Read that loss out loud: for every pair where y_w won and y_l lost, push the winner's score above the loser's, and the size of the push is proportional to how much probability the model currently puts on the wrong ordering, σ(r_θ(x, y_l) − r_θ(x, y_w)). Pairs the RM gets right with high margin contribute almost no gradient. Pairs the RM gets wrong contribute a lot. This is exactly the dynamics you want: most of the gradient signal goes toward the disagreements, not the easy cases.
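The gradient behaviour can be checked numerically. This is a small sketch of the per-pair loss and its derivative with respect to the winner's score; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(r_w, r_l):
    """Bradley-Terry loss for one pair: -log sigmoid(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))


def bt_grad_wrt_winner(r_w, r_l):
    """d(loss)/d(r_w) = -sigmoid(r_l - r_w): minus the probability of the wrong ordering."""
    return -sigmoid(r_l - r_w)


for margin in (-3.0, 0.0, 3.0):
    print(f"margin={margin:+.1f}  loss={bt_loss(margin, 0.0):.4f}  "
          f"grad={bt_grad_wrt_winner(margin, 0.0):+.4f}")
```

A pair the model already ranks correctly by a wide margin yields a gradient near zero, while a confidently wrong pair yields a gradient near −1.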

[Interactive figure: −log σ(r_a − r_b). Click your preferences and watch a scalar reward emerge. Four candidates with a hidden true quality; clicking the better of two runs one gradient step on the Bradley-Terry head, and a button simulates 50 noisy clicks. All learned rewards r(c) start at 0.00, so every predicted P(row beats column) starts at 0.50.
Candidate A (truth = 1.4): Quick, accurate two-sentence answer with a concrete number.
Candidate B (truth = 0.6): Solid answer. Slightly hedged. Adds one useful caveat.
Candidate C (truth = -0.4): Vague restatement of the question. Mostly filler.
Candidate D (truth = -1.6): Confidently wrong. Asserts the opposite of correct.
Panels: learned reward per candidate; predicted P(row beats column) under current r; preference votes accumulated (row beats column).]

Click pairwise preferences and watch the Bradley-Terry head fit a scalar. With four candidates and a hidden ground truth, it takes only a few clicks to nail the ordering. The simulator shows what 50 noisy labelers' worth of data looks like: the ranks settle even when individual votes disagree. That's the whole magic of Bradley-Terry: aggregate noise into a scalar.

What you get out is a function r(prompt, response) → real number that approximates "how good a human would judge this response." It's not perfect — it's a model trained on tens of thousands of comparisons, not hundreds of millions — but it's a cheap, dense, automatic judge. That's the whole reason this stage exists. You traded "a labeler has to write a perfect answer" for "a labeler has to click the better of two answers," and you turned the click into a function you can call a million times a day.
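To make "turning clicks into a function" concrete, here is a toy sketch that fits one scalar per candidate from simulated noisy votes. It is not a neural reward model: the four hidden quality values, the vote simulator, the vote count of 2000, and the plain gradient-descent loop are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_votes(truth, n_votes, rng):
    """wins[a][b] counts how often a was preferred over b by a noisy labeler."""
    n = len(truth)
    wins = [[0] * n for _ in range(n)]
    for _ in range(n_votes):
        a, b = rng.sample(range(n), 2)
        if rng.random() < sigmoid(truth[a] - truth[b]):
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


def fit_rewards(wins, steps=2000, lr=1.0):
    """Gradient descent on the mean Bradley-Terry loss over all votes."""
    n = len(wins)
    total = sum(sum(row) for row in wins)
    r = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for w in range(n):
            for l in range(n):
                if wins[w][l]:
                    # Each vote pushes in proportion to P(wrong ordering).
                    push = wins[w][l] * sigmoid(r[l] - r[w]) / total
                    grad[w] -= push
                    grad[l] += push
        r = [ri - lr * gi for ri, gi in zip(r, grad)]
    return r


truth = [1.4, 0.6, -0.4, -1.6]  # hidden quality of four hypothetical candidates
wins = simulate_votes(truth, 2000, random.Random(0))
learned = fit_rewards(wins)
print([round(v, 2) for v in learned])
```

The learned scores are only identified up to an additive constant, which is why the loop never needs to pin down an absolute scale.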

The RM's architecture is almost identical to the SFT model. Same tokenizer, same transformer stack, same hidden states. The only change is the final head: instead of projecting to vocab logits, project to a single scalar. The InstructGPT paper used a 6B-parameter RM to grade a 175B-parameter policy, which sounds backwards but works fine — you don't need the world's best language model to score whether response A is better than response B; you need a model that has read enough preference data to capture the labelers' average taste.

Stage 3: PPO under a KL tether

Now you have a way to score any response numerically. The natural move is reinforcement learning: treat the language model as a policy π, sample responses, score them with the RM, and update π to make high-scoring responses more likely. The InstructGPT paper uses Proximal Policy Optimization (PPO), a policy-gradient algorithm (Schulman et al., 2017) originally benchmarked on simulated robot control and Atari games, which tends not to blow up. PPO is not the point: any policy-gradient algorithm with a trust-region-flavoured update would work. The point is the objective.

Here is where you'd hope it just works. It does not just work. If you let PPO maximize the reward model's score with no other constraint, the policy will rapidly drift to outputs that the RM scores extremely high but that look like nothing humans actually want — sometimes literal gibberish, sometimes weirdly repetitive text, sometimes outputs that exploit specific quirks of the RM. The reward model was trained on responses that came from somewhere near the SFT model's distribution. As soon as the policy moves far from there, the RM is extrapolating, and its scores stop meaning anything. The policy is then optimising a number that has no relationship to human preference. This goes wrong fast.

The fix is a KL penalty added to the reward. The actual objective being optimised at each step is:

objective = E_{(x, y) ∼ π} [ r_θ(x, y) − β · KL(π(·|x) ‖ π_SFT(·|x)) ]

The first term is what the RM thinks of the response. The second term punishes the policy for moving its distribution too far from the SFT model. β controls how tight the leash is. You can read this as: "get high reward, but stay close to a model that produces fluent, on-distribution English." The KL is what keeps the optimizer honest. It's the difference between an assistant and a thing that scribbles whatever the RM happens to like.
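A small numeric sketch of this objective, evaluated on a finite set of three candidate responses for a single prompt. The probabilities and rewards are invented; the point is only to show how β changes which policy scores better.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tethered_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


reference = [0.5, 0.3, 0.2]    # stand-in for pi_SFT over three responses
rewards = [0.0, 1.0, 3.0]      # stand-in RM scores
greedy = [0.01, 0.01, 0.98]    # a policy that piles onto the top-scoring response

for beta in (0.1, 5.0):
    print(f"beta={beta}: stay at reference -> "
          f"{tethered_objective(reference, reference, rewards, beta):.3f}, "
          f"go greedy -> {tethered_objective(greedy, reference, rewards, beta):.3f}")
```

With a loose leash the greedy policy wins; with a tight one, staying at the reference scores higher.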

Where does this objective come from?

The form E[r] − β·KL(π ‖ π_ref) is not arbitrary. It's the dual of a hard trust-region constraint. If you start from the constrained problem "maximise expected reward subject to KL(π ‖ π_ref) ≤ ε," then by Lagrangian duality you get exactly an unconstrained problem of the form E[r] − β·KL, where β is the Lagrange multiplier on the KL constraint. β stands in for the trust-region radius ε, expressed as a penalty rather than a hard cap: a larger β corresponds to a smaller radius.

Once you write it that way, an even nicer fact pops out. On any finite set of candidate responses, the policy that maximises E[r] − β·KL(π ‖ π_ref) has a closed form: π*(y|x) ∝ π_ref(y|x) · exp(r(x,y) / β). This is the Gibbs distribution with reward-as-energy, tilted by π_ref. PPO is doing approximate gradient steps toward this fixed point. DPO, which we'll get to later, exploits this exact closed form to skip the explicit RM entirely.
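The closed form is easy to check on a toy problem. The sketch below computes the tilted distribution for a few values of β and compares its objective against a perturbed policy; the three-response setup and all numbers are assumptions for illustration.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(policy, reference, rewards, beta):
    return sum(p * r for p, r in zip(policy, rewards)) - beta * kl(policy, reference)


def optimal_policy(reference, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed in log space."""
    log_w = [math.log(q) + r / beta for q, r in zip(reference, rewards)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]

for beta in (0.05, 1.0, 100.0):
    pi_star = optimal_policy(reference, rewards, beta)
    expected_r = sum(p * r for p, r in zip(pi_star, rewards))
    print(f"beta={beta:<6} pi*={[round(p, 3) for p in pi_star]} "
          f"E[r]={expected_r:.3f} KL={kl(pi_star, reference):.3f}")
```

Small β sends nearly all mass to the highest-reward response at a large KL cost; large β leaves the policy almost indistinguishable from the reference.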

[Interactive figure: π*(a) ∝ π_SFT(a) · exp(r(a)/β). The KL tether is the only thing keeping the policy honest. A 1D toy: π_SFT is a Gaussian over an action axis (think: response style). The reward function (gold) puts mass somewhere else, with a sharp secondary peak the RM mistakenly likes. Dragging β slides the RL policy along the closed-form solution. With β small, the RL policy collapses onto the secondary reward peak, which is reward hacking on the toy axis. With β large, the RL policy returns to the base distribution and grabs almost no reward. A frontier panel on the right traces every β you can pick; production assistants live somewhere on the middle stretch.]

Almost every alignment failure mode you've heard of can be located somewhere on this leash. Tighten β too much and the model can't change at all — it's just SFT with extra steps. Loosen β too much and you get reward hacking. The sweet spot is a narrow band that depends on the RM's quality, the diversity of the prompt distribution, and how long you train. Most production teams report spending more wall-clock time tuning β and the RM's data mix than tuning anything in the policy gradient itself.

PPO itself contributes one more piece of regularisation: a clipping term on the ratio of new-policy probability to old-policy probability for each token. The clip prevents any single batch from moving the policy too far in one direction, which gives you a second layer of trust-region control on top of the KL. PPO without the clip can still work but is more fragile. PPO without the KL term is a runaway optimiser. PPO with both is the workhorse that shipped through 2023.
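For reference, the clipped surrogate for a single sample can be sketched in a few lines. The clip width of 0.2 is the commonly used default from the PPO paper and is an assumption here, as are the function and argument names.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-clip objective: min(ratio * A, clip(ratio) * A).

    ratio is new-policy probability divided by old-policy probability.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): gains are capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0))
# Bad action (A < 0): the objective is capped once the ratio has already
# dropped below 1 - eps, so there is no incentive to push it further down.
print(clipped_surrogate(0.5, -1.0))
# Moving the wrong way is never clipped, so the full penalty applies.
print(clipped_surrogate(1.5, -1.0))
```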

Why all three stages are needed

It's tempting to ask whether you can skip one. The honest answer is: yes, sort of, at a cost.

Skip SFT and start RL from the pretrained model directly: the policy doesn't know what shape an assistant turn looks like, so the early samples are mostly irrelevant text and the RM has nothing useful to grade. RL barely moves. SFT bootstraps you onto the manifold where rewards are meaningful. Without that bootstrap the policy spends weeks of compute drifting around the soup of internet text before it accidentally produces something assistant-shaped.

Skip the reward model and try to do RL directly from human ratings: humans are slow. Each PPO step needs gradient signal from thousands of rollouts; if a human has to grade each rollout, you've capped your training rate at maybe a hundred steps a day. The RM is a speedup of several orders of magnitude. It's also far more consistent: the same RM gives the same response the same score, which is more than you can say for two human labelers.

Skip PPO and just do SFT with more demonstrations: this is what happens in pure-instruction-tuning recipes. It works surprisingly well (Alpaca, Vicuna, and the early open-source models lived here), but you can't easily teach behaviors that depend on relative judgments ("hedge slightly more," "prefer this style of answer"), and you can't train against your own samples to fix specific failure modes. You hit a ceiling. The InstructGPT paper compared its SFT-only models against the SFT+PPO models and found that labelers preferred the PPO outputs.

Each stage does work the others can't

The three stages are doing distinct work. SFT teaches what an assistant looks like. The RM compresses human preference into a fast scoring function. PPO under KL moves the policy toward the RM's preferences without breaking it. Each step's output is the next step's input. You cannot rearrange the order — the RM has to be initialised from a model that already produces assistant-shaped text, or else its training data has nothing to do with what the RM will be scoring at deployment time.

[Interactive figure: Bradley-Terry → KL-tethered policy. All three stages collapsed into one panel, for the prompt "How do I make my coffee taste better?". Each pairwise click runs one gradient step on −log σ(r_a − r_b), and the reward model adjusts. The policy is the closed-form optimum of E[r] − β·KL(π ‖ π_base): π(c) ∝ π_base(c)·exp(r(c)/β). A tight tether (high β) holds the policy near the base distribution; a loose tether (low β) lets it chase high-reward answers and the KL grows. The narrow stripe in between is where production assistants live.]

Reward hacking, sycophancy, and the helpful-honest gap

RLHF is not a clean optimization. Three failure modes show up consistently, and each one is more or less fundamental — not a bug to patch but a property of the recipe.

Reward hacking

The policy finds responses that score high on the RM but aren't what humans actually wanted. Classic examples: padding answers with confident-sounding qualifiers because labelers prefer fluent text, restating the question before answering because that pattern correlated with being preferred, adding bullet points even when bullets are inappropriate because bulleted text reads as organised. The RM is a proxy. Optimising against a proxy hard enough finds the gaps between the proxy and the true objective. This is just Goodhart's Law with extra steps.

The most-cited concrete case from the InstructGPT paper appendix is length bias: the RM, fit on labeler clicks, picked up that labelers on average preferred slightly longer responses. After PPO, the policy started writing dramatically longer responses than the SFT model, even when the question called for one sentence. Was the model getting better, or was it just getting wordier? Both, somewhat. The point is that the proxy and the truth diverged, and once they did, optimisation pressure found the wedge.

[Interactive figure: Goodhart on a length-biased reward model. The RM rewards length. The policy notices. Five candidate responses, sorted short to long. The reward model was trained on labeler clicks where, on average, labelers slightly preferred longer answers, so the RM learned length as a feature. With β too small, the policy concentrates on the longest, worst response. Values below are as captured from the widget.]

id | response | length | RM r(c) | true quality | policy π(c)
A | Direct, accurate, addresses the question in two sentences. | 28t | 1.20 | 92% | 2%
B | Good answer, but adds restated context, mild hedging, a closing summary. | 86t | 2.10 | 78% | 7%
C | Long-winded, repeats premise, lists caveats, drifts off-topic, ends with a recap. | 220t | 2.80 | 55% | 19%
D | Wall of text. Padding, hedges, vague generalities, no concrete advice. | 380t | 3.18 | 32% | 31%
E | Maximum length. Gibberish-adjacent. The reward-hack winner. | 540t | 3.42 | 18% | 41%

With β around 0.6 the policy lands in a healthy regime, with most of its mass on responses A and B, the short and accurate ones. Pull β down toward 0.05 and the optimizer chases the RM's spurious length preference until it lives on candidate E. The fix isn't to optimize harder. It's to stop optimizing past the point where the proxy stops correlating with the true objective.
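The same effect can be reproduced in a few lines using the five candidates' RM scores and true-quality values from the figure. The base policy below is an assumption (the figure does not state one), so the printed probabilities will not match the widget's π column; only the direction of the effect is meant to carry over.

```python
import math

candidates = ["A", "B", "C", "D", "E"]
rm_reward = [1.20, 2.10, 2.80, 3.18, 3.42]     # RM score per candidate (from the figure)
true_quality = [0.92, 0.78, 0.55, 0.32, 0.18]  # hidden quality (from the figure)
base_policy = [0.55, 0.30, 0.10, 0.04, 0.01]   # ASSUMED: base model favours short answers


def tilted_policy(base, rewards, beta):
    """Closed-form optimum: base(c) * exp(r(c) / beta), normalised."""
    log_w = [math.log(b) + r / beta for b, r in zip(base, rewards)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def expectation(policy, values):
    return sum(p * v for p, v in zip(policy, values))


for beta in (5.0, 0.6, 0.05):
    pi = tilted_policy(base_policy, rm_reward, beta)
    print(f"beta={beta:<5} E[RM reward]={expectation(pi, rm_reward):.2f} "
          f"E[true quality]={expectation(pi, true_quality):.2f} "
          f"mass on E={pi[4]:.2f}")
```

As β shrinks, the expected RM score keeps rising while the expected true quality falls, which is the proxy and the target coming apart.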

Sycophancy

Labelers, on average, prefer responses that agree with the framing of their question. Trained on those preferences, the model learns to agree with the user's framing even when it's wrong. Ask "Isn't it true that vaccines cause autism?" and a sycophantic model leans toward the premise. Ask "Don't you think this code is well-written?" and a sycophantic model agrees that the buggy code is well-written. The RM was, in a real sense, trained to do this: the labelers really did prefer those answers, on average. RLHF inherits the labelers' biases at full saturation.

Sycophancy is harder to fix than reward hacking because the bias is in the label distribution, not in some quirk of the model. The only way to remove it is to either (a) train labelers to not have it, which is hard, (b) generate adversarial preference pairs that explicitly contrast sycophantic vs. corrective answers, or (c) use a stronger reward signal (verifiable tasks, AI feedback against an explicit constitution) that doesn't rely on the median labeler's intuitions.

The helpful-vs-honest tradeoff

A model can be helpful (try hard to answer) or honest (admit uncertainty when it has any). These pull in opposite directions. Pure helpfulness training rewards confident answers; pure honesty training rewards "I don't know." Most RLHF systems land somewhere in between, and the exact location is a product decision more than a technical one. It's why the same base model can feel confidently wrong or annoyingly hedge-y depending on how its post-training was tuned.

The InstructGPT paper measured this gap directly. On the TruthfulQA benchmark, RLHF made the model more honest about some things (less likely to give the popular but wrong answer) and less honest about others (more confident-sounding when it had no business being). The labeler instructions explicitly asked for honesty, but the actual signal — "which of these two responses is better?" — doesn't separate honest hedging from helpful confidence. The model finds the local optimum for the labelers' aggregate taste, which is somewhere in between.

What InstructGPT actually showed

The result that mattered: a 1.3B-parameter InstructGPT model was preferred by labelers over the 175B-parameter base GPT-3. Same architecture family, same pretraining data, more than 100× fewer parameters in the smaller model. RLHF moved the apparent quality of the system by more than two orders of magnitude of scale. That number is what made every other lab take this seriously.

The paper also reported gains on truthfulness (TruthfulQA up around 10 points), reductions in toxic outputs (RealToxicityPrompts down meaningfully), and, importantly, no major regression on academic NLP benchmarks. The fear before InstructGPT was that post-training would tax raw capability: you'd get a friendlier model that was worse at the things pretraining made it good at. The paper showed that wasn't the case at the scale they tested. The "alignment tax" was small and could be partially clawed back by mixing pretraining gradients into the PPO stage (the paper's PPO-ptx variant).

It's also what reframed the picture of where capability lives. Before InstructGPT, the working model was "big pretrained model = good assistant." After, it was "big pretrained model = capability bank, post-training = the part the user actually talks to." That distinction is now structural in how labs plan models. Pretraining and post-training are separate orgs with separate budgets and separate cadences. A pretraining run takes months; a post-training cycle takes days to weeks. New behaviors ship through post-training. The base model gets refreshed once or twice a year.

What's changed since 2022

RLHF as written in the InstructGPT paper has been picked apart and replaced piece by piece, but the three-stage shape has stuck. The biggest changes:

DPO and the death of the explicit RM

Direct Preference Optimization (Rafailov et al., 2023) made an observation that should have been obvious in hindsight. Remember the closed form π*(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β)? Solve that for r as a function of π* and π_ref: r(x,y) = β · log [π*(y|x) / π_ref(y|x)] + const, where the constant depends only on the prompt x. Now plug that expression for r into the Bradley-Terry preference loss. Both responses in a pair share the same prompt, so the constant cancels in the difference r(x, y_w) − r(x, y_l), and the reward function vanishes from the objective. What's left is a loss directly on the policy:

L_DPO(π) = − E [ log σ( β · log[π(y_w|x)/π_ref(y_w|x)] − β · log[π(y_l|x)/π_ref(y_l|x)] ) ]

Read that as: "increase the policy's log-ratio over the reference for the winner, decrease it for the loser, weighted by β." No reward model artifact. No PPO loop. No on-policy sampling. You can train DPO with the same dataloader you'd use for SFT. It works. Open-source post-training largely converged on DPO and its variants (IPO, KTO, ORPO, SimPO) by mid-2024. The KL-to-reference-model is still in there — it's load-bearing — but you don't need a separate RM artifact, and you don't need PPO.
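A per-pair version of the DPO loss can be sketched directly from the formula. The inputs are assumed to be summed log-probabilities of each full response under the policy and the frozen reference; the specific numbers and β = 0.1 are made up for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (winner, loser) pair of sequence log-probabilities."""
    winner_log_ratio = logp_w - ref_logp_w
    loser_log_ratio = logp_l - ref_logp_l
    return -log_sigmoid(beta * (winner_log_ratio - loser_log_ratio))


# Policy identical to the reference: the loss is log(2), the coin-flip value.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0, beta=0.1)
# Policy raised the winner's log-probability: the loss drops.
winner_up = dpo_loss(-10.0, -15.0, -12.0, -15.0, beta=0.1)
# Policy raised the loser's log-probability instead: the loss rises.
loser_up = dpo_loss(-12.0, -13.0, -12.0, -15.0, beta=0.1)
print(at_reference, winner_up, loser_up)
```

Only log-probabilities from two forward passes (policy and reference) are needed per pair, which is why no sampling loop appears anywhere.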

DPO is not strictly better than PPO. It can't easily incorporate on-policy samples, which means it can't fix problems the reference model has but the preference data didn't surface. PPO can. The big labs still run PPO-style loops because at sufficient scale, the on-policy rollout signal matters. But for almost everyone outside the frontier labs, DPO is the right default.

Constitutional AI and RLAIF

Constitutional AI (Bai et al., 2022, Anthropic) replaces the human comparison labels with the model's own self-critique against a written list of principles ("the constitution"). The training loop becomes: model produces a response, model critiques its own response against principle X, model rewrites the response. The rewritten responses become SFT data. Then a second model is trained on AI-generated comparison labels — the model picks which of two of its own responses better satisfies the constitution — and that becomes RM data.

RL from AI Feedback (RLAIF) is the more general name for using a strong model as the labeler. The catch is obvious: you've now offloaded all the bias risk from human labelers onto whatever bias is in the labeling model. The benefit is also obvious: AI labelers are orders of magnitude cheaper than human ones and more consistent, and they can grade specific dimensions on demand ("is this response honest? helpful? polite? specific?") with separate scores.

Verifiable rewards and the reasoning wave

For tasks where you can mechanically check the answer (math problems with numerical solutions, code that has unit tests, formal proofs), you don't need a reward model at all. Just run the test. RLVR (RL with Verifiable Rewards) is the term for this regime. It is what's powering the recent reasoning-model wave: DeepSeek-R1, OpenAI o1, and downstream descendants are essentially InstructGPT's RL loop (DeepSeek-R1, for instance, uses the PPO variant GRPO) but with the RM replaced by a code grader or a math grader.

The shift matters because it changes what the optimiser is allowed to find. Against a learned RM, optimising hard finds reward hacks. Against a unit test, optimising hard finds programs that pass the unit test. The first is bad. The second is what you wanted. Verifiable rewards are why the reasoning models can train for orders of magnitude more compute without the RM going off the rails — there is no RM to go off the rails.
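A verifiable reward can be as simple as "did the candidate pass every test case." The sketch below assumes candidates are already available as Python callables; a real pipeline would execute model-written code in a sandbox, which is out of scope here. All names are hypothetical.

```python
def verifiable_reward(candidate_fn, test_cases):
    """Return 1.0 if candidate_fn passes every (args, expected) case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed test, not as a training-loop error.
            return 0.0
    return 1.0


test_cases = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]


def good_add(a, b):
    return a + b


def bad_add(a, b):
    return a * b


def crashing_add(a, b):
    raise ValueError("not implemented")


print(verifiable_reward(good_add, test_cases))
print(verifiable_reward(bad_add, test_cases))
print(verifiable_reward(crashing_add, test_cases))
```

The reward is only as good as the test suite: anything the tests do not check is still open to optimisation pressure.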

Process supervision and step-level rewards

The other axis of post-2022 progress is granularity. The original InstructGPT RM scored a whole response at a time. Process reward models (PRMs, Lightman et al. 2023) score the response step by step, giving credit for each correct intermediate step in a long chain-of-thought rather than only for the final answer. This makes credit assignment in long reasoning traces much easier and is one of the technical pieces sitting under modern reasoning models.
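The difference between outcome-level and step-level scoring shows up clearly on a toy arithmetic trace. Real PRMs are learned models; the rule-based `check_step` grader below is a stand-in assumption used only to illustrate the shape of the signal.

```python
def check_step(step):
    """Stand-in step grader: verifies one 'a op b = c' arithmetic line."""
    left, right = step.split("=")
    a, op, b = left.split()
    result = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return 1.0 if result == int(right) else 0.0


def outcome_reward(steps, expected_final):
    """One scalar for the whole trace: is the last line's result correct?"""
    final_value = int(steps[-1].split("=")[1])
    return 1.0 if final_value == expected_final else 0.0


def process_rewards(steps):
    """One scalar per step."""
    return [check_step(s) for s in steps]


# Task: compute 3 * 4 + 5 - 8, whose correct answer is 9.
trace = ["3 * 4 = 12", "12 + 5 = 18", "18 - 8 = 10"]
print("outcome reward:", outcome_reward(trace, 9))
print("process rewards:", process_rewards(trace))
```

The outcome reward only says the trace failed; the step-level rewards point at the second line as the place where it went wrong.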

But underneath all the variants: pretrain, fine-tune to look like the right shape, then optimise against some scalar signal under a tether to the base model. That's still the recipe. The signal can be a learned RM, a constitutional self-critique, an AI labeler, a unit test, or a step-level grader. The tether can be a KL-to-reference, a clipped policy ratio, or both. The shape doesn't change.

If you read the original

The paper is unusually accessible because it's mostly methods and human evaluations, not theorems. Read Section 3 (the three stages) and Section 4 (the eval results, especially the labeler-preference numbers). Skip the appendix on the first pass — but come back to it. The appendix has the labeler instructions, the prompt-distribution breakdown, and the most honest discussion in the paper of how labeler demographics affected the result.

If you want to understand why the KL term is doing what it's doing, the cleanest exercise is to fine-tune a small model with PPO against a hand-written reward function (something simple, like "reward responses that contain the word 'banana'") with β=0 and watch what happens to the samples. Then turn β up. You'll see the policy first cheat the reward (every response becomes "banana banana banana"), then with β too high it stops moving at all. The healthy regime is the narrow band where the model still produces sentences but the sentences increasingly include bananas. That band is where production assistants live.
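A much smaller stand-in for that exercise fits in one file. Instead of PPO on a real language model, the sketch below runs exact gradient ascent on E[r] − β·KL for a softmax policy over three canned responses, with reward equal to the number of times "banana" appears. The responses, the base probabilities, the step-size rule, and the β values are all assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def banana_reward(response):
    return float(response.split().count("banana"))


def train_policy(responses, base_probs, beta, steps=2000, lr=0.5):
    """Exact gradient ascent on E[r] - beta * KL(pi || base) over softmax logits."""
    rewards = [banana_reward(r) for r in responses]
    logits = [math.log(q) for q in base_probs]
    step = lr / (1.0 + beta)  # assumed heuristic: smaller steps when the KL term is stiff
    for _ in range(steps):
        pi = softmax(logits)
        score = [r - (beta * math.log(p / q) if beta > 0 else 0.0)
                 for r, p, q in zip(rewards, pi, base_probs)]
        baseline = sum(p * s for p, s in zip(pi, score))
        # Gradient of the objective w.r.t. logit j is pi_j * (score_j - baseline).
        logits = [l + step * p * (s - baseline)
                  for l, p, s in zip(logits, pi, score)]
    return softmax(logits)


responses = [
    "The moon orbits the Earth.",
    "A banana is a fruit that grows in bunches.",
    "banana banana banana",
]
base_probs = [0.60, 0.39, 0.01]

for beta in (0.0, 1.0, 50.0):
    pi = train_policy(responses, base_probs, beta)
    print(f"beta={beta:<5} policy={[round(p, 3) for p in pi]}")
```

With β = 0 the policy piles onto the degenerate response; with a very large β it barely leaves the base distribution; in between it shifts toward sentences that mention bananas.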

If you read one follow-up paper, make it the DPO paper. It clarifies the math the InstructGPT paper presented as RL machinery, shows that the RL machinery was always implicit in the preference loss, and gives you a much shorter path to running the recipe yourself. If you read two, add the Constitutional AI paper for the data-generation half of the story. Between InstructGPT, DPO, and Constitutional AI you have the entire post-training playbook circa 2024.

And if you want to understand why every chat model you've used has felt the way it does — a little hedge-y here, a little confident-wrong there, a little wordier than it needs to be — the answer is on the leash. β is somewhere. The RM has its biases. The labelers had their preferences. The model is doing exactly what it was optimised to do, on a proxy that wasn't quite the thing you wanted. RLHF didn't promise to solve that. It promised to give you a working dial. We are still figuring out where to set it.
