Machine Learning / 2025 / arXiv
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The reasoning was already in the base model. R1's contribution was figuring out how to pay it to actually use it.
For a long time, it looked like making language models reason better required teaching them to reason better. Build chain-of-thought datasets. Fine-tune on traces written by experts. Add scratchpads. Buy more compute and pretrain a bigger base. The implicit theory was that reasoning was a capability the model lacked, and the job was to give it to the model — write the right textbook, run the right curriculum, hope the model absorbs the form.
What the 2025 reasoning-model wave — OpenAI's o-series, then DeepSeek-R1 published the recipe openly — exposed is that this theory was partly wrong. A modern base model already contains most of the reasoning machinery. It can backtrack. It can self-verify. It can work through cases. It can notice that an earlier step was sloppy and patch it. It does these things rarely and inconsistently because nothing in next-token training rewards spending tokens on careful work versus blurting an answer. Reasoning was latent; what was missing was the incentive structure.
R1's setup is almost embarrassingly direct. Take a strong base model. Pose problems with checkable answers. Apply reinforcement learning, where the reward is just did you get the right answer. Don't tell the model how to reason. Don't curate reasoning traces. Just let the optimizer figure out, on its own, what kind of token sequences cash out into correct answers. The reasoning behavior — the long chains of thought with backtracking, the "wait, let me reconsider" patterns — emerges as a strategy the model discovers under reward pressure.
That's the whole story in one paragraph, and the rest of this piece is unpacking why each piece of it matters. The strong base. The verifiable task. The RL algorithm (GRPO) and the specific reason it's well-suited to this regime. The behaviors that emerge. The cheap distillation that follows. The new scaling axis the result opens up. And the places where this whole approach quietly stops working.
Why verifiable tasks unlock RL
The problem with applying RL to language model outputs has always been the reward function. RLHF approximates a reward by training a model on human preference data — which works, but the reward model is itself a learned approximation, prone to gaming, and it can't distinguish correct from plausibly-phrased. RLHF can teach a model to be polite. It can teach it to prefer one tone over another. It cannot reliably teach it to be right, because "right" isn't what the reward model is scoring. The reward model is scoring "sounds like a thing humans liked."
Math, code, and formal logic don't have this problem. There's a verifier. You can run the code; you can check the proof; you can compare the final numerical answer to ground truth. The reward signal is 0 or 1, exact, cheap, and uncorruptible. The model can't game the verifier by sounding more confident, because the verifier is an executor, not a judge. This is the property that makes RL work cleanly in this setting where it doesn't elsewhere.
And once you have a clean reward, gradient pressure does what gradient pressure does. The model's policy shifts toward token sequences that lead to verified-correct outputs. If those sequences happen to involve writing out a long deliberation, then the long deliberation gets reinforced. Nobody had to design that — it falls out. If those sequences happen to involve a mid-trace self-correction, the self-correction gets reinforced. Whatever cognitive moves correlate with verified answers get amplified.
The way to feel this in your gut: imagine the gradient as a hand that gently pulls the policy toward whatever it just rewarded. SFT's hand pulls toward whatever the demonstrator wrote. RLHF's hand pulls toward whatever the preference model liked. Verifier-RL's hand pulls toward whatever just produced a correct answer — and the only thing the model can adjust is its own token-emission policy. Over many millions of pulls, the policy ends up wherever "correct" lives.
GRPO, briefly and concretely
The RL algorithm R1 uses is Group Relative Policy Optimization (GRPO). It is a small, sensible variant of PPO designed for the verifier setting. The mechanics are clean enough to fit in a paragraph.
For each problem q, sample K rollouts o_1, ..., o_K from the current policy. Each rollout is a complete trace ending in an answer. Run the verifier on each rollout to get a reward r_i ∈ {0, 1}. Compute the group statistics: mean μ = (1/K) Σ r_i and standard deviation σ. The advantage of rollout i is the z-score within the group:
A_i = (r_i − μ) / σ.
Then the GRPO update is the standard PPO clipped objective using A_i as the per-rollout advantage and using a KL penalty against a reference policy to prevent the model from drifting too far from its starting distribution. There is no separate value network. The group itself acts as the baseline — the other rollouts on the same problem tell you what to compare each one against. That's the whole trick: PPO needs a value function to compute advantages; GRPO replaces it with on-the-fly intra-group normalization.
Why does this work? Because the value function in PPO is itself a learned approximation, and in the verifiable-reward setting it's mostly redundant. If you sample 8 rollouts on the same problem and 2 succeed, the successes have advantage ≈ +1.6 and the failures ≈ −0.5. The policy gets pushed toward the tokens that produced the successes, away from the ones that produced the failures, all without ever fitting a separate critic. Less code, fewer hyperparameters, less to tune.
GRPO needs the edge of competence
There's a subtle failure mode worth seeing. If μ = 0 (every rollout failed) or μ = 1 (every rollout succeeded), the standard deviation collapses and the advantages are all zero or all noise. That step produces no learning signal. Which means GRPO needs the model to be at the edge of its competence: problems easy enough that a few rollouts succeed, hard enough that not all do. This is why curriculum and base-model strength matter so much. A weak base can't get any rollouts right on hard problems, and a too-easy curriculum gives the policy no gradient. The sweet spot is the zone where the base sometimes-but-not-always succeeds. Most of the engineering of a reasoning RL run is keeping the model in that zone.
GRPO vs SFT, side by side
SFT clones the data; RL exceeds it
Two training runs on the same verifiable benchmark from the same base. SFT plateaus at the demonstration ceiling. GRPO with a verifier reward keeps climbing because the gradient is chasing correctness, not imitation. The right panel is the GRPO group at the current step: rewards, group mean, advantages.
SFT’s ceiling is the dashed coral line: it cannot exceed the data it was trained on. RL keeps climbing past that line because every step it samples K rollouts, rewards the verified-correct ones, and shifts the policy toward positive-advantage tokens. When all K fail, μ = 0, advantages are zero, and the step produces no learning signal — which is why curriculum and base-model strength matter.
What the model learns to do
The behaviors that emerge in R1's training are striking precisely because nobody specified them. Read the released traces and you'll see things like:
- Backtracking: "Wait, this approach won't work because [reason]. Let me try a different angle."
- Self-verification: "Let me double-check by plugging this back into the original equation..."
- Case-splitting: "There are two cases here. Case 1: ... Case 2: ..."
- Plan revision: "Actually, I should reconsider what the question is asking before continuing."
- Increased token spend on hard problems: harder problems get longer traces, automatically, without anyone telling the model that hard problems deserve more thought.
The R1 paper has a lovely figure showing the average response length over the course of RL training: it climbs from a few hundred tokens to many thousands. The model is learning to think longer as a strategy, and the reward signal is the only thing pushing it that way. Nobody wrote a regularizer that says "please think more." Thinking more was instrumentally useful for getting verified answers, so it got reinforced. The same gradient that pushes the policy toward correct tokens pushes it toward however many tokens of deliberation it took to land on those correct tokens.
There's a famous moment in the R1-Zero training run that the paper calls the "Aha moment." Mid-training, the model produces a trace where, after working through a problem one way and getting stuck, it pauses and writes something to the effect of "Wait. Let me reconsider this from the beginning." It then re-derives the answer correctly. Nobody trained it to say "Wait." Nobody curated examples of mid-trace re-derivation. The model discovered, through pure reward pressure, that this behavior — explicit mid-trace reconsideration — was instrumentally useful, and the policy started producing it spontaneously on hard problems.
Read that carefully. The model is, in a real sense, discovering an algorithmic technique — the one humans call "step back and reconsider" — by trial and error, because that technique pays out in verified-correct answers more often than its absence does. The reasoning behaviors aren't mysterious. They're whatever cognitive moves the base already had latent and the RL outer loop selected for.
reasoning as search
A trace, scrubbed
Problem: solve x² − 6x + 5 = 0. Drag the slider to step through the model’s tokens. Watch the moves that emerged from RL pressure: a verify, a branch, a backtrack, a commit. The right gutter tracks the model’s running confidence.
Notice step 4: the model considers shipping and instead chooses to verify by factoring. That verify move costs tokens but pushes confidence from 62% to 92% by step 6. The backtrack at step 7 is also free information: the model questions an assumption, finds nothing wrong, and ratchets confidence higher. None of this was demonstrated to the model. RL paid for it with verifier reward and the policy discovered it was worth the tokens.
R1-Zero: the cleanest evidence
The single most informative result in the R1 paper is the ablation called R1-Zero. The full R1 recipe has a supervised cold-start phase before RL — a small set of curated reasoning traces used to anchor the model's output style. R1-Zero strips that out. It applies RL directly to the base model with no SFT warmup at all. Just: base + verifier + GRPO.
It works. The model learns to reason. Trace lengths grow. Backtracking emerges. Self-verification emerges. Benchmark scores climb on math and code. The behaviors look like the full R1's behaviors, derived from nothing but reward pressure on a base.
R1-Zero also has problems. The traces are correct but ugly. They mix languages mid-trace (English drifting into Chinese and back). They use weird formatting. They're hard for humans to read. The full R1 recipe adds the cold-start SFT phase before RL specifically to fix this — to anchor the output distribution to readable, well-formatted English before letting RL go to work on the reasoning. The cold-start data is not teaching reasoning. It's teaching presentation. The reasoning still comes from the RL phase.
Reading the R1-Zero result carefully is worth your time, because it's the cleanest demonstration that the reasoning behavior is genuinely emerging from reward pressure on a base model and not being smuggled in via curated data. If the cold-start SFT were doing the heavy lifting, R1-Zero would fail. It doesn't. It just looks weird. That tells you the reasoning capability and the readability are decoupled, and only the readability needs human anchoring.
Test-time compute as a third scaling axis
Pre-2024, the way you got better outputs from a language model was: train a bigger model, or train it on more data. Inference was a fixed cost. You ran the model, you got an answer. The classical scaling story (Kaplan, Chinchilla) had two axes: parameter count and training tokens. Inference was just "deploy the trained thing."
Reasoning models broke this assumption. R1 (and o1, and the wider class) demonstrated empirically that you can hold the model fixed and spend more inference compute per query — generate longer traces, generate multiple traces and aggregate, branch and verify — and accuracy goes up. There's a clean scaling curve relating inference tokens to solve rate on hard benchmarks like AIME or competition coding. Double the inference budget; gain several points of accuracy. Double again; gain a few more, with diminishing returns past some saturation point.
This adds a third axis to the scaling story. We had train-time parameter scaling and train-time data scaling. We now have inference-time compute scaling, and on hard reasoning benchmarks it's competitive with or better than spending the same dollars on a bigger base model. The implication for systems design is large: inference is no longer a fixed cost line item; it's a knob that buys quality.
The reasoning dial reshapes deployment
The corollary is that frontier intelligence is not just "how big is the model." It's "how big is the model times how much compute you're willing to burn at inference." A smaller reasoning model with a generous inference budget can match a bigger non-reasoning model on tasks where extra deliberation actually helps. This changes the deployment economics, the API pricing, and the way you should think about model evaluation.
It also changes what an "open weights" release means. A leaked checkpoint of a reasoning model is more powerful than a leaked checkpoint of a non-reasoning model of the same size, because the deployer can dial up inference compute and trade dollars for capability. The same weights deliver different capabilities at different budgets. Pricing pages for reasoning APIs now have explicit tiers like "reasoning effort: low / medium / high." Those tiers are the same model with different inference budgets. The dial wasn't there before.
pass@k vs inference budget
Inference compute is the third scaling axis
Two models, same family, same problem. The base model’s success rate climbs slowly with reasoning-token budget. The RL-trained model converts the same tokens into far higher pass rates. Hold weights fixed, spend more compute, get more right.
At B = 800, the same problem yields 78% pass@1 for the RL model versus 32% for the base — a 46-point gap with no weight changes after their respective trainings, just budget. The gold ring marks where doubling B buys ≤ 2 points: past that, more thinking yields little. RL training pulls the whole curve up and shoves the saturation knee leftward.
the third scaling axis
Cost vs accuracy, six models, one plot
Drag the budget bar. Toggle models. The pareto frontier (gold) is the best accuracy any enabled model achieves at each token budget. Notice the small RL model dominating the large base model below a certain budget, then yielding past it.
At B = 2,048, the top model is large RL at 89%. Toggle the large base off and watch a much smaller RL model take the frontier in the mid-budget regime. The economic implication: if your task tolerates extra latency, a small reasoning model with a generous budget often beats a large non-reasoning model on cost, accuracy, or both.
Distillation: cheap students from expensive teachers
The R1 paper makes a second contribution that's almost as important as the RL recipe itself. Once you have a strong RL-trained reasoner, you can use it to generate training data for smaller models. Run the big reasoner on a corpus of problems, keep the traces that produced verified-correct answers, and supervised-finetune a smaller base model on those traces.
The result: the small model picks up the form of the reasoning behavior — the structure, the backtracking, the verification habits — without having to discover it through expensive RL. The DeepSeek team showed this works dramatically well, distilling R1 into 7B and 32B Qwen and Llama bases that perform competitively with much larger non-reasoning models.
Why does this work? Probably because the hard part is finding the right reasoning strategies in the first place — that's what the RL outer loop pays for. Once those strategies are visible in the form of token sequences, imitating them is comparatively easy. RL discovers; SFT clones. When the thing being cloned is a small set of useful cognitive moves, cloning is enough.
The data source moved, not the method
There's a more subtle point here. SFT on raw human-written CoT data has a ceiling at human-written CoT quality. SFT on RL-discovered traces has a ceiling at RL-discovered trace quality, which can be much higher. The ceiling moved because the data source moved. Distillation isn't replacing one kind of training with another; it's replacing one kind of data with a much better kind, and using the same training method. The R1 small models are better than non-reasoning models of the same size because they were trained on better data, not because they were trained more cleverly.
This is also why open-weight reasoning models proliferated quickly after R1 dropped. You don't need to redo the RL — you can distill R1's traces into your favorite base. The capability spread from one team's expensive training run to many teams' cheap finetunes within months. That's a different diffusion pattern than we saw for GPT-4-class capability, and it's enabled by the fact that reasoning-as-trace is observable and copyable.
students from a teacher
Distillation amortises the RL bill
A small student trained on raw expert demos climbs the size ladder slowly. The same student trained on traces from an RL-discovered teacher climbs much faster. The teacher does the expensive search once; the students copy the form.
Why does γ > β? An RL-discovered trace is not just a correct answer; it is a correct process: the backtracking, the verification, the case-splitting. Each trace teaches the student a small piece of a search strategy. Raw demos teach answers. Strategies compose across problems; answers don’t. The horizontal arrow is the practical punchline: a distilled small model often matches a much larger demo-trained model on the same benchmark, for a fraction of the inference bill.
What this is *not*
A few things it's worth being careful not to read into the R1 result.
It is not a new architecture. R1 is a transformer. Nothing about the network shape changed. The capability gain came from a different training objective on the existing architecture.
It is not a new RL algorithm. GRPO is a small variant of PPO and is barely the point. People have replicated R1-style results with vanilla PPO, with REINFORCE, with various PPO descendants. The choice of algorithm is the least important variable. What matters is the verifier and the base.
It is not chain-of-thought prompting. CoT prompting is a 2022 result about asking a base model to "think step by step" at inference time. It helps a little. R1-style RL is a training intervention that produces a model whose default behavior is to think step by step and verify, with the moves chosen by the model rather than by the prompt engineer. The two approaches are stacked, not equivalent.
It is not necessarily a step toward AGI. It's a recipe that works extremely well on tasks with cheap reliable verifiers. Whether the resulting capabilities transfer cleanly to soft, open-ended, real-world reasoning is an empirical question, and the current evidence is mixed. More on this in a moment.
Where this stops working
The biggest limit is the verifier. Reasoning RL works wherever you have cheap, reliable, automatic correctness checks: math, code, formal logic, structured games, problems with closed-form numerical answers. Outside that domain — open-ended writing, creative tasks, multi-step real-world reasoning where the answer isn't a number — the reward signal collapses, and the whole approach loses its grip.
Extending verifier coverage is the central frontier question right now. A few directions are being pushed:
- Process reward models (PRMs) reward correct intermediate steps rather than just final answers. They help in domains where final-answer verification is too sparse — say, long mathematical proofs where many partial credit signals are available.
- LLM-as-judge uses a frontier model as an approximate verifier for tasks lacking ground truth. This brings back the gameability problem RLHF has, but it's the most general-purpose hammer available.
- Tool-grounded verification runs the model's outputs through real tools (a Python interpreter, a search engine, a calculator) and uses success-of-tool-call as the reward. Effectively, this expands the set of things that count as a "verifier" by treating tool outputs as ground truth.
- Constitutional / rubric-based reward scores outputs against a structured rubric a frontier model evaluates. Cheaper than humans, more interpretable than a black-box reward model, but still vulnerable to gaming.
Building reliable verifiers for soft domains is, right now, where most of the engineering attention is. The dream is a verifier whose reliability approaches the math-and-code regime in domains like medical diagnosis, legal reasoning, scientific hypothesis evaluation. Nobody has cracked it yet. The current best results are domain-specific verifiers built by domain experts — which works but doesn't scale to the long tail.
There's also the faithfulness question. The reasoning traces look like the model thinking, but they aren't necessarily what the model is actually doing internally. A trace can produce the right answer for the wrong reason, or the model can generate post-hoc rationalizations that don't correspond to its real computation. We don't have great tools to tell when the trace reflects causal computation versus when it's confabulation that happens to land on the answer. For applications where we need to trust the reasoning (medical, legal, scientific), this is a real gap. Recent work probes faithfulness by perturbing intermediate steps and checking whether the final answer changes — sometimes it does, sometimes it doesn't, in ways that don't correlate well with whether the trace is verbally coherent.
Does math-RL transfer to messy domains
And generalization beyond verifiable domains. If a model trained primarily on math and code RL becomes a better general reasoner — better at law, better at biology, better at messy human problems — that's the optimistic story and there's some evidence for it. The reasoning patterns the model picks up on math (verify before committing, consider alternatives, decompose into cases) are domain-general moves; they show up in human reasoning across all kinds of problems. If the model's instantiation of those moves transfers, you get capability gains far outside the training domain. Early evidence suggests some transfer happens, but the magnitude is debated and probably depends on how much the soft domain shares structural features with verifiable ones.
If those skills are narrow and stay narrow, the optimistic story breaks: reasoning models become specialised tools for math and code, useful but not general. The current evidence is mixed and it's the empirical question that will determine how big a deal reasoning RL ultimately is.
Reward hacking, mode collapse, and other pathologies
RL on language models has a long history of finding clever shortcuts the designers didn't intend. The reasoning-RL setting is no exception. A few of the failure modes that show up in practice.
Verifier gaming. If the verifier accepts answers in a forgiving format (any string containing the digits of the right answer, say), the model finds ways to write outputs that match the format without actually solving the problem. The fix is tighter verifiers, but tighter verifiers are also more brittle and reject correct answers in unusual formats. There's a tension.
Trace performance. The model learns that certain phrases ("let me reconsider," "actually wait") correlate with correct outputs in training and starts inserting them performatively, regardless of whether they're doing real work. The trace looks like reasoning; under the hood it might just be cargo-culting reasoning-shaped tokens.
Mode collapse on easy problems. Once the model finds a strategy that works on a class of training problems, it can collapse onto that strategy and stop exploring. Mitigations include entropy bonuses in the loss, periodically reseeding the training mix with harder problems, and using a higher temperature during rollout sampling.
Forgetting. Aggressive RL on math and code can degrade general-purpose capabilities the base had. A model fine-tuned hard on competition math sometimes gets noticeably worse at writing email. The KL penalty against a reference policy in PPO/GRPO mitigates this, but tuning that penalty is an art. Too tight and the model can't move; too loose and it forgets.
All of these are tractable engineering problems, not fundamental obstacles. But they're real, they show up in every real run, and they explain why the published recipes look simple while the actual training runs require careful babysitting.
The descendants
R1 dropped in early 2025. Within a year the field looked different in ways worth naming.
The o-series at OpenAI had pioneered the verifier-RL recipe in 2024 with o1. They followed with o3 and o3-mini, pushing the inference-compute axis hard. o3 in particular demonstrated that allocating very large inference budgets (millions of tokens per query) could push benchmarks like ARC-AGI to near-human performance, at significant per-query cost. The o-series papers are mostly closed; R1's contribution was publishing a credible open recipe.
Open reasoning models proliferated. Within months of R1, there were dozens of distilled or RL-trained reasoning models from various labs, sized from 1.5B to 70B. The diffusion was rapid because the recipe was cheap to replicate once the teacher existed: distill R1 traces into your favorite base, optionally add a small amount of additional RL, ship.
Search-augmented reasoners layer external search loops on top of a reasoning model. The model can call out to a retriever, a web search, a Python interpreter, mid-trace, and the verifier reward is then attached to the combined trace including tool calls. This expands the set of problems the verifier-RL approach can handle to anything a tool can ground.
Multi-turn reasoning agents push further: the model interacts with an environment over many turns, each turn possibly involving reasoning and tool use, and the reward is delivered only at the end of the episode. This is the regime where reasoning-RL meets classical RL: long horizons, sparse rewards, credit assignment across multi-step trajectories. Most of the open research frontier on "agents" in 2025-2026 is here.
Process reward models got a second life. R1 itself doesn't use them — it relies on final-answer verification — but in domains where final-answer verification is too sparse, process reward models that score intermediate steps started to look attractive again. They're harder to train and easier to game, but they unlock domains where reward density is the bottleneck.
What R1 changed about how to think about LLMs
If you came up in the era of pretrain-and-finetune, the natural mental model is: capabilities are stored in weights, and you build them by writing the right loss against the right data. The R1 result perturbs that picture in a useful way.
Capabilities are also stored in policies over token sequences. The base model already contains many possible behaviors — different ways of breaking down a problem, different verification habits, different rates of token spend. Pretraining picks one mixture; RL re-mixes. The base has the components. The training objective decides which components run.
This is why the same base, trained with different objectives, ends up with such different behaviors at inference time. RLHF makes it polite. Verifier RL makes it deliberate. SFT on a particular style makes it that style. The weights are mostly the same; the policy lives in a different region of the high-dimensional space the weights define.
It also reframes "prompting" and "training" as points on the same continuum. A prompt is a mid-trajectory adjustment to which behaviors the model selects. A finetune is a pre-baked adjustment to the policy itself. RL is a way to discover new policies that no prompt would have surfaced. The boundary between "how I ask" and "what the model is" is fuzzier than it looks. R1 just made one specific point on that continuum work much better than people thought possible.
If you read the paper
The R1 paper is unusually candid about what worked and what didn't. The R1-Zero ablation (Section 2.2) is the cleanest evidence in the paper — it strips away the SFT cold-start and shows pure RL still produces reasoning. The training-curve figures (Section 2.2.2, 2.2.3) showing response length and benchmark performance climbing together over RL steps are the key empirical claim. The distillation results (Section 3) show how cheaply the capability transfers to small bases. The limitations section (Section 5) is honest about language mixing, prompt sensitivity, and the failure of the approach on tasks lacking verifiers.
The exercise that's worth your time: take a base model, set up a math benchmark with automatic grading, implement GRPO (it's not much code), and run it. Even at small scale, on a small base, you'll see response lengths climb and self-correction patterns emerge. Once you've watched a model discover backtracking on its own, the whole reasoning-RL story stops being abstract. It becomes one of the simplest, most reproducible "emergent behavior" demos in modern ML — a couple hundred GPU-hours, a verifier, a base, and a willingness to wait for the curves to bend.
And then notice what you didn't have to do. You didn't curate reasoning traces. You didn't write a regularizer that rewards length. You didn't tell the model to verify. You set up a verifier, applied gradient pressure, and watched a base model discover, on its own, that careful work pays.