Machine Learning / 2020 / arXiv

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models — Kaplan et al.

Before this paper, training a frontier model felt like betting. After it, you could draw the curve on a napkin and quote a number.


Imagine you have ten million dollars of compute and one question: how big should the model be? Pre-2020, the honest answer was a shrug and a folk theorem. People knew bigger usually helped. Nobody could tell you whether 13B parameters with these tokens would beat 6B with twice the tokens without actually training both. Pretraining was a craft, the kind of thing you got better at by training a lot of models and developing taste.

Kaplan et al. did the unromantic thing. They trained hundreds of small and medium language models — different widths, different depths, different dataset sizes, different compute budgets — and plotted the resulting test loss on log-log axes. The plots were, embarrassingly, straight lines. Loss falls as a power law in each of the three knobs, over many orders of magnitude, with no funny business at the boundaries.

That single empirical fact is what made the whole post-2020 industry possible. If you can fit a clean curve at small scale, you can extrapolate to the giant scale and quote a loss before you spend the budget. Pretraining went from craft to forecast. The question stopped being "will the bigger model be better?" and became "by how many nats per token, and how much compute will it take to get there?"

This article walks through what the paper actually said, why the result is so unreasonably useful, where the original recipe was wrong, and where the framework still applies in 2026 — including the awkward bits, like reasoning models and inference-time compute that the original three-axis surface had no axis for.

A small disclaimer up front: the rest of this article uses a mix of the original Kaplan exponents and the Hoffmann/Chinchilla refit, depending on which is more honest in context. The interactive demos use Hoffmann's joint form (L = E + A/N^α + B/D^β with E ≈ 1.69, A ≈ 406, α ≈ 0.34, B ≈ 411, β ≈ 0.28) because that's the form that holds up when you re-run the experiments carefully. The intuitions are the same. Only the precise slopes differ.

The engineering problem the paper solved

Forget the theory for a moment. The real question this paper answered is operational: I have a budget. How do I spend it? A frontier pretraining run in 2026 costs somewhere between $50M and $500M. The architecture decisions you make on day one — how many parameters, how many tokens, what the learning rate schedule looks like — bake in a loss outcome that you don't get to renegotiate six months later when the cluster is hot.

Before scaling laws, the way you decided was: train a 1B model, look at it, train a 6B model, look at it, train a 13B model, run out of money, ship whatever you have. The decisions were ordinal. Bigger is better, more data is better, train longer is better, your move. When two of those compete — say, you have to choose between a 70B model trained on 1T tokens and a 30B model trained on 3T tokens — the ordinal answer doesn't help. You need numbers.

The Kaplan paper turned the budgeting problem into arithmetic. You fit the curve on small runs that fit in a corner of your cluster. You read off the predicted loss for the run you actually want to do. If two configurations sit at the same compute cost but at different points on the surface, the curve tells you which one ends up at lower loss. The frontier-lab planning meeting changed shape: here is the curve, here is our position on it, here is what moves the position.

What a power law actually buys you

A power law in loss, in its simplest form, looks like L(N) = (N_c / N)^α + L_∞ where N is parameter count, N_c is a fitted constant with units of parameters, α is a small exponent (about 0.076 for parameters in the original paper, larger in later refits), and L_∞ is the irreducible loss: the entropy of the data, which the model can never beat no matter how many parameters you give it.

The intuition is more useful than the formula. On log-log paper, every doubling of N shaves a fixed amount off the log of the excess loss, the part of the loss above the floor. Double parameters, get a bump. Double again, get the same bump. There is no magic threshold and no diminishing-returns elbow inside the regime they tested. The curve does not curve.

That last sentence is the entire reason capability forecasting works. If the curve had a knee, you could be one knee away from disaster every time you scaled. Because it doesn't, the loss at 100x your current run is a quantity you can write down with a ruler. People draw the line on log-log paper and read across. That is, mechanically, what the field has been doing since 2020.

The functional form has three things in it and each one means something physical. The (N_c/N)^α term is the capability deficit from not having enough parameters to represent what the data is asking you to learn. As N grows, that deficit shrinks like a power law. The L_∞ term is the bottom of the well: the part of the prediction problem that is genuinely irreducible, because human text is generated by a process with non-zero entropy. The α exponent is the slope on log-log paper. A larger α means each doubling of parameters helps more; a smaller α means you need bigger and bigger jumps to keep moving.
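To make the "same bump per doubling" claim concrete, here is a minimal sketch of the single-resource law. The constants are assumptions for illustration: `ALPHA` and `N_C` are roughly the parameter-law values reported in the original paper, and `L_INF` is a placeholder floor chosen for the demo, not a fitted number.

```python
# Illustrative constants (assumptions, not fitted here).
ALPHA = 0.076    # parameter exponent, roughly the value reported by Kaplan et al.
N_C = 8.8e13     # fitted scale constant, in parameters (approximate)
L_INF = 1.0      # placeholder irreducible loss, chosen only for this demo

def loss(n_params: float) -> float:
    """Single-resource power law: L(N) = (N_c / N)^alpha + L_inf."""
    return (N_C / n_params) ** ALPHA + L_INF

def excess_ratio_per_doubling(n_params: float) -> float:
    """Factor by which the excess loss (L - L_inf) shrinks when N doubles."""
    return (loss(2 * n_params) - L_INF) / (loss(n_params) - L_INF)

# The ratio equals 2 ** -ALPHA at every scale, independent of N.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}  L={loss(n):.4f}  ratio={excess_ratio_per_doubling(n):.4f}")
```

With these values each doubling multiplies the excess loss by 2^-0.076, about 0.949, so roughly 5% of the remaining gap goes away per doubling, whatever the starting size.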

Why these three components and not something else? Because the data fits. The empirical move of the paper was not to propose a model and check it, but to fit families of curves and pick the simplest one that explained the points. The fit is so clean across so many orders of magnitude that the functional form effectively writes itself.

Same shape, three resources

There are three resources, not one. Parameters N. Training tokens D. Compute C ≈ 6ND. Each has its own power law against loss when the others are not the bottleneck. The clean experimental discovery in the paper is that all three give straight lines on log-log paper, with their own slope and their own constant.

Before you take that for granted: it could easily not have been true. Loss could have been smooth in N but jagged in D, or vice versa. There could have been a regime change at some particular dataset size. Empirically, no. All three are clean. That is the experimental observation that earned the paper its citations.

[Interactive figure: Three resources, three straight lines. Three side-by-side log-log panels of the same family of fits, each plotting log(L − E) against log resource: L vs N (parameters, with D held fixed, e.g. 199.5B tokens), L vs D (tokens, with N held fixed, e.g. 10.0B params), and L vs C (compute-optimal, along N*(C), D*(C), with no held-fixed slider because nothing is held fixed).]

Drag the held-fixed sliders. The lines slide up and down (the intercept changes) but the slope stays put. The exponent on each resource is a property of the data and the architecture, not of the scale you happen to be at.

The compute panel is special. It walks the compute-optimal frontier: at each compute budget C, you pick the (N, D) pair that minimises loss subject to 6ND = C. The slope of loss along that frontier is the exponent that actually matters for budgeting, because frontier labs spend on compute, not directly on parameters or tokens.
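For the joint form used in the demos, L(N, D) = E + A/N^α + B/D^β, the frontier and its slope can be worked out in closed form. The derivation below is a sketch that assumes C = 6ND holds exactly; G is shorthand introduced here, not a symbol from either paper.

```latex
\begin{aligned}
&\min_{N}\; E + A N^{-\alpha} + B\left(\frac{C}{6N}\right)^{-\beta}
\quad\Longrightarrow\quad
\alpha A N^{-\alpha} = \beta B D^{-\beta} \\[4pt]
&N^{*}(C) = G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
\qquad
D^{*}(C) = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}},
\qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \\[4pt]
&L^{*}(C) - E \;\propto\; C^{-\gamma},
\qquad
\gamma = \frac{\alpha\beta}{\alpha+\beta}
\end{aligned}
```

With α ≈ 0.34 and β ≈ 0.28 this gives N* growing like C^0.45, D* like C^0.55, and a frontier exponent γ of about 0.154.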

Three knobs, one ridge

The catch in "a power law per resource" is the "when the others are not the bottleneck" clause. If you make the model huge but feed it a thimble of data, parameters stop mattering: the data law pins you. If you stuff a tiny model with a planet of tokens, the parameter law pins you. The frontier of useful spending is a narrow ridge where all three knobs are roughly co-bottlenecking. Off the ridge, marginal compute buys you very little.

This is constrained optimisation, not "bigger is better". It is the kind of result an economist would have written down first, except the economists don't have to physically rent the GPUs.

Below: drag the allocation between parameters and tokens and watch what happens to the loss surface. The ridge is the diagonal where neither axis is starved. Push too far in either direction and the predicted loss flattens — you are paying to keep idle capacity warm.

[Interactive figure: A loss surface, an iso-FLOP ridge, one optimum. Surface of L(N, D) = E + A/N^α + B/D^β over parameters N and tokens D.]

Drag N (parameters) and D (tokens). The dashed coral curve is the set of (N, D) pairs that cost the same compute as your current point. The gold dot is where loss bottoms out along that curve: the compute-optimal allocation for your budget.

Example readout: N = 31.6B parameters, D = 199.5B tokens, loss 2.081 nats/token, compute C ≈ 6ND = 3.79 × 10^22 FLOPs, gap to optimal +0.025.

For that compute budget the compute-optimal point sits at N ≈ 9.4B and D ≈ 672.2B tokens, about 72 tokens per parameter. Move off the gold dot in either direction along the coral curve and loss rises: too few tokens for the model, or too small a model for the tokens. The Chinchilla correction was, mechanically, this dot moving.

The clean curves only hold along the ridge. Most of the surface is regions where one resource is starving the others.
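The gold-dot calculation can be reproduced in a few lines. This is a sketch using the rounded Hoffmann constants quoted earlier in this article and the closed-form minimiser of the joint form under 6ND = C; the function names are illustrative, not taken from any library.

```python
# Hoffmann et al. parametric fit, rounded as quoted in this article.
E, A, ALPHA, B, BETA = 1.69, 406.0, 0.34, 411.0, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Joint form L(N, D) = E + A/N^alpha + B/D^beta, in nats per token."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(flops: float) -> tuple[float, float]:
    """Closed-form minimiser of loss(N, D) subject to 6 * N * D = flops."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (flops / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = (flops / 6.0) / n_opt
    return n_opt, d_opt

n, d = 31.6e9, 199.5e9            # an example point off the ridge
c = 6.0 * n * d                   # about 3.8e22 FLOPs
n_opt, d_opt = compute_optimal(c)
gap = loss(n, d) - loss(n_opt, d_opt)

print(f"C = {c:.2e} FLOPs, loss here = {loss(n, d):.3f}")
print(f"optimum: N = {n_opt / 1e9:.1f}B, D = {d_opt / 1e9:.1f}B, "
      f"tokens/param = {d_opt / n_opt:.0f}, gap = {gap:+.3f}")
```

Because the optimum comes from a first-order condition, the loss is flat near it: a factor-of-three error in N at this budget costs only a few hundredths of a nat.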

It is one thing to look at a heatmap and another to feel which knob is hurting you. The next demo answers a sharper question: if I am at point (N, D), which of the three loss terms is dominating? Click anywhere. The bar on the right decomposes the loss into the irreducible term, the parameter-deficit term, and the data-deficit term.

[Interactive figure: Which knob is the bottleneck? Clickable surface of L = E + A/N^α + B/D^β with a panel showing the per-term loss decomposition.]

Above the teal ridge you are data-rich, model-poor: spending on more tokens won’t move the needle, every dollar should buy more parameters. Below the ridge you are an undertrained giant: a model big enough to cure cancer, fed enough text to learn breakfast. The original Kaplan recipe lived below the ridge. Chinchilla moved the recommendation onto it.

The teal ridge is the locus where the parameter-deficit term and the data-deficit term are equal: the place where every marginal FLOP is split fairly between "make the model bigger" and "show it more data".
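The decomposition itself is simple enough to sketch. The snippet below assumes the rounded Hoffmann constants used elsewhere in this article and compares the raw sizes of the two reducible terms; `decompose` and `bottleneck` are hypothetical helper names.

```python
# Hoffmann et al. parametric fit, rounded as quoted in this article.
E, A, ALPHA, B, BETA = 1.69, 406.0, 0.34, 411.0, 0.28

def decompose(n_params: float, n_tokens: float) -> dict[str, float]:
    """Split L(N, D) into irreducible, parameter-deficit, and data-deficit terms."""
    return {
        "irreducible": E,
        "parameter_deficit": A / n_params**ALPHA,
        "data_deficit": B / n_tokens**BETA,
    }

def bottleneck(n_params: float, n_tokens: float) -> str:
    """Name the larger of the two reducible terms (the floor E is excluded)."""
    terms = decompose(n_params, n_tokens)
    reducible = {name: value for name, value in terms.items() if name != "irreducible"}
    return max(reducible, key=reducible.get)

for n, d in [(31.6e9, 199.5e9), (1e9, 1e12)]:
    terms = decompose(n, d)
    summary = ", ".join(f"{name}={value:.3f}" for name, value in terms.items())
    print(f"N={n:.3g}, D={d:.3g}: {summary} -> {bottleneck(n, d)}")
```

Comparing raw term sizes is the simplest reading of "which one is hurting you most". A stricter version would weight each term by its exponent, since the marginal return on a FLOP depends on α and β as well as on the size of the deficit.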

Kaplan's original recipe said to spend most additional compute on bigger models, with comparatively modest token increases. Roughly: N grows like C^0.73, D like C^0.27. That recommendation was wrong, in a productive way: two years later Hoffmann et al. ran Chinchilla and showed the ridge slope was off. Frontier models had been systematically too big for their token budgets. The shape of the law was right; the slope on one axis was mis-fit. That correction is itself a testament to how powerful the framework is. You can be wrong about the coefficients and still get the architecture of the answer right.
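A quick way to see what the Kaplan exponents imply is to ask what happens to N, D, and their ratio each time the budget grows. This is a sketch using only the two exponents quoted above; the helper names are illustrative.

```python
# Exponents quoted above for the original Kaplan recipe: N ~ C^0.73, D ~ C^0.27.
KAPLAN_N_EXP, KAPLAN_D_EXP = 0.73, 0.27

def growth_factors(compute_multiplier: float) -> tuple[float, float]:
    """How much N and D grow when the compute budget is multiplied."""
    return (compute_multiplier**KAPLAN_N_EXP, compute_multiplier**KAPLAN_D_EXP)

def tokens_per_param_multiplier(compute_multiplier: float) -> float:
    """Change in D/N implied by the recipe; below 1 means relatively fewer tokens."""
    n_growth, d_growth = growth_factors(compute_multiplier)
    return d_growth / n_growth

for mult in (10.0, 100.0, 1000.0):
    n_g, d_g = growth_factors(mult)
    print(f"{mult:>6.0f}x compute: N x{n_g:.1f}, D x{d_g:.1f}, "
          f"D/N x{tokens_per_param_multiplier(mult):.2f}")
```

Under this recipe every 10x of compute goes into a model about 5.4x larger seeing only about 1.9x more tokens, so tokens per parameter fall by roughly a factor of three per decade of compute. That drift is how frontier models ended up as undertrained giants.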

Why this works at all

It is genuinely a little weird that loss is smooth in scale. Neural networks are not smooth functions of their weights, training is stochastic, and the data is full of structure at every scale. Why should aggregate test loss be a clean power law?

The handwavy answer is that natural-language data appears to have a heavy-tailed distribution of patterns: a few extremely common motifs (the, of, and, basic syntax), many rare ones (technical jargon, named entities), a long tail of one-offs (a specific person's email signature). Each doubling of model size lets you pick up a roughly fixed slice of that tail. The exponent α is small because the tail is long; you never run out of new patterns to learn, you just learn rarer ones more slowly.

This is the kind of result that is more comfortable as an empirical observation than a theorem, and the theory side of the field has been chasing a clean derivation ever since. There are partial results — for instance, Sharma & Kaplan (2022) tied the exponent to the intrinsic dimension of the data manifold, and there are renormalisation-flavoured arguments in the lottery-ticket and feature-learning literature — but no one has produced L(N, D) = ... from first principles in a way that the field accepts. We have the curve. We don't entirely know why.

What you should take away: the smoothness is what makes the engineering work. As long as it holds, every dollar of compute has a known marginal value in loss. You can plan.

Zipfian tails as the underlying mechanism

There is also a cleaner way to state the same intuition. Suppose you have a fixed amount of information in the data: a certain irreducible entropy plus a certain amount of learnable structure. The model's job is to recover that structure. If structure comes in chunks of differing rarity, and a doubled model picks up roughly a fixed fraction of the remaining rarity tail, you get geometric returns: each doubling multiplies the remaining gap by a fixed factor, which shaves a fixed amount off its logarithm. A fixed drop in log-gap per doubling is a straight line on log-log axes, which is the power law. The whole picture is consistent with how Zipfian distributions behave in classical information theory; it's just that nobody has nailed down which assumption produces exactly the observed exponent.
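The argument above can be checked on a toy model. The sketch below is not from the paper and makes strong assumptions: patterns have Zipfian frequencies p_k proportional to k^-2, a model of "capacity" n has learned exactly the n most frequent patterns, and the remaining gap is the probability mass of everything it has not learned. The exponent 2 is chosen only so the finite sum converges quickly.

```python
import math

ZIPF_EXPONENT = 2.0      # toy choice: p_k proportional to k^-2
N_PATTERNS = 200_000     # finite stand-in for an unbounded tail

weights = [k**-ZIPF_EXPONENT for k in range(1, N_PATTERNS + 1)]
total = sum(weights)

def unlearned_mass(capacity: int) -> float:
    """Probability mass of patterns ranked beyond `capacity` (the toy remaining gap)."""
    return sum(weights[capacity:]) / total

capacities = [16, 32, 64, 128, 256, 512, 1024]
gaps = [unlearned_mass(c) for c in capacities]

# Each doubling of capacity multiplies the gap by nearly the same factor.
ratios = [later / earlier for earlier, later in zip(gaps, gaps[1:])]

# Slope of log(gap) against log(capacity) between the two ends of the sweep.
slope = (math.log(gaps[-1]) - math.log(gaps[0])) / (
    math.log(capacities[-1]) - math.log(capacities[0])
)
print("per-doubling ratios:", [f"{r:.3f}" for r in ratios])
print(f"log-log slope: {slope:.3f}")
```

In this toy the slope comes out close to -1, far steeper than anything observed for language models. A Zipf exponent s gives a slope near -(s - 1), so a heavier tail (s closer to 1) gives a smaller exponent, which matches the "α is small because the tail is long" intuition without explaining the specific value.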

The other intriguing piece is that the exponent depends on architecture only weakly. Kaplan tested width, depth, learning-rate schedules, and dataset compositions, and the exponent on N moved by maybe 10-20% across those choices. The implication is unsettling: the slope of the law looks more like a property of language data than a property of transformers specifically. Convolutional language models, RNNs, and state-space models all follow nearby curves, with different prefactors but similar slopes. The dominant variable is what you're learning, not how you're learning it.

Forecasting, in practice

Here is how this gets used inside a lab. You do twenty or thirty cheap small runs at 100M, 300M, 1B, 3B parameters with various token counts. Each one costs maybe a few thousand dollars. You fit the joint power law to those points. Then you read off the loss at the scale you actually want to run, which costs hundreds of millions. The ratio of fit cost to the cost of the run being forecast is something like 1:1000.

It is fair to ask: how good is that forecast? Extrapolation is usually the dangerous thing. The defence is empirical. People have repeatedly fit on small runs and verified on larger runs and the residuals are tiny. Not zero — you'll see noise of order 1–3% on the loss — but small enough that decisions don't flip.

[Interactive figure: How far can you extrapolate a fit? Fit small, predict big. Example readout with the fit restricted to runs with C ≤ 10^20.50 FLOPs: fitted γ (compute exponent) 0.160, train RMSE 0.0178, held-out RMSE 0.0523.]

Slide the cutoff to choose how many cheap small runs the fit gets to see. Everything to the right is held out: those are the runs whose loss the fit has to predict without seeing them, and the dots there show the residuals you would book against your forecast. The bottom panel shows the residuals, which collapse as the fit gets more lever arm.

With a low cutoff the fit is locked in by a tiny lever arm; the held-out residuals can drift to a few hundredths of a nat per token. Push the cutoff right and the residuals collapse into the noise floor. This is exactly what frontier labs ride: cheap small runs, then quote the big one with a measured error bar.

The γ in the readout is the compute-optimal exponent: the slope of log excess loss, log(L − E), against log compute along the ridge. Notice how stable it is across cutoffs. That stability is the property the whole industry rides. If γ wandered every time you re-fit, frontier labs couldn't budget. It doesn't.

Concretely: a typical fitting protocol now is to run a staircase of small models — say ten configurations from 100M to 3B parameters, each at the compute-optimal token count for its size, and an equal number of slightly-undertrained or slightly-overtrained variants to break the (N, D) degeneracy. Each run is a few thousand dollars. Total fit cost: maybe $100k. The fit predicts loss at the planned $200M run to within ~1%. That's the trade. You spend 0.05% of the run cost on the forecast and you get to walk into the planning meeting with a number.
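The fit-then-extrapolate loop can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the "runs" are generated from the rounded Hoffmann form along the compute-optimal frontier with about 1% noise on the excess loss, the floor E is treated as known, and the fit is plain least squares in log space rather than whatever procedure a given lab uses.

```python
import math
import random

E, A, ALPHA, B, BETA = 1.69, 406.0, 0.34, 411.0, 0.28

def frontier_loss(flops: float) -> float:
    """Loss at the compute-optimal (N, D) for a budget, from the closed-form optimum."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (flops / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = (flops / 6.0) / n_opt
    return E + A / n_opt**ALPHA + B / d_opt**BETA

def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

# Synthetic cheap runs: eight budgets from 1e18 to 10^20.5 FLOPs, noisy excess loss.
rng = random.Random(0)
small_budgets = [10 ** (18 + 2.5 * i / 7) for i in range(8)]
noisy_excess = [(frontier_loss(c) - E) * math.exp(rng.uniform(-0.01, 0.01))
                for c in small_budgets]

# Fit log(L - E) against log(C), treating the floor E as known (a simplification).
slope, intercept = fit_line([math.log(c) for c in small_budgets],
                            [math.log(x) for x in noisy_excess])
gamma = -slope

big_budget = 1e24    # more than three orders of magnitude past the largest fitted run
predicted = E + math.exp(intercept + slope * math.log(big_budget))
actual = frontier_loss(big_budget)
print(f"gamma = {gamma:.3f}, predicted = {predicted:.3f}, actual = {actual:.3f}")
```

The forecast lands close to the synthetic truth because the generating law really is a power law and the floor is given. In practice the floor has to be fitted too, which is where most of the brittleness comes from.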

There is a subtle point about fitting the irreducible term. With short lever arms, the fit can't tell the difference between "the floor is at L = 1.7 and the curve is shallow" and "the floor is at L = 2.0 and the curve is steep". Both pass through the same handful of small-run points, because a higher assumed floor leaves less excess loss to explain and is compensated by a larger exponent. As the lever arm extends, the floor pins down. This is exactly why fitting on short ranges is brittle and why labs aggressively want to push their cheap runs as far as the cluster will let them.
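The floor degeneracy is easy to reproduce. The sketch below generates noise-free losses from an assumed "true" law over a single decade of compute, then fits the exponent twice with two different assumed floors. The true constants and both candidate floors are made up for illustration.

```python
import math

def fit_exponent(budgets, losses, floor):
    """Least-squares fit of log(L - floor) = intercept - gamma * log(C)."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(value - floor) for value in losses]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return -slope, y_bar - slope * x_bar

def predict(flops, floor, gamma, intercept):
    """Loss implied by a fitted (floor, gamma, intercept) triple."""
    return floor + math.exp(intercept - gamma * math.log(flops))

# Synthetic short-range data from an assumed law: floor 1.69, exponent 0.15.
TRUE_FLOOR, TRUE_GAMMA, TRUE_SCALE = 1.69, 0.15, 900.0
budgets = [10 ** (18 + 0.25 * i) for i in range(5)]       # one decade of lever arm
losses = [TRUE_FLOOR + TRUE_SCALE * c**-TRUE_GAMMA for c in budgets]

results = {}
for floor in (1.4, 2.0):                                  # two wrong guesses for the floor
    gamma, intercept = fit_exponent(budgets, losses, floor)
    rmse = math.sqrt(sum((predict(c, floor, gamma, intercept) - value) ** 2
                         for c, value in zip(budgets, losses)) / len(budgets))
    results[floor] = (gamma, rmse, predict(1e24, floor, gamma, intercept))
    print(f"floor={floor}: gamma={gamma:.3f}, in-range RMSE={rmse:.4f}, "
          f"forecast at 1e24 = {results[floor][2]:.3f}")
```

Both wrong floors fit the short range to within a few thousandths of a nat, yet their forecasts at 10^24 FLOPs differ by several tenths. In this sketch the higher assumed floor comes out with the larger fitted exponent.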

This trick of "fit small, predict big" is doing a lot of work in the modern AI economy. Investment decisions get made on these forecasts. Compute purchases get made on these forecasts. Multi-year roadmaps get made on these forecasts. When you read a prediction like "GPT-N will reach loss X by year Y", what's underneath is a power-law fit done on internal small runs, extrapolated several orders of magnitude, with all the usual caveats about whether the law holds at that scale. Most of the time it does. The interesting failures, where the curve bends or shifts, are the ones that make the news.

What the law doesn't predict

It is tempting to read the scaling-laws paper as saying capability is forecastable. That overshoots. The paper says pretraining loss is forecastable. Several things are downstream of that and behave less politely.

Benchmark accuracy is not loss. A model can chip away at log-loss while a multiple-choice score sits at chance, then jumps when the right token's probability finally exceeds its competitors'. That's where the 'emergent capabilities' debates come from: sometimes it's a real phase change in the task, sometimes it's an artifact of a non-smooth metric on top of a smooth loss curve.
Data quality is missing entirely from the original equations. They assume D tokens of roughly stationary text. Replace half your tokens with garbage and the curve breaks. Replace them with high-quality curated text and you slide along the curve faster than the law predicts. Modern frontier work is largely a data-quality story the original axes can't see.
Post-training — SFT, RLHF, RLAIF, DPO, the whole instruction-following stack — operates on a model after the loss curve has done its work. The pretraining law tells you how good the base model's next-token prediction is. It doesn't tell you how helpful the assistant will be after a million preference comparisons.
Inference-time compute. The 2024–2025 reasoning model wave (o1, R1, and friends) showed that you can spend compute at test time — letting the model think longer in chain-of-thought — and get more capability without touching the pretraining curve at all. The original three-axis surface had no axis for that.
Tool use, retrieval, agents. Anything where the model's effective context is augmented by external state lives outside the law. The law assumed a closed system: parameters, tokens, compute, loss. The frontier is increasingly an open one.
Mixture-of-experts. The simple parameter count N stops being a useful axis when only a fraction of parameters fire on any given token. You have to talk about active parameters and total parameters separately, and the prefactors in the law shift.

None of this invalidates the original result. It just bounds where the result applies. The pretraining loss curve is still there, still smooth, still useful for budgeting the base model. It's just one curve in a stack that has gotten taller.

Emergence: real, or a metric artifact?

The most heated argument that came out of scaling-laws-land is whether downstream capabilities emerge — appear suddenly at a threshold, in a way the smooth loss curve doesn't predict. The empirical fact is that some benchmarks do show step-function behaviour: chance, chance, chance, bang, well above chance. The question is whether those steps come from real phase changes in the model's internals, or from a discontinuous metric (like multiple-choice accuracy) being dropped on top of a smooth probability.

[Interactive figure: Smooth or emergent? Pick a metric. Same model, two metrics: log-loss and accuracy on a k-way multiple-choice eval, with controls for k (choices, e.g. 4), threshold sharpness (e.g. 8.0), and number of eval items (e.g. 50).]

Loss is the prediction objective and falls smoothly. Accuracy on a 4-way multiple-choice eval is a thresholded view of the same movement, and it can sit at chance for decades of compute, then jump.

Toggle between viewing the run as log-loss (smooth) and viewing it as multiple-choice accuracy (jumpy). Crank threshold sharpness up and the accuracy curve becomes a cliff. Drop eval items and the cliff turns into a staircase. Neither knob touches the loss curve, which is humming along the same straight line in nats per token. Some emergence is real: a discrete capability that suddenly clicks. A lot of it is what happens when you push a smooth probability through a hard threshold.

The Schaeffer et al. (2023) paper put numbers on this: a meaningful fraction of emergent capabilities, when re-evaluated with continuous metrics like log-likelihood instead of argmax accuracy, turn back into smooth curves. Not all of them. Some really do look like phase changes in some internal circuit clicking on. But the lesson for the careful reader of the scaling-laws paper is: smooth loss is consistent with both no emergence and apparent emergence, depending on what you put on the y-axis.
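The metric-artifact argument can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a reproduction of any paper's experiment: the per-token loss on the answer tokens follows a made-up power law in compute, the probability of each correct token is exp(-loss), tokens are treated as independent, and the benchmark only gives credit when all 20 answer tokens are right.

```python
import math

def task_loss(flops: float) -> float:
    """Toy per-token loss on the answer tokens: a clean power law in compute."""
    return (flops / 1e18) ** -0.25

def per_token_accuracy(flops: float) -> float:
    """Continuous metric: probability assigned to each correct token, exp(-loss)."""
    return math.exp(-task_loss(flops))

def exact_match_accuracy(flops: float, answer_length: int = 20) -> float:
    """Thresholded metric: every answer token must be right, assuming independence."""
    return per_token_accuracy(flops) ** answer_length

for exponent in range(18, 27, 2):
    c = 10.0**exponent
    print(f"C=1e{exponent}: loss={task_loss(c):.3f}, "
          f"per-token={per_token_accuracy(c):.3f}, "
          f"exact-match={exact_match_accuracy(c):.4f}")
```

The loss and the per-token probability move smoothly the whole way. The exact-match score stays near zero for the first few orders of magnitude and then climbs quickly, which on a benchmark plot reads as a capability switching on.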

How the field stress-tested it

The story of scaling laws after 2020 is the story of the field probing where they break and fixing the parts that broke.

Chinchilla (2022)

Hoffmann et al. did three things. First, they re-fit the law with cleaner experiments, holding compute fixed and sweeping the N/D ratio. Second, they got a different answer than Kaplan: at compute-optimal allocation, you want roughly 20 tokens per parameter, not the 1–2 tokens per parameter the Kaplan recommendation pointed toward. Third, they trained Chinchilla, a 70B model on 1.4T tokens, which beat the much larger 280B Gopher, trained on 300B tokens (roughly a fifth as many). The point landed.
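The Chinchilla-versus-Gopher comparison can be run through the joint form directly. This is a sketch using the rounded Hoffmann constants quoted earlier in this article; the 300B-token figure for Gopher is its reported training budget, supplied here rather than taken from the text above, and the predicted losses are what the fitted formula gives, not measured evaluation results.

```python
# Hoffmann et al. parametric fit, rounded as quoted in this article.
E, A, ALPHA, B, BETA = 1.69, 406.0, 0.34, 411.0, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Joint form L(N, D) = E + A/N^alpha + B/D^beta, in nats per token."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

models = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),     # 300B tokens: Gopher's reported budget
}

for name, (n, d) in models.items():
    print(f"{name}: C = {6 * n * d:.2e} FLOPs, tokens/param = {d / n:.1f}, "
          f"predicted loss = {loss(n, d):.3f}")
```

The two runs cost a similar amount of compute (about 5.9e23 versus 5.0e23 FLOPs), but the formula predicts a lower loss for the smaller model with more tokens, because Gopher's data-deficit term dominates its total.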

Why was Kaplan's answer different? Mostly because of how learning-rate schedules were handled. Kaplan used essentially the same schedule length for every run, so a loss read off partway through a run overstates what a model trained with a schedule matched to that token count would reach. Chinchilla matched the schedule length to each run's token budget, which changes the fitted trade-off between model size and data. A subtle methodological difference, a meaningfully different recipe at frontier scale.

If you re-run the bottleneck demo above with this in mind: pre-Chinchilla frontier models were sitting below the teal ridge — undertrained giants. Chinchilla-era models sit on the ridge. The recipe shift was, mechanically, the gold dot in the original surface plot moving.

Data ran out (sort of)

By 2023 a different problem appeared. Frontier runs started consuming a meaningful chunk of available high-quality web text. CommonCrawl is finite. High-quality books, code, papers, dialogue — also finite. The Chinchilla recipe says spend more on tokens, but at some point the tokens you can buy are the ones you wouldn't have bought.

The responses were varied and ongoing. Repeated data: just train on the same text again. The original law was fit assuming each token is seen once; repeats give diminishing returns but don't crash the curve catastrophically. Muennighoff et al. (2023) quantified this: roughly 4 epochs of high-quality data is fine, beyond that returns degrade fast. Synthetic data: generate text from larger or specialised models and train on it. Distillation: train a small model to match the output distribution of a large one, recovering some of the large model's capabilities at a fraction of the inference cost. Curation: spend compute filtering rather than training, because a clean token is worth several noisy ones.

Each of these is a way to keep moving along a scaling curve when the unmodified curve says you can't. None of them break the underlying law; they extend the regime in which it applies, or shift its prefactors.

Reasoning models added a fourth axis

Then o1 and R1 showed up and added something the original paper never imagined: spend compute at test time by letting the model produce long chains of thought, and capability goes up. The relevant graph is no longer only loss vs training compute. It is also accuracy vs inference compute, with a separate slope and a separate ceiling.

The accounting question for any frontier lab is no longer "where on the (N, D, C_train) surface should we sit?" It is "how do we allocate among pretraining compute, post-training compute, and inference compute, given that each has its own returns curve and the customer eventually pays for all of them?" This problem is not solved. It's the open scaling-laws problem of 2026.

The same shape, in other modalities

The original paper was about language. The same family of curves shows up almost everywhere people have looked: vision (Zhai et al. 2022, scaling ViT), multimodal (Henighan et al. 2020), code generation, protein language models, even decision-transformers in RL. Different prefactors. Same straight lines. Different irreducible-loss floors that depend on the entropy of the modality. Exponents on N and D in the same approximate ballpark.

Tasks where the loss function is itself non-stationary do not show clean scaling laws: for instance, RL with reward shaping that changes during training, or self-play environments where the opponent improves with the agent. Those produce loss curves that bend, plateau, and re-accelerate, in ways that can't be summarised by three constants. The lesson is that the scaling law is a property of prediction problems on a fixed distribution, not of training in general.

Reading the curve in the wild

Once the curve is in your head, every model release reads differently. You see a model card that says 70B parameters, trained on 15T tokens. You compute 6ND ≈ 6.3e24 FLOPs. You compute D/N ≈ 214 tokens per parameter: well past Chinchilla, deep into the "trade compute for inference cost" regime, because at this point the people doing the run aren't trying to minimise pretraining loss for a fixed training budget. They're trying to reach a target loss with the cheapest model to serve across a large served-token budget, which says: train smaller, train longer.

That ratio of tokens per parameter is the single most informative number in any model release. It tells you almost the whole story of how the lab thought about its problem. Below 5: a research run, probably a quick scaling experiment. Around 20: someone followed Chinchilla literally and is reporting compute-optimal training loss. Above 100: someone is solving an inference economics problem, not a pretraining loss problem. Above 500: someone is shipping for edge devices and wants the smallest model that benchmarks acceptably.
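The model-card arithmetic above fits in a small helper. This is a sketch: the function name is made up, and the thresholds are the rough bands described in the text, with the whole range between 5 and 100 tokens per parameter lumped together as an assumption.

```python
def read_model_card(n_params: float, n_tokens: float) -> dict:
    """Locate a release on the curve from its parameter and token counts."""
    flops = 6.0 * n_params * n_tokens          # training compute, C ≈ 6ND
    ratio = n_tokens / n_params                # tokens per parameter
    if ratio < 5:
        regime = "research run or quick scaling experiment"
    elif ratio <= 100:
        # Assumption: the whole 5-100 band is grouped with the "around 20" case.
        regime = "roughly Chinchilla-style compute-optimal"
    elif ratio <= 500:
        regime = "inference economics: smaller model, trained longer"
    else:
        regime = "edge deployment: smallest model that benchmarks acceptably"
    return {"flops": flops, "tokens_per_param": ratio, "regime": regime}

card = read_model_card(70e9, 15e12)
print(f"C = {card['flops']:.2e} FLOPs, "
      f"D/N = {card['tokens_per_param']:.0f}, regime: {card['regime']}")
```

For the 70B, 15T-token example this gives about 6.3e24 FLOPs and about 214 tokens per parameter, which falls in the inference-economics band.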

Each of those choices makes sense in its own world. The scaling law is the lingua franca that lets you read across them.

What to take from it

The scaling laws paper is a small one — three knobs, three power laws, a joint fit. Its impact is out of proportion to its size because it changed what the question was. Before: will the bigger model be smart enough? After: here is the loss at the scale we will run; the only argument left is whether that loss is the loss we need.

Pretraining loss is forecastable, on log-log paper, across orders of magnitude in N, D, and C. This is empirical, not theoretical, and it is robust enough to budget multi-hundred-million-dollar runs.
The functional form is L = E + A/N^α + B/D^β, with an irreducible loss E, a parameter-deficit term, and a data-deficit term. Each term means a specific thing physically; you can read off which is the binding constraint in any particular run.
Compute-optimal training is a constrained-optimisation problem: pick (N, D) on the iso-FLOP curve that minimises L. Kaplan got this slightly wrong. Chinchilla got it right. The form of the answer is unchanged.
The law breaks for data quality, post-training, inference compute, mixture-of-experts, and any open-system extension where the model can spend compute at test time or pull state from outside its weights. These are extensions, not refutations.
Emergence in metrics is partly real and partly an artifact. Smooth loss is compatible with jumpy benchmark scores. Knowing the difference matters when you're forecasting capability rather than loss.

If you want to feel the result rather than read about it, the exercise is this: pick a handful of model announcements from the last five years, ideally from the same family so the losses are comparable. Find the parameter count, the token count, the reported loss or held-out perplexity. Plot them on log-log axes. They will roughly line up. Once that has happened to you on real numbers, you will never again read a model release without first locating it on the curve.

The figures in the paper do most of the work. Figure 1 alone (three log-log plots showing loss vs parameters, tokens, and compute, all linear) is the paper. Stare at it until the linearity feels boring rather than surprising. That feeling of "of course it's a straight line" is the moment scaling laws stop being a result and start being a tool.

What's next, if you want to keep going: the Chinchilla paper (Hoffmann et al. 2022) for the corrected recipe, the Schaeffer emergence paper (2023) for the metric-artifact argument, and the o1 reasoning report (OpenAI 2024) for the inference-time compute axis. Together they form the modern scaling story. Kaplan was the foundation. The rest is what's been built on it.
