Machine Learning / 2022 / arXiv

Training Compute-Optimal Large Language Models — Hoffmann et al.

GPT-3 and Gopher were trained wrong. They had too many parameters and far too little data for the compute they used, and nobody noticed because the loss curve still went down.

Suppose someone hands you a fixed pile of GPUs — say, enough to do 10²³ floating-point operations of training. You have to spend them on a language model. You get exactly two knobs: how many parameters N the model has, and how many tokens D it sees during training. Bigger N means more capacity per token. Bigger D means more evidence per parameter. The pile of compute you can afford is roughly the product of the two.

There is a right answer to how to split the budget, and for several years the field had it wrong. Kaplan et al. (2020) — the original Scaling Laws paper — said: when you get more compute, almost all of it should go into making the model bigger. Tokens grow slowly. Parameters grow fast. Under that recipe, GPT-3 was trained as a 175B-parameter model on 300B tokens. Gopher was 280B parameters on 300B tokens. DeepMind's Chinchilla paper went back, redid the experiment more carefully, and announced something that retroactively embarrassed half the frontier labs: those models were undertrained. For the compute they used, they should have been about a quarter the size and seen about four times the data.

This is the paper that turned scaling from vibes into accounting. The underlying calculation — how many tokens does a model of size N actually want to see — is still the rough planning equation behind every frontier pretraining run today, even when modern recipes break it on purpose.

What follows is the whole argument in slow motion. The 6ND identity that lets you convert FLOPs into a (params, tokens) pair. The iso-FLOP experiment that let DeepMind read the bowl shape directly off training runs. The three independent fits that all landed on the same exponent. Why "20 tokens per parameter" is a useful number and a misleading one. And then, because four years have passed, why every frontier lab now trains past Chinchilla on purpose, and what that means.

Why this paper mattered

Before getting into the math, it's worth saying plainly what was at stake. Between 2020 and 2022, every major lab — OpenAI, Google, DeepMind, Microsoft, Anthropic, Meta — was racing to build the largest possible language model. The ambient theory, set by Kaplan, was that parameter count was the binding constraint: more parameters per FLOP, fewer tokens per parameter. GPT-3 cost roughly $5M to train. Megatron-Turing NLG, at 530B parameters, cost more. Gopher, at 280B parameters, was DeepMind's flagship. Each of these was, in retrospect, a multi-million-dollar mistake. They achieved a worse loss than a smaller, better-trained model on the same hardware would have.

The Chinchilla paper landed in March 2022 and quietly rearranged the field. Within twelve months, every frontier model in development had been re-planned around the new ratio. Llama (February 2023) explicitly cited it; Llama 2 (July 2023) doubled down; Mistral, Qwen, DeepSeek, and every subsequent open-weights release inherited the assumption. The single number — "about 20 tokens per parameter" — became as load-bearing in pretraining design as the depth-vs-width ratio in convnets, or the dropout rate in old MLPs. It's hard to find another paper from the past five years whose numerical claim has propagated this widely with this little modification.

Compute is just N times D, times a constant

Start with the conversion that makes the rest of the paper legible. Training a transformer for one token costs roughly 6N floating-point operations. Two of those are for the forward pass: one multiply and one add per parameter that participates in computing the next-token logit. Four more come from the backward pass, which costs about twice the forward pass: roughly 2N for gradients with respect to activations and 2N for gradients with respect to weights. Optimizer arithmetic is paid per parameter per step rather than per token, so it is negligible by comparison. The 6 is approximate; it depends on exactly which parameters you count (embeddings, for example) and whether you include attention's quadratic-in-context-length term. For typical pretraining at 2k–8k context the approximation is good to a few percent.

Multiply by D tokens of training and you get the total compute C of a training run:

C ≈ 6 · N · D

That equation is the whole game. Pick any two of (compute, parameters, tokens) and the third is determined. If you're handed a budget of C FLOPs and you want to spend them all, you are choosing a point on the curve N · D = C/6. A 70B-parameter model trained on 1.4T tokens lives on that curve at C ≈ 5.9 × 10²³ FLOPs. So does a 7B model trained on 14T tokens, and a 700B model trained on 140B tokens. They cost the same to train. They do not perform the same.
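The three runs named above can be checked with a couple of lines of arithmetic. This is a minimal sketch of the C ≈ 6 · N · D identity only; the function name `training_flops` is made up for illustration.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 · N · D."""
    return 6.0 * n_params * n_tokens

# The three (N, D) pairs from the text; all sit on the same iso-compute curve.
runs = {
    "7B on 14T": (7e9, 14e12),
    "70B on 1.4T": (70e9, 1.4e12),
    "700B on 140B": (700e9, 140e9),
}
for name, (n, d) in runs.items():
    print(f"{name:>13}: C ≈ {training_flops(n, d):.2e} FLOPs")
```

All three lines report the same budget, about 5.9 × 10²³ FLOPs, which is the point: the identity says nothing about which of the three is the better model.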

What 6ND lets you compare

Why does this matter? Because before 6ND, comparing training runs was a mess. One lab reports tokens, another reports steps, a third reports wall-clock GPU-hours on hardware nobody else has. With 6ND you can convert any run to a single number, plot it on a log axis, and ask: is the loss what we'd expect at this compute? It made scaling laws measurable. The rest of the paper is what you can do with that measurement.

The iso-FLOP idea

Now you can ask the right question. Hold compute fixed at some value C. Sweep along the curve N · D = C/6: at one end you have a tiny model trained on a huge amount of data, at the other end you have a giant model trained on very little. For each point on the curve, train the model and measure final loss. You'll get a U-shaped — well, bowl-shaped — curve in N. The bottom of the bowl is the compute-optimal point for that budget.

Why a bowl? Two ways to be wrong, one way to be right. On the left side of the bowl, N is small and D is huge: you're pouring tokens into a model that doesn't have enough parameters to absorb them. The model underfits: it can't represent the patterns in the data, so the loss is high because capacity is the bottleneck. On the right side, N is huge and D is tiny: every parameter has barely seen any evidence. Most of the network has never been pushed by a gradient that mattered. Loss is high because data is the bottleneck. The bottom of the bowl is the place where capacity and evidence are balanced: both are scarce, both are binding constraints.

Then move to a bigger budget C′ and repeat. New bowl, new bottom. Do this for many budgets. Connect the bottoms. That curve — the locus of compute-optimal (N, D) pairs — is what Chinchilla set out to measure. The bowls are called iso-FLOP curves because every point on one bowl uses the same total compute.
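One way to see a single bowl without training anything is to sweep N along an iso-FLOP curve on a fitted loss surface. The sketch below assumes the parametric form L(N, D) = E + A/N^α + B/D^β with the published Approach 3 constants (406.4 and 410.7, which this article rounds to 406 and 410); in the paper the bowls come from real training runs, not from this formula. The function names and the grid are illustrative choices.

```python
# Published Approach 3 constants (Hoffmann et al.), used here as a stand-in surface.
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_sweep(compute, n_min=1e9, n_max=1e12, points=400):
    """Return (N, D, loss) tuples along the curve 6·N·D = compute."""
    rows = []
    for i in range(points):
        n = n_min * (n_max / n_min) ** (i / (points - 1))  # log-spaced model sizes
        d = compute / (6.0 * n)                            # tokens implied by the budget
        rows.append((n, d, loss(n, d)))
    return rows

rows = isoflop_sweep(10**23.8)
best_n, best_d, best_loss = min(rows, key=lambda r: r[2])
print(f"bowl bottom: N ≈ {best_n/1e9:.1f}B, D ≈ {best_d/1e12:.2f}T, loss ≈ {best_loss:.3f}")
print(f"left wall loss:  {rows[0][2]:.3f}   right wall loss: {rows[-1][2]:.3f}")
```

At 10^23.8 FLOPs the bottom lands around 33 to 34B parameters and a little over 3T tokens. With these rounded constants that is a noticeably higher tokens-per-parameter ratio than 20, so read the shape of the bowl rather than the exact location of its minimum.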

The actual fitting in the paper is done three different ways, each more careful than the last, but the answer they all converge on is roughly: at the bottom of the bowl, N and D should grow at about the same rate. Specifically, when you 10× your compute, both N and D should grow by roughly √10 ≈ 3.16×. Kaplan's original law, with N* ∝ C^0.73, said N should grow by roughly 5.4× while D grew by only about 1.9×. That's the disagreement that mattered.

Figure (interactive): iso-FLOP curves. Fix the budget, sweep the model size, find the bowl.

For each compute budget C there is one model size N that minimises loss. Smaller models starve their parameters; larger ones starve their tokens. The bottom of the bowl is the Chinchilla point. GPT-3 and Gopher sit high on the parameter axis, far up the right wall of their respective bowls: same compute, much worse loss. That's the entire result of the paper, in one curve.

Reading the ridge in two dimensions

It helps to see the same surface in two dimensions. The bowl plot collapses one axis (compute) by holding it fixed. Step back and put parameters on one axis and tokens on the other, and you get a 2D map. Iso-FLOP curves are diagonal lines (in log-log they're straight lines of slope −1, since log N + log D = log(C/6) is constant). The compute-optimal frontier — Chinchilla's main claim — is a single ridge running diagonally across the plane. Famous models sit on it, near it, or far from it.

Figure (interactive): iso-FLOP map in the (N, D) plane. The compute-optimal ridge, with famous models pinned to it.

Every diagonal line is one compute budget: stay on the line and you're spending the same amount of training FLOPs. The gold ridge is the locus of compute-optimal points. Selecting a model shows, in loss units, how far it falls from the ridge along its own iso-FLOP slice.

Readout for GPT-3 (175B parameters, 300B tokens):
compute C: 10^23.5 FLOPs
actual loss: 2.002
ridge loss: 1.954
loss left on the table: +0.048
ridge would say: 24.4B parameters, 2.2T tokens

GPT-3 was trained under Kaplan's recipe, where parameters dominate. It sits high above the ridge: the same compute could have built a much smaller, much better model.

Look at GPT-3 in that picture. It sits on its iso-FLOP line (they spent the FLOPs they meant to spend) but high above the ridge, far up the parameter axis. Move along that line back down to the gold ridge and you arrive at roughly N ≈ 24B, D ≈ 2.2T tokens, the figure's own readout for GPT-3's budget. Same compute. Lower loss. That displacement is the entire result of the paper, made visual. Gopher is the same story with a bigger budget. Llama-1's 65B model, trained two years later, sits on the ridge. Llama 3's 8B sits below it: overtrained on purpose, for reasons we'll get to.

There's a useful intuition for why the ridge slopes the way it does. In log-log coordinates, the iso-FLOP lines have slope −1: a 10× increase in N requires a 10× decrease in D to keep compute constant. The Chinchilla ridge has slope +1 (because D ∝ N at the optimum). The two slopes are perpendicular in the diagonal sense — moving along the ridge increases compute, moving across the ridge wastes it. That's why the bowl is so cleanly defined: the ridge is the unique direction in which compute and capability scale together, and any deviation from it spends FLOPs without buying loss.

The 20-tokens-per-parameter rule

If you read the paper for one number, it's this: at the compute-optimal point, you want roughly 20 tokens per parameter. A 7B model wants ~140B tokens. A 70B model wants ~1.4T tokens. A 700B model wants ~14T tokens, which is approximately the total amount of high-quality text ever written in English (give or take), which is its own problem we'll come back to.

The rule is convenient because it folds the messy fitting work into a single ratio. But the actual claim is the joint scaling: both N and D should rise as roughly C^0.5 along the optimal frontier. The 20:1 ratio is what falls out at the budgets they tested. It is a useful planning heuristic, not a deep constant of nature. If you re-parameterise the iso-FLOP slice by r = D/N and plot loss against r directly, the bowl is right there in front of you, and the bottom is in the neighbourhood of 20 across six orders of magnitude in compute.
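As a planning heuristic, the 20:1 rule combined with C ≈ 6 · N · D pins down both knobs from the budget alone: substituting D = 20N gives N = √(C / 120). A minimal sketch follows; the function name and the `tokens_per_param` argument are illustrative, and the ratio is a rule of thumb rather than a fitted constant.

```python
import math

def chinchilla_rule_of_thumb(compute, tokens_per_param=20.0):
    """Split `compute` FLOPs so that 6·N·D = compute and D = tokens_per_param · N."""
    n_params = math.sqrt(compute / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Chinchilla's own budget, 6 · 70B · 1.4T ≈ 5.88e23 FLOPs, recovers its (N, D).
n, d = chinchilla_rule_of_thumb(5.88e23)
print(f"N ≈ {n/1e9:.0f}B parameters, D ≈ {d/1e12:.2f}T tokens")
```

Changing `tokens_per_param` is the quickest way to see how far a deliberately overtrained plan moves away from this baseline at the same budget.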

Figure (interactive): the bowl in r = D/N. Loss plotted against tokens per parameter along a fixed-compute slice.

Walk along a fixed-compute slice but parameterise it by the ratio r = D/N instead of by N. The bowl shape pops out immediately. As compute increases across six orders of magnitude the whole bowl drops, but its bottom drifts only gently in r. That stability is what makes 20:1 a usable planning rule.

Why does the optimal r drift only a little with compute? Because the Hoffmann fit gives nearly equal exponents on N and D: α = 0.34, β = 0.28. Loss is E + A/N^α + B/D^β. At the optimum the marginal returns from adding a parameter and adding a token have to be equal, and that condition is dominated by the exponents. If α and β were wildly different, the optimal ratio would change a lot with compute. They aren't, so it doesn't, much. By 10²⁵ FLOPs the optimum drifts to about 23 tokens per parameter; by 10²⁶ to about 26. Close enough to 20 that nobody changes their plans.
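The equal-marginal-returns condition can be written out explicitly. What follows is a sketch of the standard constrained minimisation, using only the loss form and the 6ND constraint already stated; G is shorthand introduced here for the constant prefactor.

```latex
\begin{aligned}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad \text{subject to } 6ND = C \\[4pt]
\text{with } D = \frac{C}{6N}: \quad
\frac{\partial L}{\partial N} &= -\frac{\alpha A}{N^{\alpha+1}}
  + \frac{\beta B}{N\,D^{\beta}} = 0
  \;\Longrightarrow\; \frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}} \\[4pt]
N^{*} &= G \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad
D^{*} = G^{-1} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \\[4pt]
r^{*} = \frac{D^{*}}{N^{*}} &= G^{-2} \left(\frac{C}{6}\right)^{\frac{\alpha-\beta}{\alpha+\beta}}
\end{aligned}
```

With α = 0.34 and β = 0.28 the exponents on N* and D* are about 0.45 and 0.55, and the exponent on r* is 0.06 / 0.62 ≈ 0.097. The ratio therefore does move with compute, but slowly, and it would not move at all if α and β were exactly equal.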

Three roads, one exponent

Hoffmann's group did not fit the law one way and call it. They did it three ways and asked whether the answers agreed. Each approach is a different empirical procedure, sensitive to different errors, and each one is independently informative.

Approach 1: fix N, vary D

Pick a model size N. Train it for a long time, periodically recording the loss as a function of compute spent. You get a training curve: loss decreasing in D. Repeat for many N. For each compute budget C, scan across the training curves and find the N whose curve is lowest at exactly that compute. That N is the compute-optimal model size N* for C, and the tokens it has seen at that point give D*. Read off (N*, D*) for each C, fit a power law, and get exponents a and b such that N* ∝ C^a and D* ∝ C^b. The paper gets a ≈ 0.50, b ≈ 0.50.

Approach 2: fix C, vary N

This is the iso-FLOP approach the article has been showing. For each compute budget C, train a family of models with different N (and thus different D = C/6N). Find the N that minimises loss. Repeat across C. Fit. The paper gets a ≈ 0.49, b ≈ 0.51. Approach 2 is the cleanest of the three because each data point is the actual final loss of a fully-trained model at exactly that compute.
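The mechanics of Approach 2 (find each bowl's bottom, then fit a line in log-log space) can be sketched end to end. Big assumption: the "runs" below are generated from the parametric surface with the published Approach 3 constants, not from real training, so the slope it recovers is necessarily β/(α+β) ≈ 0.45 rather than the 0.49 the paper measures. All function names and the grid are illustrative.

```python
import math

E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28  # stand-in surface, not real runs

def loss(n, d):
    return E + A / n**ALPHA + B / d**BETA

def best_n(compute, n_min=1e7, n_max=1e13, points=2000):
    """Model size with the lowest loss on the iso-FLOP curve 6·N·D = compute."""
    grid = (n_min * (n_max / n_min) ** (i / (points - 1)) for i in range(points))
    return min(grid, key=lambda n: loss(n, compute / (6.0 * n)))

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

budgets = [10.0**e for e in range(19, 26)]        # 1e19 ... 1e25 FLOPs
log_c = [math.log10(c) for c in budgets]
log_n = [math.log10(best_n(c)) for c in budgets]  # bottom of each bowl
a = fit_slope(log_c, log_n)
print(f"fitted exponent a ≈ {a:.3f} (N* ∝ C^a), so b = 1 - a ≈ {1 - a:.3f}")
```

Because D = C/(6N) on every slice, b is always 1 − a in this setup; on real runs the same two-step procedure is what produces the paper's a ≈ 0.49, b ≈ 0.51.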

Approach 3: fit a parametric loss surface

Assume L(N, D) = E + A/N^α + B/D^β. Fit E, A, α, B, β by Huber regression on all 400+ training runs at once. With the surface in hand, analytically solve for the (N, D) that minimises L subject to 6ND = C. The paper gets a ≈ 0.46, b ≈ 0.54. Slightly different exponents but the same qualitative story: roughly equal scaling.

Figure (interactive): three roads to the same exponent. N grows like C^0.5, by every method that isn't broken.

The paper's killer move is robustness: three independent fitting procedures, the same answer. All three approaches are plotted in (compute, optimal-N) space, on top of Kaplan's older fit for contrast. The Chinchilla lines lie nearly on top of each other; Kaplan's drifts away with compute.

Fitted slope a in N* ∝ C^a:
approach 1: 0.50
approach 2: 0.49
approach 3: 0.46
Kaplan 2020: 0.73

Three estimators, three routes, slopes all near 0.5. Kaplan's curve, drawn for comparison, leaves the others by an order of magnitude in N once you cross 10²³ FLOPs, exactly where GPT-3 and Gopher were trained. That's the half-billion-dollar disagreement.

The takeaway is robustness, not the exact slope. The three slopes are 0.50, 0.49, 0.46. They disagree at the second decimal. They agree at the first, and they all sit far from Kaplan's 0.73. That's how you know the result isn't an artefact of one fitting choice.

Why a 70B model on 1.4T tokens beat a 280B model on 300B tokens

To prove their fitted curve was right, DeepMind did the obvious thing: they used the same compute budget as Gopher (~5.76 × 10²³ FLOPs), put it on the compute-optimal point of their fit instead of Gopher's, and trained that. The compute-optimal point predicted N ≈ 70B and D ≈ 1.4T tokens — the same total cost, but a quarter of the parameters and four times the tokens. They named it Chinchilla.

Chinchilla beat Gopher on basically every benchmark they ran. Not by a small margin. On MMLU, Chinchilla scored 67.5% to Gopher's 60%. On the BIG-bench language tasks, on reading comprehension, on common-sense reasoning, on translation — Chinchilla was ahead almost everywhere. Same compute, same data source, smaller model, more training. The capacity that Gopher had stuffed into its 280B parameters was sitting unused — it never saw enough tokens to learn what those parameters could have represented. The 70B model didn't have that excess capacity to waste; every parameter it had got pushed by gradient signal until it converged to something useful.

There's a secondary, very practical consequence. A 70B model is much cheaper to run at inference time than a 280B model: roughly four times fewer FLOPs per generated token, and a correspondingly smaller memory footprint to fit on a given GPU configuration. So Chinchilla wasn't just better at training-FLOP efficiency; it stayed about 4× cheaper to serve for the rest of its life. That made the result lethal for production deployment, and it's roughly why every open-weights frontier model since (Llama 2, Llama 3, Mistral, Qwen) has been smaller and more data-hungry than the pre-Chinchilla generation.

We test this hypothesis by training Chinchilla, a predicted compute-optimal model with 70B parameters trained on 1.4T tokens. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
Hoffmann et al., 2022

What the math says about your next 10× of compute

Here's the practical takeaway, which is shorter than the derivation. If today you're training a compute-optimal model with parameters N and tokens D, and tomorrow your budget grows by a factor of k:

Multiply parameters by √k
Multiply tokens by √k
Total compute grows by k (because √k · √k = k)

So a 10× compute increase means about 3.16× more parameters and 3.16× more tokens. Not 10× parameters, not 10× tokens, not even 5× and 2×. Roughly equal scaling on both axes. This is what Kaplan got wrong, and what Chinchilla got right, and what every pretraining plan since has used as a starting point — even when the plan ends up departing from it.
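The scaling recipe above fits in one function. The sketch below takes an existing compute-optimal (N, D) pair and a budget multiplier k, and also runs the exponent 0.73 quoted for Kaplan's fit so the two recipes can be compared from the same starting point. That comparison is purely illustrative: Kaplan's fit would not have placed the starting point at 70B and 1.4T in the first place.

```python
def scale_compute_optimal(n_params, n_tokens, k, a=0.5):
    """Scale a compute-optimal (N, D) pair for a k-fold compute increase.

    `a` is the exponent in N* ∝ C^a; tokens take the remaining 1 - a,
    so the product N·D (and hence compute) always grows by exactly k.
    """
    return n_params * k**a, n_tokens * k ** (1.0 - a)

n0, d0 = 70e9, 1.4e12  # starting point: Chinchilla's own configuration
for label, a in [("equal scaling (a = 0.50)", 0.50), ("Kaplan-style (a = 0.73)", 0.73)]:
    n1, d1 = scale_compute_optimal(n0, d0, k=10, a=a)
    print(f"{label}: N x{n1/n0:.2f}, D x{d1/d0:.2f}, compute x{(n1*d1)/(n0*d0):.1f}")
```

Both rows spend the same 10× budget; they only disagree on how it is split between parameters and tokens.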

Why the original scaling laws missed this

The honest answer is that Kaplan et al. used a fixed number of training tokens and one learning-rate schedule for all of their runs, rather than matching the schedule to the length of each run. To see how a model does at a smaller token count, they read the loss off partway through training, while the cosine decay was still far from finished and the learning rate was still high. Those intermediate losses are overestimates: a run whose schedule is matched to that token count, so that the decay finishes exactly at the end of training, reaches a lower loss. As Hoffmann et al. argue, using the intermediate losses underestimates how effective it is to train on less data than the schedule was built for, and that contributes to the conclusion that model size should grow faster than data. Chinchilla set the cosine cycle length to match each (N, D) pair so each run actually finished its decay at the end of training, not before and not after, and the fitted curve moved by a factor of four in tokens-per-parameter.
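To make the schedule-length issue concrete, here is a small sketch of a cosine learning-rate schedule evaluated at the same token count under two horizons: one matched to the run, one built for a much longer run. The peak learning rate, the 10× decay floor, the 20B-token run, and the omission of warmup are all illustrative assumptions, not settings taken from either paper; the 130B-token horizon echoes the example Hoffmann et al. use when discussing this effect.

```python
import math

def cosine_lr(tokens_seen, horizon_tokens, peak_lr=2e-4, final_ratio=0.1):
    """Cosine decay from peak_lr down to final_ratio * peak_lr over horizon_tokens."""
    progress = min(tokens_seen / horizon_tokens, 1.0)
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))

run_tokens = 20e9  # the token count whose loss we want to measure
matched = cosine_lr(run_tokens, horizon_tokens=20e9)    # decay finishes exactly here
too_long = cosine_lr(run_tokens, horizon_tokens=130e9)  # decay has barely started
print(f"LR at 20B tokens, schedule matched to 20B:  {matched:.2e}")
print(f"LR at 20B tokens, schedule built for 130B: {too_long:.2e}")
```

Under the long schedule the learning rate at 20B tokens is still close to its peak, so a loss read off at that point reflects a model that has not been annealed, which is why it is not comparable to a run that was planned to stop there.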

This is a useful warning about empirical scaling work. The thing being measured — "loss as a function of compute" — is downstream of a hundred details (learning rate, batch size, optimizer, data ordering, warmup) that all interact. Get one of them wrong systematically across your sweep and your fitted exponent walks away from the truth. You can be off by a factor of four in tokens-per-parameter and not know.

It's also a warning about the value of replication. Chinchilla wasn't conceptually new; it was a careful redo of the same kind of experiment Kaplan ran. The bug was systematic enough that no amount of staring at Kaplan's plots would have revealed it. Only running fresh experiments with the bug fixed would. The field paid hundreds of millions of dollars in undertrained models before someone redid the work properly.

Where the rule cracks: data quality, MoE, inference, repetition

The Chinchilla fit assumed a fixed data distribution, dense models, no repeated epochs, and that you only care about training-time loss. Each of those assumptions has since cracked. If you sat down today to train a frontier model and used Chinchilla's recipe verbatim, you would be leaving a great deal on the table. The next demo lets you toggle between four common modifications and see how the optimum drifts.

Figure (interactive): post-Chinchilla complications. Four ways the simple ratio breaks under modern training.

The Chinchilla bowl assumes dense models, single-epoch training, interchangeable tokens, and zero inference cost. Each of those falls. The figure switches between the four most common patches and shows the optimum drifting away from the gold ridge.

Readout for the data-quality patch:
compute C: 10^23.80 FLOPs
data quality multiplier q: 2.00×
baseline N · D: 33.50B · 3.14T
modified N · D: 45.97B · 2.29T
shift in N: +37%
modified loss: 1.903

A token of FineWeb-Edu carries more signal than a token of raw Common Crawl. Multiply effective tokens by q. The bowl shifts; you want a bigger model on the same training compute, because each token now teaches more.

Data quality

Chinchilla counts tokens as if they were interchangeable. They aren't. A token of carefully filtered code or math is worth more, in capability per parameter, than a token of forum spam. Modern recipes spend enormous effort on filtering, deduplication, and synthetic data generation, and the effective number of "useful" tokens is much smaller than the raw count. The 20:1 rule applied to Common Crawl is a different equation than the 20:1 rule applied to FineWeb-Edu. Concretely, the Phi family of models from Microsoft showed that aggressively filtered and synthetic data could let a 1.3B model match the performance of a 7B model trained on raw web text — that's a 5× displacement off the Chinchilla ridge purely from data quality.

The cleanest way to think about this is to multiply effective tokens by a quality factor q: D_eff = q · D. Chinchilla's analysis still applies, just with D_eff instead of D. Higher q means the bowl bottom shifts toward larger N — every token now teaches more, so you can afford a bigger model on the same training compute. Whether you call this "breaking Chinchilla" or "applying Chinchilla after a unit change" depends on how strict you want to be.
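The D_eff = q · D substitution has a closed form on the parametric surface: q simply rescales the data coefficient B to B / q^β, and the optimum follows. The sketch below assumes the published Approach 3 constants and a single scalar q, which is itself a modelling simplification; the function name is illustrative.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def compute_optimal(compute, quality=1.0):
    """Closed-form minimiser of E + A/N^alpha + B/(q·D)^beta subject to 6·N·D = compute."""
    b_eff = B / quality**BETA  # a quality factor q folds into the data coefficient
    g = (ALPHA * A / (BETA * b_eff)) ** (1.0 / (ALPHA + BETA))
    n = g * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    d = compute / (6.0 * n)    # raw tokens actually processed
    loss = E + A / n**ALPHA + B / (quality * d) ** BETA
    return n, d, loss

base_n, base_d, _ = compute_optimal(10**23.8)
mod_n, mod_d, mod_loss = compute_optimal(10**23.8, quality=2.0)
print(f"baseline: {base_n/1e9:.1f}B params, {base_d/1e12:.2f}T tokens")
print(f"q = 2:    {mod_n/1e9:.1f}B params, {mod_d/1e12:.2f}T tokens, loss {mod_loss:.3f}")
print(f"shift in N: {100 * (mod_n / base_n - 1):+.0f}%")
```

On this surface the optimal N scales as q^(β/(α+β)), so doubling token quality raises the compute-optimal model size by roughly 37% at any budget, while the raw token count falls by the same factor.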

Repeated epochs

Once high-quality data runs out, the question becomes: should you train a smaller model on unique tokens, or a bigger model on repeated ones? Repetition has diminishing returns but isn't worthless. Recent work (Muennighoff et al., "Scaling Data-Constrained Language Models") suggests up to ~4 epochs is roughly free; past that, repeated tokens contribute much less than fresh ones. Chinchilla's accounting doesn't have a slot for this. In a data-constrained regime, the iso-FLOP bowl flattens out: you can't get to the bottom because the bottom requires more unique data than exists. The right move is to grow N further than Chinchilla would suggest, accept slightly worse loss per FLOP, and stop counting tokens that have been seen too many times.

There's a deeper question lurking under repetition: at what point does the curve of loss-vs-tokens stop being well-defined? Chinchilla treated tokens as IID samples from a single distribution. Real pretraining mixes domains (web, code, math, multilingual) and the marginal value of one more token depends on which mixture component it came from. Once you start re-weighting the mixture and repeating high-value subsets, you've left the regime where a single power law in D is the right model. Modern practice uses curriculum on top of Chinchilla: start with broad web data, anneal toward higher-quality sources at the end of training, mix in synthetic data generated by an earlier checkpoint. None of this is captured by a 1990s-style scaling law.

Mixture of experts

MoE deserves a slightly longer treatment because the accounting is genuinely confusing. A dense 70B model has 70B parameters, all of which see every token. A sparse 70B-active model with 8× expansion has 560B total parameters but only 70B of them activate per token. Training cost is dominated by activated parameters, so the model trains roughly as fast as a 70B dense model — but it has 8× the storage to put knowledge into. Capacity is somewhere between the two extremes, depending on how well the routing learns to specialize.

An MoE model has N total parameters but only activates a fraction per token. The training FLOP cost is closer to 6 · N_active · D, but capacity scales more like N_total. So you get Chinchilla's data hunger of an N_active-sized model while having the storage of a much bigger one. The whole 6ND identity needs revising: replace N with N_active on the training-cost side, but keep something closer to N_total on the capacity side. The cleanest way to model it is to treat capacity as scaling like N_active · √(N_total/N_active) — between linear and constant in expansion factor. Mixtral, DeepSeek-MoE, and Qwen-MoE all sit firmly off the dense Chinchilla ridge, and they're correctly designed for it.
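The bookkeeping can be sketched as below. It encodes exactly two things from the text: training cost follows N_active, and capacity is modelled with the N_active · √(N_total/N_active) heuristic. That capacity formula is a rough planning device, not a measured law, and the function name is made up for illustration.

```python
import math

def moe_accounting(n_active, expansion, n_tokens):
    """Rough MoE bookkeeping following the heuristic described in the text."""
    n_total = n_active * expansion
    train_flops = 6.0 * n_active * n_tokens              # cost follows active parameters
    capacity = n_active * math.sqrt(n_total / n_active)  # heuristic, between N_active and N_total
    return n_total, train_flops, capacity

n_total, flops, capacity = moe_accounting(n_active=70e9, expansion=8, n_tokens=1.4e12)
print(f"total parameters:   {n_total/1e9:.0f}B")
print(f"training FLOPs:     {flops:.2e}")
print(f"effective capacity: ~{capacity/1e9:.0f}B dense-equivalent (heuristic)")
```

For the 70B-active, 8× example this gives 560B total parameters at the training cost of a dense 70B model, with a heuristic capacity a little under 200B dense-equivalent.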

Inference compute

Of the four cracks, this is the one that has bent training plans hardest. A pretraining run is a one-time cost. Inference is paid forever. If a model serves a trillion tokens over its production life, even a small reduction in per-token serving cost dwarfs the training spend. So the question becomes: what's the cheapest model, in lifetime FLOPs, that achieves a given loss? The answer is almost never Chinchilla-optimal.

Chinchilla optimizes training loss per training FLOP. If you're going to serve a model billions of times, you'd rather pay more in training to get a smaller model with the same loss. Llama 3's overtraining is exactly this trade. Sasha Rush and others have written out the calculation cleanly: the optimum smoothly rolls down the iso-FLOP curve toward smaller N as the expected inference-to-training ratio grows. At zero inference (research-only run), Chinchilla is exactly right. At Llama-3-scale deployment (trillions of inference tokens), you want a model that's three to five times smaller than Chinchilla would suggest, trained much longer.
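The trade can be sketched numerically: fix a target loss, and for each model size ask how many training tokens it needs, then add the serving bill. The assumptions here are loud ones: the parametric loss surface with the published Approach 3 constants, about 2N FLOPs per inference token (the forward-pass share of 6N), a target loss of 1.95 chosen arbitrarily, and a plain grid search. Function names are illustrative.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def tokens_for_loss(n_params, target_loss):
    """Training tokens a model of size N needs to reach target_loss, or None if it never can."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:  # this N is too small to ever reach the target
        return None
    return (B / gap) ** (1.0 / BETA)

def cheapest_model(target_loss, inference_tokens, points=3000):
    """Grid-search N to minimise training plus inference FLOPs at a fixed loss."""
    best = None
    for i in range(points):
        n = 1e9 * 1000.0 ** (i / (points - 1))  # 1B ... 1T parameters, log-spaced
        d = tokens_for_loss(n, target_loss)
        if d is None:
            continue
        total = 6.0 * n * d + 2.0 * n * inference_tokens
        if best is None or total < best[0]:
            best = (total, n, d)
    return best

for served in (0.0, 1e12, 1e13):
    total, n, d = cheapest_model(target_loss=1.95, inference_tokens=served)
    print(f"inference tokens {served:.0e}: N ≈ {n/1e9:.0f}B, D ≈ {d/1e12:.1f}T")
```

With zero inference the search returns the ordinary compute-optimal point for that loss; as the serving volume grows, the cheapest model gets smaller and is trained on more tokens.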

And if the model is going to do test-time reasoning — long chains of thought, tree search, sampling many candidates — the calculation gets messier still. The relevant cost is now compute per useful answer, integrated over the model's lifetime, not loss per pretraining token. A reasoning model running 10,000 tokens of chain-of-thought per query has shifted the goalposts entirely; Chinchilla optimization doesn't speak to that regime, and post-training compute (RL, distillation, preference learning) lives in a different equation.

A worked example: planning a 10²⁴ FLOP run

Suppose you have $10M of GPU time, which at H100 prices in 2024 buys you roughly 10²⁴ training FLOPs. What does Chinchilla say to do? Plug into the closed-form solution to Approach 3 (Equation 10 of the paper, which you can solve in a few lines): the optimal N is around 60B parameters, the optimal D is around 2.8T tokens, and the predicted loss is around 1.94. That's the recipe before any modern adjustments.

Now apply the modifications. Suppose your data is FineWeb-Edu rather than raw Common Crawl, worth maybe 2× per token. Effective tokens jump to 5.6T, and the optimum N shifts up — call it 80B. But you also know you'll serve this model 100× more inference tokens than you trained on, so you want to push N back down to keep inference cost manageable; the inference-aware optimum is more like 30B parameters trained on 5.5T effective tokens. If instead you go MoE with 8× expansion, you can keep N_active near 30B but ship a model with 240B total parameters and substantially better capacity, while still spending the same training compute. Each of these decisions is a separate axis of optimization, and they don't compose cleanly. Chinchilla gives you the baseline against which all of them are measured.

This is why the 20:1 number is both ubiquitous and frequently violated. It's the pivot point. Every modern training plan starts with "Chinchilla would say N = X, D = Y" and then walks one or more steps away from that, with a written justification for each step. The plan that doesn't reference Chinchilla at all is the suspicious one — it usually means the team didn't do the calculation.

What's still true

Despite all the cracks, the central claim of the paper has held up for four years and counting: at training time, parameters and data scale together, not separately. Whenever a lab announces a model that's much bigger than its data budget can support, it underperforms a smaller model with the same compute. This has happened often enough since 2022 that the planning question "are we Chinchilla-optimal?" is now the first slide in any pretraining design doc, even when the answer turns out to be "no, we're deliberately past it because of inference".

The deeper lesson is methodological. Scaling laws looked like a curiosity in 2020 — neat plots, mostly retrodicting things people already knew. Chinchilla showed they could be wrong in ways that cost hundreds of millions of dollars, and right in ways that saved them. "Empirical loss as a function of clean inputs" is a real measurement instrument now, and a lot of the field's planning runs on it. When OpenAI, Anthropic, Google, or DeepSeek announce a new model and report a final loss number, they are implicitly comparing it to a Chinchilla-style fit and asking whether the gap is the size they expected.

There's also a quieter epistemological lesson. The Chinchilla coefficients (E ≈ 1.69, A ≈ 406, α ≈ 0.34, B ≈ 410, β ≈ 0.28) are not laws of physics. They're properties of a particular dataset (MassiveText), a particular architecture family (decoder-only transformers), a particular tokenizer, a particular optimizer (AdamW with their specific hyperparameters). Recent re-derivations on different mixtures get slightly different numbers. The thing that's robust is the shape of the law — power-law floors, two terms in N and D, near-equal exponents — not the exact coefficients. Treat the coefficients as a starting point for your own fitting, not as a constant to memorize.

If you read the original

The paper is dense but the core argument lives in a handful of figures and tables. Figure 3 shows the iso-FLOP bowls and where their minima fall. Table 2 gives the joint scaling exponents from the three approaches. Table 3 lists the estimated compute-optimal token counts for a range of model sizes. The results section is the head-to-head that shows Chinchilla beating Gopher. You can skim everything else.

If you want to feel the argument in your hands, write a tiny script that takes a compute budget and returns the optimal (N, D) and predicted loss using their fit (Approach 3, Equation 10 of the paper). Plug in 10²², 10²³, 10²⁴, 10²⁵ FLOPs. Notice how fast the data requirement runs into the wall of "all the text on Earth." That wall is the reason the field has spent the last two years pivoting hard toward synthetic data, post-training, and inference-time compute.
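A version of that tiny script might look like the sketch below. It assumes the parametric loss L(N, D) = E + A/N^α + B/D^β with the published constants (406.4 and 410.7 for A and B) and uses the closed-form minimiser under 6ND = C; the function name is illustrative.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def chinchilla_optimal(compute):
    """Closed-form (N*, D*, loss) minimising E + A/N^alpha + B/D^beta with 6·N·D = compute."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n = g * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    d = compute / (6.0 * n)
    return n, d, E + A / n**ALPHA + B / d**BETA

print(f"{'FLOPs':>8} {'params':>10} {'tokens':>10} {'D/N':>7} {'loss':>6}")
for exponent in (22, 23, 24, 25):
    n, d, loss = chinchilla_optimal(10.0**exponent)
    print(f"{'1e' + str(exponent):>8} {n/1e9:>9.1f}B {d/1e12:>9.2f}T {d/n:>7.0f} {loss:>6.3f}")
```

Be aware that these rounded constants, taken at face value, put the optimum at a tokens-per-parameter ratio well above 20, so the output will not match rule-of-thumb numbers. The token column still makes the intended point: the data requirement climbs into the tens of trillions within a few orders of magnitude of compute.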

The 2024 audit and what survived it

There is a follow-up worth knowing about. In 2024, Besiroglu et al. tried to reproduce Hoffmann's Approach 3 fit and reported they couldn't get the same coefficients out of the published data — the implied tokens-per-parameter ratio at 10²³ FLOPs is closer to 25:1 than 20:1, depending on which subset of training runs you fit on. The qualitative claim (roughly equal scaling, much smaller than Kaplan) survives the audit. The exact number does not. This is consistent with the broader story: the shape of the law is robust, the coefficients are not, and any team training a frontier model today re-fits the law on their own data and architecture before believing any of it.

The five-line takeaway: training compute is approximately 6ND. For a fixed compute budget, there's a unique balance of N and D that minimises loss. That balance is roughly 20 tokens per parameter, with both N and D growing as C^0.5. GPT-3 and Gopher were on the wrong side of the bowl; Chinchilla was on the bottom. Modern models train past the bottom on purpose, because inference is where the money is.
