Machine Learning / 2020 / arXiv

Language Models are Few-Shot Learners

Language Models are Few-Shot LearnersBrown et al.

The shock of GPT-3 wasn't that a 175B model could write fluent English. It was that you could teach it a new task by typing three examples into the prompt.

Pre-2020, teaching a language model a new task meant gradients. You collected labeled data, you finetuned, you held out a validation set, you watched a curve, you shipped weights. Every task was a little training pipeline. The interface to a model was a directory of checkpoints, one per job. If you wanted sentiment, you finetuned a sentiment model. If you wanted summarisation, you finetuned a summarisation model. If you wanted both, you had two models and two operational headaches.

GPT-3 changed the interface. The same frozen weights could translate French, fix grammar, do three-digit arithmetic, unscramble anagrams, write SQL, mimic a journalist, or impersonate a Python REPL — and the way you specified which one was by typing. You wrote a few input-output pairs into the prompt, then a fresh input, and the model continued the pattern. No gradient updates. No training loop. The task description and the training data were the same object: text in the context window.

The Brown et al. paper is mostly an enormous evaluation grid showing this works across dozens of benchmarks. It is 75 pages and most of it is benchmark tables. The interesting thing isn't the grid. It's what the grid implies about what the model is actually doing — and about the new shape of the interface between humans and machine learning systems that the grid was the first public evidence of.

If you read the paper today the wild claims feel ordinary. Of course you can prompt a language model. Of course a few demonstrations help. But in 2020 every one of those claims was a heresy. Finetuning was the religion. Task-specific architectures were table stakes. The idea that a single set of frozen weights would beat purpose-built systems on tasks the authors hadn't even anticipated was, in practical terms, the moment NLP turned into something else.

What the paper actually argued

The paper's contribution isn't really GPT-3 the model. It's the framing. Brown and coauthors formalised three regimes that all use the same frozen weights and differ only in how much demonstration the prompt contains. Zero-shot: a natural-language task description, then the query. One-shot: the description plus exactly one example. Few-shot: the description plus a handful, typically 5 to 50.

Zero-shot looks like this:

Few-shot looks like this:

Same model. Same weights. The only thing that changed was the text the model conditioned on. And the few-shot version is dramatically better — sometimes the difference between gibberish and useable output. The paper carefully separated this from finetuning by never letting gradients touch the model. The 'learning' in 'in-context learning' is a metaphor: nothing learns in the optimisation sense. The model behaves as if it had learned, because the context provided enough scaffolding for the right computation to fall out.

zero / one / few-shot

Same task, three different prompts

Capital-of-country, asked three ways. The model and the query never change. What changes is how much *evidence* the prompt provides about what counts as a valid answer.

zero-shot
k = 0 · instruction only
30%
PROMPTWhat is the capital of a country?Germany => P(next token)BerlinMunichFrankfurtthe capital i…germany?
one-shot
k = 1 · 1 demonstration
61%
PROMPTWhat is the capital of a country?France => ParisGermany => P(next token)BerlinMunichFrankfurtthe capital i…germany?
few-shot
k = 3 · 3 demonstrations
95%
PROMPTWhat is the capital of a country?France => ParisJapan => TokyoBrazil => BrasíliaGermany => P(next token)BerlinMunichFrankfurtthe capital i…germany?
shots kP(Berlin) = 94.9%

At k=0 the prompt asks an abstract question — the model has to guess that the desired output is a single proper noun. At k=1 a single demo nails the format. By k=3 the format is locked in and the country→capital pattern is reinforced enough that Berlin wins.

The same task — capital of a country — across the three regimes. Watch how a single demonstration locks in the format, and a third demo reinforces the country→capital pattern enough that the right answer wins. The model and the query never change.

The framing that makes this click: the prompt is a tiny program written in the model's activation space. Pretraining produced a giant interpreter. The prompt is the source code. The interpreter doesn't get recompiled — it just runs whatever you hand it. In-context learning is what we call it when the program happens to define a task by example rather than by instruction. Zero-shot is the program with no test cases, just a spec. Few-shot is the program with three test cases included so the spec can't be misread.

There's a temptation to read 'prompt as program' as a metaphor and move on. Don't. The framing is operationally useful. It tells you how to debug a misbehaving prompt (the program is wrong), how to think about latency (your program runs every token), how to think about cost (your program is billed by length), how to think about safety (your program shares an address space with whatever else got pasted into the context), and how to think about composition (programs can be assembled from sub-prompts the way functions can be assembled from sub-functions).

Why it works mechanistically

Why should a next-token predictor that has only ever been trained to imitate the internet be able to do this? The honest answer is that nobody fully knows, but the intuitions are real and they fit together.

Pretraining is full of demonstrate-then-do

Pretraining text is full of demonstrate-then-do patterns. Stack Overflow questions show example inputs and example outputs. Textbooks work problems before assigning them. Translation pages list pairs in two languages. FAQs follow Q-A-Q-A-Q-A. Blog posts about coding interviews include the prompt, three worked examples, then a fresh problem. A next-token predictor trained on all of this learns, as a side effect, to recognise 'a few examples followed by a new input' as a context that predicts 'an output completing the pattern.' The model isn't doing anything exotic. It is finishing the sentence the way internet text would have finished it.

This single observation does a surprising amount of work. It explains why the format of demonstrations matters: the model has seen Q:/A: a million times and JSON a million times, but it has not seen your bespoke separator. It explains why few-shot beats zero-shot: more examples means a stronger conditioning signal that this particular kind of text follows this particular pattern. It explains why instruction-tuned models work at all: you have shifted the distribution of pretraining-style text the model was prepared for in the direction of imperative natural language.

Attention is the substrate

The mechanism inside the model is attention. When the prompt contains demonstration pairs, attention heads in later layers can pattern-match the structure of the new query against the structure of the demos and copy-with-modifications from the demonstration outputs. Olsson et al.'s induction-heads work (2022) made this concrete: there are specific heads in transformers that implement the find-the-previous-occurrence-and-look-at-what-came-after-it operation, which is roughly what few-shot prompting needs. The model isn't learning at inference time. It's running a routing computation that behaves like learning when the context provides the right scaffold.

The induction-head circuit is a two-layer story. The first layer's heads do prefix matching: a token at position t looks back through the sequence and identifies positions s where the same token appeared before. The second layer does shift-and-copy: at each matched s, the head shifts attention to s+1 and copies what came next. Stack those two operations and you have the algorithm 'when you see token X, output whatever followed X last time you saw it.' That is, structurally, the rule a few-shot prompt is asking the model to follow.

induction head

How attention copies an answer it’s already seen

Pick the query token (the last position in the sequence). The induction head looks back for prior occurrences of that token, attends one step to the right, and copies whatever came next. That circuit, in two layers, is most of what few-shot prompting needs.

query token
SEQUENCE1%thet=01%catt=147%At=21%Bt=31%satt=41%ont=547%At=61%Bt=71%thet=81%matt=9QUERY → PREDICTIONABCOPY DISTRIBUTIONB93%A2%<eos>1%cat1%sat1%

With query A, attention concentrates on the two prior A tokens, the dashed shift arrows step one to the right, and Bwins the copy distribution. With a query that doesn’t repeat (cat, mat) attention has no anchor and the prediction degenerates. Stack two of these heads in a transformer and you get the algorithm behind in-context learning.

An induction head finds prior occurrences of the query token and copies what came after each one. Pick the query token to see the attention pattern. The same algorithm, applied to (input, output) pairs in a few-shot prompt, is most of in-context learning.

Below: each row is a query position; each column is a context token. Notice how the query at the bottom routes most of its attention to the output tokens of nearby demonstrations — not the inputs. That's the few-shot circuit doing its job: it has located prior examples of the task and is pulling their answers into the new prediction.

in-context learning

Demonstrations bend the next-token distribution

Same prompt, same model, no gradient updates — just more example pairs in the context. Watch the probability mass slide onto the correct continuation as you add demonstrations.

shots k
P(correct) 29.2% · top guess banana
PROMPTuppercase the first letter:apple => Applecarrot => Carrotdog => Dogriver => Riverbanana => P(next token | prompt)BananabananaBANANAfruityellowwith 0 shots the prior dominates — “banana” itself looks likeliest

At k=0 the model has only the task header to go on, so its prior leans toward the literal query string. Each demo adds evidence of shape (the answer is capitalised) and pattern (the answer is the input with letter 1 uppercased). By k=4 those two signals dominate the logits and Bananawins. The model didn’t learn — its weights never moved. The context did the work.

In-context learning as a real, computable shift in the next-token distribution. Add demonstrations and the probability mass slides onto the answer that completes the demonstrated transform. The weights never move; only the conditioning text changes.

Bayesian framing as a sanity check

Another lens that pays off: treat in-context learning as approximate Bayesian inference. The pretrained model has a prior over tasks — every Q-A template, every translation pair format, every code-completion pattern it has seen during training. The prompt is evidence. Each demonstration narrows the posterior over which task is being asked, and the next-token distribution is the posterior predictive. Xie et al. (2021) made this rigorous on a toy setting and showed that in-context learning is doing something formally Bayesian, even though no Bayes rule was ever written down.

The Bayesian frame and the induction-head frame don't conflict. They describe the same thing at different levels. The induction head is a circuit that implements something that looks like Bayesian updating over a small task vocabulary. The Bayesian frame is a clean mathematical description of what the circuit ends up computing across many such heads composed together.

Scale was the unlock

GPT-2 (1.5B params) could do a little of this. GPT-3 (175B) could do a lot of it. The paper's most-shown chart is the one where benchmark performance versus model size shows the few-shot gap opening up as scale increases. Tiny models get nothing from extra demonstrations. Big models get a lot.

This is the place where the scaling-laws story and the in-context-learning story meet. Smooth pretraining loss curves don't predict that capabilities like 'do arithmetic from three examples' will appear at any particular scale. They appear when they appear — sometimes gradually, sometimes with what looks like a step function on a non-smooth metric. The paper's contribution wasn't a theory of when capabilities emerge. It was the empirical demonstration that, by 175B, a lot of them have.

scaling × in-context learning

Some abilities scale smoothly. Others wait, then jump.

Few-shot accuracy across the GPT-3 model family (real numbers from the paper). Pick a task to see its trajectory; arithmetic and unscramble are the emergence-style tasks the paper made famous.

0%25%50%75%100%125M350M760M1.3B2.6B6.7B13B175Bmodel parameters (log)few-shot accuracy80.4
sharp emergence
3-digit addition
Adds two 3-digit numbers. Flat for half the scale; emerges between 13B and 175B.
125M0.0%
350M0.0%
760M0.0%
1.3B0.3%
2.6B1.2%
6.7B8.4%
13B21.3%
175B80.4%

Look at 3-digit addition— flat at zero across six orders of magnitude, then 80% at 175B. The capability didn’t exist and then it did. Compare with TriviaQA, where every doubling of params bought real accuracy. Same family of models, same prompt format. Different shape of curve.

Real GPT-3 numbers across the model family. TriviaQA scales smoothly — every doubling of params helps. 3-digit addition is flat for six orders of magnitude, then 80% at 175B. Same training, same prompts, different shape of curve.

There's a healthy debate now about whether 'emergence' is real or an artefact of choosing accuracy as the metric. Schaeffer et al. (2023) argued that many emergent capabilities are smoother in cross-entropy than in exact-match accuracy, and the apparent step function is partly a measurement effect. They have a point. But the GPT-3 paper's behavioural claim survives the critique: by 175B you can do things with prompts that you couldn't do at 13B, and the things you can do change qualitatively, not just quantitatively. The user-facing experience is discontinuous even if the underlying loss is smooth.

The deeper point: scale didn't just make the model better at tasks. It made the model better at being prompted. In-context learning itself is a meta-capability, and that meta-capability gets stronger with scale. Small models can sometimes solve a task with finetuning that they cannot solve with prompting at any number of shots. Big models cross a threshold where the prompt becomes a viable interface for the same tasks.

It is brittle in instructive ways

If you've used these models for any real task you already know: the prompt is also the bug. Few-shot prompting is shockingly sensitive to surface details that shouldn't matter. The list of failure modes that the GPT-3 paper noted (or didn't, but should have) became a small academic industry over the next two years.

  • Order of examples changes the answer. Same demos, different order, different output — sometimes by a lot. Recency matters; the last example exerts disproportionate pull. Lu et al. 2022 ("Fantastically Ordered Prompts") showed that the worst ordering of a fixed set of demos can cost 20+ accuracy points versus the best ordering, with no other change.
  • Format of examples changes the answer. Q: ... A: ... vs Input: ... Output: ... vs colons-vs-arrows. The model has learned multiple template families and picks one based on the prompt's micro-syntax. Drop a trailing space and you can knock 10 points off.
  • Label distribution in the demos can override the actual label content. If you give it five demos that all happen to be labeled 'positive,' the sixth will tend to be 'positive' too, even if the input clearly isn't. Zhao et al. 2021 ("Calibrate Before Use") characterised this as a bias the model needs to be calibrated against, not a property of the data.
  • The exact wording of the instruction matters. Rewording an instruction that was performing at 80% can drop it to 30% with no other change. This is part of why prompt-engineering became a thing — and part of why prompt-engineering feels embarrassingly empirical compared to the rest of ML.
  • Recency bias runs in two directions. The most recent demo influences the output disproportionately, but the first demo also influences the output disproportionately. Demos in the middle get less weight. The reasons aren't fully understood; positional encodings probably play a role.
  • Token boundaries matter. Prompts that put the model in the middle of a multi-token unit (a stray BPE split, a half-word) can produce dramatically worse outputs than prompts that respect tokeniser boundaries. The cleanest path through the model is the one that aligns with how the tokeniser chunked the training data.

prompt format brittleness

The separators are part of the input distribution

Same sentiment task. Three prompt formats the model has seen during pretraining. Edit the separator strings — even a single deviation pushes the prompt off the recognised template and accuracy collapses toward chance.

Q:/A:
family prior 42%
Q: fantastic movie A: positive Q: boring sequel A:
this family recog.
0.42
total recognised
0.42
mock accuracy
65%
Input:/Output:
family prior 31%
Input: fantastic movie Output: positive Input: boring sequel Output:
this family recog.
0.31
total recognised
0.31
mock accuracy
61%
JSON
family prior 18%
{"x": "fantastic movie ", "y": "positive {"x": "boring sequel ", "y": "
this family recog.
0.18
total recognised
0.18
mock accuracy
56%
try: change "Q:" to "Question:", or drop the trailing space, or add a stray quote — watch accuracy fall.

The model isn’t reading your prompt the way you wrote it. It’s matching it against template families it saw a million times during pretraining. The closer your separators are to a known family, the more of its task knowledge gets activated. Move the separators off-distribution and you’re asking a different model.

The separators are part of the input distribution. Same task, three template families. Edit the separator characters and watch mock accuracy collapse as you walk off the templates the model recognises.

The right way to read this brittleness is not as a bug in GPT-3 specifically. It is what you'd expect from any system whose interface is the same surface as its input distribution. The prompt is both the API and the data — the model has no way to know which parts of your text are instructions and which are content. Most of post-2020 prompt engineering, instruction tuning, and chat templates is the field's response to this confusion.

A useful sharpening: think of the prompt as an arbitrary string passed to a function that has a noisy and high-dimensional notion of 'similarity to training data.' The model performs best on strings that are deeply on-distribution — strings that look like things it has seen and answered before. It performs worst on strings that are off-distribution in any direction, including directions you wouldn't think mattered (a stray emoji, an unusual whitespace pattern, a markdown header where the training data had plain text).

This is also why instruction-tuning and RLHF feel like such enormous improvements on top of base models: they push the model's mass distribution toward 'things humans actually type when they want something done', which is a much smaller and more concentrated region of input space than the full internet.

The biases the paper itself flagged

Section 6 of the paper is the limitations section, and it is unusually honest for a release of this scale. The authors documented failure modes that became central to the next half-decade of post-deployment work.

  • Repetition and contradiction: the model can loop into long stretches of repeated phrasing, or contradict itself across paragraphs, in ways that look fine token-by-token but globally fail.
  • Loss of coherence over long contexts: even within the 2048-token window of GPT-3, distant parts of the prompt could be 'forgotten' in practice, with the model's behaviour dominated by the most recent few hundred tokens.
  • Lack of grounding: the model has no access to the world. It will confidently produce text that is fluent and false. The paper called this out years before 'hallucination' became the standard term.
  • Sample inefficiency at training time: the model needed orders of magnitude more text than a human child to reach this level of fluency. Whatever GPT-3 is doing, it is not doing it the way humans do.
  • Bias along social axes: the paper documents that GPT-3 produced gender, race, and religion-coded outputs that reflect the biases of its training data. This wasn't a surprise but it was the first time the issue had been documented at this scale.

The honest reading of Section 6 is that the authors knew the system had real problems and shipped the paper anyway, because the upside — the new interface — was worth the trade. The same trade is being made by every lab now, with bigger models and more sophisticated mitigations.

The training data question

GPT-3 was trained on roughly 300 billion tokens — a filtered slice of Common Crawl, plus WebText2, two book corpora, and English Wikipedia. The paper devotes a section to contamination: the worry that the model's strong few-shot performance on a benchmark might be partly because the benchmark's test set leaked into the training data. They built a tooling pipeline to detect 13-gram overlaps between test sets and training data, then re-ran the affected benchmarks excluding the contaminated examples.

The honest finding: contamination existed, was non-trivial on a few benchmarks, and didn't change the headline conclusions much when removed. The interesting finding was the methodology. By 2023 every serious LLM paper had a contamination section because Brown et al. made it a norm. The norm was needed because the alternative — running a 175B model trained on 'all of the internet' against benchmarks scraped from 'all of the internet' — was structurally guaranteed to overestimate capability without active checking.

The deeper question that contamination raises is: what counts as in-context learning if the model has seen the test in some form during pretraining? The cleanest answer is that it doesn't matter for the user. The user wants the right answer; the model gives the right answer; the route from prompt to answer goes through whatever computational machinery the weights and the context together support. Whether that machinery is doing 'genuine generalisation' or 'sophisticated retrieval from memorised training data' is a research question, not a deployment question. But it is a research question worth taking seriously, because the two have very different scaling stories. Memorisation scales with training-set size. Generalisation scales with model capacity. Untangling them is hard, and Brown et al. got the conversation started.

Where the field went next

Once the prompt-as-program framing was visible, every direction extended naturally. If a prompt is a program, you can write smarter programs.

Chain-of-thought

Chain-of-thought (Wei et al. 2022) noticed that demonstrations including reasoning steps led to outputs that included reasoning steps, and the reasoning steps made the final answer better. The 'prompt program' got a working-memory section. The trick was simple — show the model how to think out loud — but the effect was large enough that it qualified as a new capability rather than a tweak. On grade-school math, CoT roughly tripled accuracy at GPT-3 scale. On harder reasoning, the effect was bigger.

CoT deepened the prompt-as-program framing. Now the program had local variables (intermediate steps), control flow (conditional reasoning paths), and a return statement (the final answer). The model still didn't know it was running a program; the user did, and that was enough.

Retrieval and the long context

Retrieval-augmented generation noticed that you could fetch relevant text and paste it into the prompt at inference time, giving the in-context program access to a database. This was the cleanest answer to the grounding problem in Section 6: if the model can't know the world, give it the world via the context. RAG is a 2020-era idea (Lewis et al.) but it is essentially a corollary of the GPT-3 framing — if prompts are programs, then prompts can include data, and data can be looked up at runtime.

Long-context models pushed this further. When the context window is 200K tokens you can paste an entire book, an entire codebase, an entire user history. The boundary between 'prompt' and 'memory' starts to blur. The boundary between 'in-context learning' and 'in-context working' blurs with it.

Tool use and agents

Tool use and function calling noticed that the prompt program could emit structured calls to external systems and consume their results, turning the model into a controller in a larger loop. This is the most direct extension of prompt-as-program: now the program can do I/O. Agents are this idea taken to its limit: the prompt is now a program that decides what other prompts to issue, in what order, against what tools, to accomplish a goal.

Agent loops inherit every brittleness of single-prompt few-shot, then compound it. A small format mismatch in turn 1 propagates into the model's interpretation of turn 2. A demonstration of a tool call in the system prompt biases every subsequent call. Recency bias in a 30-turn conversation can mean the model has effectively forgotten its system prompt by the end. The post-deployment world of agentic systems is partly a story of building scaffolding to compensate for the prompt-program brittleness Brown et al. measured in 2020.

Instruction tuning, RLHF, and the system prompt

Instruction tuning and RLHF reshape the base model so that natural language instructions (rather than examples) work better. That's a different style of in-context program — one that talks to the model in imperative English rather than demonstrating by analogy. The mechanism is still the same: text in the context window steers the computation. The training has just made some forms of text steer more reliably.

The system prompt, as it exists in modern chat APIs, is a curious artefact of this evolution. It is structurally a normal part of the prompt — the model doesn't have a separate 'system' parameter inside its weights — but it is positioned and trained to behave as if it had higher authority than the user message. That 'as if' is doing all the work. When jailbreaks succeed, they succeed by breaking the 'as if' — by getting the user message to read, to the model, more authoritatively than the system message. Brown et al. didn't anticipate system prompts, but they laid out exactly the substrate that made system prompts both possible and fragile.

A worked intuition: what the model is doing on a sentence

Step inside the model for a second. You hand it a few-shot prompt for a sentiment task. Six tokens of instruction, then three demos of the form text => positive or text => negative, then a fresh sentence followed by =>. The model has 96 transformer layers (in the 175B case) and 12,288 hidden dimensions. What is happening between the input ids and the output logits?

Early layers are doing what they always do: turning token ids into context-sensitive embeddings, computing local syntax, identifying named entities, building up a representation of who is doing what to whom in each demo. By the middle layers, the model has roughly figured out that there are repeating units in the prompt — three of them with similar shape, separated by newlines. This isn't done by an explicit parser; it's done by attention patterns that have, during pretraining, learned to discover repeating structure in text.

Upper layers route demos into the answer

By the upper layers, induction-style heads are firing. Some heads have located the => separator across the three demos and are using it as an alignment anchor. Other heads are reading the demo outputs (positive, negative) and routing those token embeddings forward to influence the prediction at the final position. The final attention pattern looks roughly like: the last position attends primarily to the => of each demo and to the output token immediately after each =>, with weights that depend on how similar the new sentence's representation is to each demo's input representation.

The output is then a soft mixture of the demonstrated outputs, weighted by similarity. If the new sentence is more similar to the negative demo, mass flows toward negative. If it's more similar to the positive demos, mass flows toward positive. None of this is hand-coded. All of it is the kind of computation that happens to be useful for next-token prediction on internet text where similar repeating structures occur, and that has therefore been amortised into the weights.

This is a stylised picture and the real circuits are messier. But it is closer to how researchers in mechanistic interpretability talk about these systems than the older mental model of 'neural network as black-box function.' The model has internal structure. The internal structure is, in many cases, computing something we can name. In-context learning is one of the things it computes.

Practical lessons for prompt writers

If you take in-context learning seriously as a programming model, certain prompt-writing heuristics fall out for free. None of these are revelations to a working practitioner, but they all derive cleanly from the framing rather than being folk wisdom.

  1. Match the format the model was pretrained on, not the format you'd prefer. If the model has seen Q:/A: a million times, use Q:/A:. Don't invent your own separators because they look prettier in your codebase. The model is reading your prompt as a strange dialect of training data.
  2. Put the most informative demo last. Recency bias is real. The last demo gets the most weight, so put your sharpest example there. (And then check that the answer doesn't change if you reorder; if it does, you're relying on order in a way that will bite you in production.)
  3. Keep label distributions roughly balanced. Five positive demos and one negative demo is not a balanced few-shot prompt — it's a soft instruction to predict positive. If your real distribution is imbalanced, balance the prompt anyway and let the input do the talking.
  4. Watch your tokeniser. A leading space, a trailing newline, an emoji that splits into three tokens — these change the model's view of your prompt in ways you cannot see from the rendered text. When debugging, look at the token ids, not the string.
  5. Test prompt variants empirically, not by reading them. What looks like a clean reformulation to you may move accuracy by 15 points. The only way to know is to run it. Build the eval harness before the prompt; you'll write better prompts because you can measure them.

These are debugging heuristics for a system you cannot single-step through. They're closer to numerical-methods folklore than to software engineering. The model is the function, the prompt is the input, and the output is what you measure. You don't get to read the call stack.

What the paper got right that wasn't obvious

It's worth listing the predictions the paper made that turned out to be precisely correct, because in 2020 they were not.

  1. A single model would replace task-specific systems. This was a heretical claim. By 2024 it was the default deployment pattern for everything from autocomplete to translation to summarisation.
  2. Capabilities would emerge from scale without architectural change. GPT-3 is, architecturally, GPT-2 made bigger. The paper's main innovation was 'do the same thing, but a hundred times more.' That was not the consensus on how progress would happen.
  3. The interface would be natural language. 2018-era predictions had ML systems being controlled by structured inputs, JSON schemas, declarative configs. Brown et al. argued you could just type to it. They were right.
  4. Few-shot would mostly replace finetuning for many tasks. This one took a few years to fully arrive but it arrived. Most production NLP today is prompt engineering of some kind, often with retrieval, sometimes with light finetuning on top of an instruction-tuned base.

And the predictions the paper didn't make but that the framing implied:

  • The prompt becomes the product. Anthropic, OpenAI, and others now ship 'system prompts' and 'tool definitions' as the primary product surface. The model is a commodity; the prompt is the differentiated part.
  • Evaluation becomes the bottleneck. When the model can do anything, knowing whether it did the right thing becomes the hard problem. The paper's enormous evaluation grid was a preview of the next decade's biggest engineering problem.
  • Safety becomes interface design. Most of safety-by-RLHF is about making certain prompts not work. That's a UI problem disguised as an ML problem.

Reading the paper

The paper is long. You don't need to read all of it. The parts that matter are: Section 1 (the framing of zero/one/few-shot), Section 2.1 (the model architecture, which is just GPT-2 made huge), Section 3.1-3.2 (cloze and translation results, which establish the basic pattern), Section 3.9 (the news-article generation experiments, where humans couldn't reliably tell the model's articles from real ones — the moment 'this is going to be a product' became visible), and Section 6 (limitations, which is the most prescient few pages of the whole thing).

Skip the benchmark tables unless you want a 2020 snapshot of what was hard for a 175B model. They date badly. The framing does not.

Humans do not generally require large supervised datasets to learn most language tasks — a brief directive in natural language (e.g. "please tell me if this sentence describes something happy or something sad") or at most a tiny number of demonstrations is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence.

Brown et al., 2020 (Section 1)

The exercise that pays for itself: pick a task you don't think a frozen model should be able to do. Write three demonstrations of it. Paste them into a current model. You will learn more about what in-context learning is by watching a 5B-parameter model handle your weird task with three examples than by reading any number of follow-up papers. Then change the order of the demonstrations. Then change the separator. Then drop one of the demos. Watch the answer move. That is the GPT-3 paper, lived rather than read.


Five years on, the paper reads as the moment the field's centre of gravity shifted from training to prompting, from architectures to data, from task-specific to general-purpose. The shift wasn't complete in 2020 and isn't complete now. But every developer who has ever written a system prompt, every researcher who has run an eval grid, every product manager who has shipped a chat feature is working in the conceptual world Brown et al. opened up. The interface to machine learning became text. The field has been figuring out what that means ever since.