Machine Learning / 2017 / arXiv

Attention Is All You Need — Vaswani et al.

Recurrence said: compress the past into one hidden state and hope. Attention said: keep everything around and let each token decide what it needs.


Before this paper, sequence models had a bottleneck-shaped problem. An RNN reads tokens one at a time and folds everything it has seen so far into a single fixed-size hidden state. By the time you reach token 200, the contribution of token 1 has been multiplied by a recurrent matrix 199 times. In practice this means the network is constantly fighting its own forgetting. You can patch it with LSTMs, GRUs, attention-on-top-of-RNN, but the underlying ask is the same: compress the past, then make a decision.

There's a second, deeper problem with that picture. Recurrence is sequential by definition. To compute the hidden state at step t, you need the hidden state at step t-1. Which means you cannot parallelize the forward pass over the time axis. On a modern GPU that's catastrophic — your fancy matrix-multiply hardware sits idle while you walk a sentence one token at a time. People got around this with truncated backprop, bucketing, smarter LSTM kernels, but the architecture was at war with the hardware.

And there was a third problem: long-range dependencies. The sentence "The keys that the woman who ran across the street dropped into the fountain were rusty" has a verb (were) that has to agree with a noun (keys) thirteen tokens earlier. An RNN has to carry that singular/plural fact through every intervening hidden state, dodging interference from woman (singular), street (singular), fountain (singular). LSTM gates help, attention-on-top-of-RNN helps more, but the underlying mechanism (propagate, don't address) keeps fighting the structure of language. A great many linguistically interesting phenomena are long-range dependencies.

The transformer is what happens when you stop compressing. Keep all the past tokens around as vectors. When token i needs information, it broadcasts a question to every other token, gets graded answers back, and mixes them. There is no hidden state being squeezed through a narrow pipe. The pipe is the whole sentence.

That's the entire idea. Everything else — the multi-head business, the positional encodings, the feed-forward layers between attention blocks — is engineering on top of that one move. So let's actually do the move.

The move: queries, keys, values

Take a sentence and turn each token into a vector — call it the embedding. Now from each token vector produce three new vectors via three learned linear projections: a query q, a key k, and a value v. The names are good. The query is what this token is asking about. The key is what each token offers as a label. The value is the actual content this token is willing to hand over.

Imagine you walk into a library and ask the librarian for books about Roman aqueducts. Your spoken request is the query. Each book on the shelves has a spine label — Roman engineering, Italian cookbooks, space exploration — and these labels are the keys. The actual contents of the books are the values. The librarian compares your query against every spine, picks the matches, and hands you the contents of those books, weighted by how good a match each spine was. That's attention. The only difference is that everything is a vector and everything is differentiable.

Concretely: if your embedding is a 512-dimensional vector and d_k is 64, then W_Q, W_K, and W_V are each 512×64 matrices. Multiplying the embedding by each of them gives you a 64-dim query, a 64-dim key, a 64-dim value. Three matrices, three vectors, one token underneath. That's it. There's no recurrence, no convolution, no exotic structure. Three linear layers.
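The shapes above can be sketched in a few lines of NumPy. This is only an illustration: the weight matrices here are random stand-ins for learned parameters, and the scaling by sqrt(d_model) is an assumption made to keep the numbers tame, not something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64            # sizes quoted in the text

# Hypothetical stand-ins for learned weights: random, not trained.
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

x = rng.normal(size=(d_model,))   # one token's embedding

q = x @ W_Q   # what this token is asking about
k = x @ W_K   # the label this token advertises
v = x @ W_V   # the content this token hands over

print(q.shape, k.shape, v.shape)  # (64,) (64,) (64,)
```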

Figure (interactive demo): Q / K / V projections. One vector, three roles. The same input embedding is sent through three learned matrices: W_Q shapes what this token is asking, W_K shapes what each token advertises, and W_V shapes what each token actually contributes. Adjusting the entries of W_Q rotates the query and changes which key it matches best.

Why three projections instead of one

Why three different projections? Couldn't a token just dot-product its embedding against everyone else's embedding? It could. But then matching and content would be locked together. The model would have no way to say I'm looking for a verb but I want to receive the verb's tense and aspect, not its surface form. Splitting Q, K, V into separate projections gives the model a free axis: matching is one subspace, content is another. This decoupling is what lets a single attention head do something semantically interesting rather than just nearest-neighbour-by-embedding.

Attention then does the obvious thing. For token i, compute the dot product of its query q_i against every other token's key k_j. That gives you a list of compatibility scores — how much does i's question match j's label? Pass those scores through a softmax so they become a probability distribution. Use that distribution to take a weighted sum of all the value vectors. The result is i's new representation.

In one line: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V. Five symbols and a normalization. That's the paper's core equation. If you understand that line, you understand 90% of every transformer that has shipped since 2017.

Unpacking the dimensions

Let's actually unpack the dimensions because that's where the picture clicks for most people. Suppose you have n tokens and each token's embedding is d_model dimensional. After the projections, Q is n × d_k, K is n × d_k, V is n × d_v. The product Q Kᵀ is then n × n — one row per query, one column per key, one scalar per pair. Apply softmax row-wise, multiply by V (which is n × d_v), and you get n × d_v: a new vector for each token, the same shape it came in. Tokens go in, refined tokens come out, the layer is shape-preserving. That's why you can stack them.
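The shape bookkeeping can be checked directly. The following is a minimal NumPy sketch of the core equation, assuming random Q, K, V in place of real projections; the helper names `softmax` and `attention` are chosen here for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): one score per query/key pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # (n, d_v) and (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 64, 64
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = attention(Q, K, V)
print(weights.shape, out.shape)  # (6, 6) (6, 64)
```

Note that nothing in `attention` loops over positions: all n queries are scored against all n keys in one matrix product.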

And note what's not in that equation: time, recurrence, ordering. The whole computation is a few large batched matrix multiplications plus a row-wise softmax. On a GPU, that is a handful of kernel launches (or a single fused one), not a loop over the sequence. That's the parallelism that broke RNNs' back.

Read it again with the right framing: every token is doing a soft database lookup. The query is the search; the keys are the table's index; the values are the rows. Softmax just makes "closest match" differentiable so we can train the projections. The closest analogy isn't a neural network — it's a hashmap with continuous keys.

Below: pick a query token and watch which keys it pulls from. The temperature slider sharpens or softens the softmax. Crank it hard and the head becomes a pointer — one source dominates. Soften it and the head becomes a smooth mixer — evidence blends across the row. A real model has many heads at every layer, and different heads learn to behave at different points on this spectrum.
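The pointer-versus-mixer spectrum is easy to reproduce numerically. A small sketch, assuming a made-up row of four scores and a hypothetical helper `softmax_with_temperature` that divides the scores by the temperature before the softmax:

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    # Low temperature exaggerates score gaps; high temperature flattens them.
    scaled = np.asarray(scores, dtype=float) / temperature
    scaled = scaled - scaled.max()
    w = np.exp(scaled)
    return w / w.sum()

scores = np.array([2.0, 1.0, 0.5, 0.0])   # made-up scores for one query row

sharp = softmax_with_temperature(scores, 0.2)   # close to a one-hot pointer
soft = softmax_with_temperature(scores, 5.0)    # close to a uniform average

print(np.round(sharp, 3))
print(np.round(soft, 3))
```

In the standard formula the division by √d_k plays a similar role: it keeps the scores from growing with dimension and pushing the softmax into the sharp regime.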

Figure (interactive demo): scaled dot-product attention. A query reads the row, the row mixes the values. Each row of the attention matrix is one query, and each cell shows how much of that query's output comes from the corresponding column's value vector. Every row sums to 1 because of the softmax. Lowering the temperature toward 0.2 sharpens a row into a pointer; raising it past 2.0 dissolves it toward a uniform mixer. The demo shows the same six tokens routed by three heads with different Q/K projections.

Why heads, and why more than one

If you only had one attention head, every token's question would have to be a single vector. But "what should I attend to" is rarely one question. In a sentence like the enzyme binds RNA, then folds, the word folds might want to know its grammatical subject (probably enzyme), the molecule being acted on (RNA), and roughly how many tokens back the verb chain started. These are different lookups. They want different keys to light up.

Multi-head attention runs h parallel attention operations on lower-dimensional projections of the same input, then concatenates the results and passes them through one more linear layer. Each head gets its own W_Q, W_K, W_V, so it can learn its own kind of question. The original paper used h=8 with d_k=d_v=64 per head, which concatenate back to the model's 512 dimensions. Modern models scale both axes: the largest GPT-3 model (175B) uses 96 heads of dim 128 in every layer.

Cheap parallel specialization. Eight different ways of asking "who matters here?" run at once, each writing its own answer back, and the final linear layer learns to weigh them. Crucially, each head is computing on a smaller subspace (d_k is 64, not 512), so the total cost of eight heads is roughly the same as one full-width head. You get diversity nearly for free.
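A compact way to see the split-and-concatenate structure is to implement it. The sketch below assumes the common trick of storing all heads' projections in one d_model × d_model matrix and reshaping; the weights are random placeholders, and `multi_head_attention` is a name chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of the projection per head.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores) @ V                             # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_O                                     # final mixing layer

rng = np.random.default_rng(0)
n, d_model, h = 5, 512, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)

# Projection weights: 8 heads of width 64 cost the same as 1 head of width 512.
params_multi = 3 * h * d_model * (d_model // h)
params_single = 3 * d_model * d_model
print(Y.shape, params_multi == params_single)  # (5, 512) True
```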

Figure (interactive demo): multi-head attention. Four heads, four different routings, one residual stream. Four hand-tuned heads run over the sentence "The scientist who studied RNA folds it"; each head's Q/K projections route information by a different relationship (position, syntax, locality, coreference), and the residual stream receives the sum of all heads.

In Head 4, the pronoun it reaches back to RNA. The demo's heads are hand-tuned, but in a trained model nobody tells the network what coreference is: the Q/K projections for such a head simply learn to make those two tokens dot-product to a large number. With eight or sixteen heads in a real model, you get many such specialized lookups, all running in parallel.

When people open a trained transformer and stare at the heads, they find some that track syntax (which noun goes with which verb), some that do positional things (attend to the previous token, attend to the start), some that resolve coreference ("it" → the right antecedent), and a lot that are doing things we don't have clean names for. Mechanistic interpretability work, from Anthropic and other groups, has found induction heads that implement a specific copying algorithm; successor heads that increment numbers; name-mover heads that copy the right person's name to the position where it is predicted. None of these were designed in. The architecture allows specialization without anyone hand-coding it; gradient descent finds the structure.

There's a subtle point worth dwelling on: heads in a single layer can't talk to each other. Each head computes its own attention pattern independently, and they only get combined by being concatenated and passed through the output projection W_O. If two heads need to coordinate — say, one identifies the subject and another routes information based on whether-something-is-a-subject — they have to do it across layers, with the residual stream as the medium. Head 3 in layer 5 reads what head 2 in layer 4 wrote. The depth of the network is, partly, the budget for compositional reasoning across heads.

This is one of the things mechanistic interpretability has actually nailed down. Induction heads, the circuit that lets a transformer copy patterns from earlier in the context, turn out to be a two-layer construction. A layer-1 head writes "the previous token was X" into the residual stream. A layer-2 head reads that signal and uses it to attend to the token that followed the previous occurrence of the current token, then copies it forward. Neither layer can do induction alone. The composition is the algorithm. You can watch this circuit form during training in a sharp transition; before, the model can't do in-context copying, after, it can. The induction-head literature calls this a phase change. It is sometimes loosely compared to grokking, although grokking strictly refers to generalization that arrives long after the training loss has already converged.

Position has to be added in by hand

Here's a thing that surprises people: attention is permutation-equivariant. Shuffle the tokens and the attention math doesn't care — it would compute the same outputs in a shuffled order. The set of (Q, K, V) triples is treated as a set. Which means, by default, the transformer has no idea that the dog bit the man is different from the man bit the dog.
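The equivariance claim can be verified in a few lines. A minimal sketch, assuming single-head self-attention with random (untrained) weights and no positional information: shuffling the input rows shuffles the output rows in exactly the same way.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))                 # six token vectors, no position info
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                   # a random reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector; only the row order differs.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)  # True
```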

The fix is positional encodings. Add a position-dependent vector to each input embedding so that token #3 looks different from token #17, even if the words are identical. The original paper used fixed sinusoids of varying frequency — dim 2k gets sin(pos / 10000^(2k/d)), dim 2k+1 gets cos of the same. The choice looks weird until you stare at it.

Why sinusoids of geometrically-increasing wavelength? Because two such vectors have an inner product that depends only on the difference of their positions, not the absolute positions. Which means that even though the encoding is absolute, the attention dot product can read out relative position for free. The model gets distance-awareness without anyone teaching it the concept of distance.
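The relative-position property follows from sin(a)sin(b) + cos(a)cos(b) = cos(a − b), applied to each sin/cos pair. It can also be checked numerically. The sketch below assumes the interleaved layout described above (even dimensions sine, odd dimensions cosine); `sinusoidal_pe` is a name chosen for this example.

```python
import numpy as np

def sinusoidal_pe(num_positions, d):
    pos = np.arange(num_positions)[:, None]     # (P, 1)
    k = np.arange(d // 2)[None, :]              # (1, d/2)
    angles = pos / (10000 ** (2 * k / d))       # (P, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(64, 32)

# Same offset (5 positions apart) at two different absolute positions.
dot_a = pe[3] @ pe[8]
dot_b = pe[40] @ pe[45]
print(np.isclose(dot_a, dot_b))  # True
```

The inner product of a position with itself is d/2 (here 16), since every sin² + cos² pair contributes 1.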

Figure (interactive demo): sinusoidal positional encoding. How a token learns where it is. The heatmap is the positional encoding matrix: rows are positions, columns are dimensions. Low-index dimensions wiggle fast (carrying fine-grained position); high-index dimensions wiggle slowly (carrying coarse position). A second panel plots the similarity between PE(p) and PE(p+Δ) as a function of Δ: nearby positions have a high inner product, and the similarity broadly falls off as Δ grows. Because attention scores are themselves dot products, the model can read distance from the PE-augmented embeddings without ever being told what position means. RoPE keeps this sinusoidal flavor, rotating queries and keys so that their dot product depends on relative position.

Later work tried learned positional embeddings (BERT), then relative position biases (T5), then Rotary Position Embeddings (RoPE), used in LLaMA and most modern open models, and ALiBi, which doesn't add encodings at all but applies a distance-dependent bias to the attention scores. The architecture choice matters less than the fact that something has to break the symmetry. The sinusoidal version is the original, and it still works. Because it is a fixed function rather than a lookup table, it is defined for positions longer than any seen in training; the authors hypothesized this might help the model extrapolate to longer sequences, though in practice such extrapolation has turned out to be limited.

There's a deeper question that the positional encoding choice raises: what is position, in language? It's not just "the i-th word." Linguistic structure is partly tree-shaped (a relative clause is a constituent, even if it's seven tokens long) and partly graph-shaped (anaphora links it back to a referent). Neither sinusoidal nor RoPE captures these. They give you a one-dimensional ruler. Real syntactic structure is multidimensional and the model has to learn to construct it on top of the linear position signal. This is one of the things multi-head attention is, in practice, doing — building tree-like structure across layers using flat positional cues as scaffolding.

Stack the block, add residuals

One attention layer mixes information across positions once. To get depth, you stack: attention → feed-forward → attention → feed-forward, and so on. The feed-forward is a per-token MLP — it doesn't move information between tokens, it just lets each token's vector be transformed. Specifically, two linear layers with a ReLU (or GELU, or SwiGLU in modern variants) in between, with the inner dimension typically 4× the model dim. So the FFN has way more parameters than the attention itself; in a typical transformer two-thirds of the parameter count lives in FFN layers.
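The two-thirds figure is simple arithmetic on one block. A back-of-the-envelope sketch, assuming d_model = 512, a 4× inner width, and counting weight matrices only (biases, layer norms, and embeddings are ignored):

```python
d_model = 512
d_ff = 4 * d_model   # inner FFN width, 4x the model dim as in the text

# Weight matrices only; biases and layer-norm parameters are ignored.
attention_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff            # up-projection and down-projection

ffn_share = ffn_params / (attention_params + ffn_params)
print(attention_params, ffn_params, round(ffn_share, 3))  # 1048576 2097152 0.667
```

The ratio does not depend on d_model: attention contributes 4·d² weights and the FFN 8·d², so the FFN share of a block is 8/12.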

Why this alternation? Attention does cross-position mixing — who is talking to whom — but it's a fundamentally linear operation in V (softmax weights are scalars; the values get linearly combined). The feed-forward provides per-position non-linear processing — now that I have the right info, what do I do with it. Mixing across tokens, then computing on each token, then mixing again. That's the rhythm.

Residuals make depth trainable

Two structural details matter beyond the alternation. First, residual connections: every block computes x + f(x) rather than f(x). This means each layer is editing the running representation rather than replacing it. The thing you can think of as flowing through the network is a per-token vector — sometimes called the residual stream — and each attention head reads from it and writes to it. Layers add their contribution; the original information is preserved by default.

This residual-stream picture is more than aesthetics. Without residuals, a deep transformer wouldn't train — gradients can't flow through dozens of multiplicative layers. With them, every layer has a clean gradient path back to the input, and every head's contribution can be isolated and inspected. A lot of mechanistic interpretability work falls out of this picture: "which heads write the answer to who is the subject of this verb?" becomes a well-defined question because each head's write to the stream is an additive term you can read off directly.

Second, layer norm, which keeps the magnitudes from exploding as the residual stream accumulates contributions. The original paper put layer norm after each residual addition (post-norm). Almost everyone now puts it before each sub-layer (pre-norm) because pre-norm trains more stably at depth. It's a one-line code change with enormous downstream consequences for how deep you can go.
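The difference between the two placements really is about one line. A minimal sketch, assuming a layer norm without learned gain and bias and a toy `sublayer` standing in for either attention or the feed-forward network:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalizes each token vector; learned gain and bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    # Original paper: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_step(x, sublayer):
    # Later variant: normalize the sublayer input, leave the stream untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Toy stand-in for attention or the FFN (hypothetical, untrained).
    return np.maximum(h @ W, 0.0)

x = rng.normal(size=(4, 8))      # 4 tokens, 8-dim residual stream
y_post = post_norm_step(x, sublayer)
y_pre = pre_norm_step(x, sublayer)
```

In the pre-norm version, `x` passes through the addition unmodified, which is the clean identity path the residual-stream picture relies on.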

The residual stream as scratchpad

Stack a dozen of these blocks and you have an encoder. The residual stream after the last layer is your contextualized representation: each token now knows what it is, where it sits, what its neighbours are, and a great deal about how it relates to the rest of the sentence. That stream is what gets handed to whatever's downstream — a classifier, a decoder, a language modeling head.

A useful mental picture: think of the residual stream as a kind of running scratchpad per token. Each layer reads the scratchpad, computes a delta, and adds the delta back. Some heads are scribbling syntactic facts (this is the subject); some are scribbling semantic facts (this refers to the same entity as token 7); the FFN is doing per-token computation that doesn't need other tokens (if this is the past tense of "go", increment the temporal pointer). After 96 layers of GPT-3, that scratchpad contains, encoded somehow in 12288 dimensions, everything the model is willing to commit to about each token's role. The final unembedding turns the last token's scratchpad into a probability distribution over the vocabulary — and that distribution is the model's prediction.

Encoder vs decoder: same gears, different masks

The original paper proposed an encoder-decoder pair for translation. The encoder eats the source sentence and produces a stack of contextualized vectors. The decoder generates the target sentence one token at a time, with each new token attending both to its own (already-produced) prefix and to the encoder's output. That second cross-attention — decoder-to-encoder — is how Hello in the source gets into Bonjour in the target.

But here's the thing: the encoder block and the decoder block are nearly identical machinery. The only architectural difference inside the self-attention is a mask. In the encoder, every token can attend to every other token. In the decoder, token i can only attend to tokens 0 through i. It must not see the future, because at inference time the future doesn't exist yet. Implementation-wise, you set the entries above the diagonal of the score matrix (column j > row i) to negative infinity before softmax, so those weights become exactly zero.
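The masking step can be sketched as follows, assuming random queries and keys and a hypothetical helper `causal_attention_weights`. The diagonal stays unmasked, so every row has at least one finite score.

```python
import numpy as np

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal: positions j > i, i.e. the future.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)          # exp(-inf) == 0, so masked cells vanish
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(6, 8))
W = causal_attention_weights(Q, K)

print(np.round(W[0], 3))   # row 0 can only see itself: [1. 0. 0. 0. 0. 0.]
```

Dropping the `np.where` line turns the same function back into encoder-style bidirectional attention.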

Figure (interactive demo): encoder vs decoder. The same matrix; one shape of mask makes it a language model. In encoder mode, every token can read every other token. In decoder (causal, GPT-style) mode, the upper triangle is masked out, so token i is only allowed to look at tokens 0..i: row 0 (The) can only look at itself and collapses to a single 100% cell, while row 5 (mat) sees all six tokens. Same attention scores, two masks. That single change is the difference between BERT (encoder-only) and GPT (decoder-only).

This is the cleanest possible illustration of how generic the transformer block is. In mid-2018, GPT stripped out the encoder and used just the decoder for autoregressive language modeling. A few months later, BERT did the opposite, stripping out the decoder entirely and using just the encoder for representation learning; the contextualized vectors are useful by themselves. By 2020 the field had mostly converged on the decoder-only design: same block, causal mask everywhere, and a language modeling objective. That's what GPT-3, Claude, LLaMA, and basically every frontier LLM today are.

The encoder-decoder split lives on in places where you have a clear input/output asymmetry — translation, summarization, speech recognition. T5 and Whisper are the canonical examples. But for the open-ended predict the next token objective that produced the foundation model era, decoder-only ate the world. One mask, two model classes.

What the result actually proved

The headline number was BLEU on machine translation — 28.4 on WMT 2014 English-to-German, beating the previous best by 2 BLEU points while training in a fraction of the wall-clock time. That number is not why the paper matters. The paper matters because it removed the sequential bottleneck of RNNs. Attention is fully parallel: every (i, j) pair in the matrix can be computed independently, which is exactly the workload a GPU wants. RNNs cannot be parallelized over the time axis without changing what they compute; transformers can, and the result is one big batched matmul.

This is the thing that made scaling work. The reason transformers ate the world isn't that they are smarter per parameter — by some measures (sample efficiency, parameter efficiency on small data) they aren't. It's that they let you spend a million GPU-hours on one model and actually use them. RNN training throughput plateaus because each token has to wait for the previous one. Transformer training throughput is bottlenecked only by memory bandwidth and arithmetic, both of which scale with hardware spend. Every later result you've heard of — GPT-2, GPT-3, ChatGPT, Claude, Gemini, all of them — is downstream of this one architectural decision making training compute spendable.

Look at it from the chip designer's perspective. Modern GPUs and TPUs are basically very fast matrix multiplication engines with some memory hanging off them. The transformer's forward pass is a sequence of huge matmuls (Q Kᵀ, the attention-weighted sum over V, then the FFN's two matmuls) separated by a softmax and cheap pointwise operations. The architecture is the workload the hardware was already optimized for. RNN was the wrong shape; transformer is the right shape; the gap shows up as orders of magnitude in tokens-per-second-per-dollar.

The line of descendants

It's worth tracing the arc, because most of the language model era is variations on this 2017 block. 2018: ELMo (still RNN, but contextual), then GPT-1 (decoder-only, autoregressive, 117M params), then BERT (encoder-only, masked language modeling, 340M). Both of the latter were trained on raw text and fine-tuned for downstream tasks. Both used the transformer block essentially unchanged.

2019: GPT-2 (1.5B), the first model to make people uncomfortable about generative text. T5 (encoder-decoder, 11B), which framed every NLP task as text-to-text. Same block, more of it. 2020: GPT-3 (175B), the moment the field realized that scaling the same architecture, with no algorithmic changes, kept producing qualitatively new behaviors — in-context learning, few-shot adaptation, chain-of-thought once you ask for it. The block didn't change. The compute did.

Scale, not algorithm, drove 2018-2022

That last sentence is the thesis of the modern era. From 2018 to 2022, the number of architectural innovations that made it into production was vanishingly small. Pre-norm vs post-norm. RoPE replacing learned positions. SwiGLU replacing GELU. Multi-query attention (and, from 2023, grouped-query attention) for inference efficiency. None of these are conceptual leaps; they're refinements. The compounding gains came from scale: more parameters, more tokens, more compute, better data curation, longer training. Kaplan and then Chinchilla quantified the recipe. The transformer block sat at the center of all of it, getting bigger but not different.

2022 onwards: ChatGPT, Claude, Gemini, LLaMA. Internally, all decoder-only transformers. The differences are: how big, what data, what alignment procedure (RLHF, DPO, constitutional AI), what context length, what positional encoding (almost all RoPE now), what activation function (almost all SwiGLU now), what attention variant for efficiency (multi-query, grouped-query). Take a 2017 transformer block and a 2025 LLaMA block side by side; you can read both with the same mental model. The deltas are quantitative.

The block ate every other field

Outside of language: ViT (vision transformer) showed that the same block, fed image patches as tokens, beats convolutional networks at scale. AlphaFold 2 uses transformer-style attention over protein residues. Diffusion models for images stack transformers (DiT). Speech, code, music, video — same block, different tokenization. The transformer turned out to be a general-purpose mixer of sequences-of-vectors, and most of the world is reducible to sequences of vectors.

It's worth pausing on this. In 2017 there was a vision community building convnets, an NLP community building RNNs, a speech community building HMM-DNN hybrids, a graph community building message-passing networks, and a reinforcement learning community building actor-critic networks. By 2024, most of those communities had quietly migrated their architectures to some flavor of transformer. The block ate the menu. Whether that's because attention is genuinely the right primitive for sequence modeling, or because it's the primitive that fits hardware best and so attracts the most engineering effort, is a real question. Probably both.

What it doesn't fix

The attention matrix is n × n in sequence length. Compute attention over a 100k-token document and you are storing a 10-billion-entry matrix, per head and per layer. That quadratic cost is the headline limitation. Most of the systems work since 2017 has been about defusing it. FlashAttention rearranged the algorithm so the matrix never fully materializes: you tile it through on-chip SRAM and recompute on the backward pass, trading extra FLOPs for far less memory traffic. That, more than any other single change, is what made 100k+ context lengths practical.
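The size of the problem is worth computing once. A quick sketch, assuming 16-bit floats (2 bytes per score) and counting a single head in a single layer:

```python
n = 100_000                 # tokens in the document
entries = n * n             # one score per (query, key) pair
bytes_fp16 = entries * 2    # 2 bytes per entry, assuming 16-bit floats

print(entries)              # 10000000000
print(bytes_fp16 / 1e9)     # 20.0 (GB, for one head in one layer)
```

Doubling the context length quadruples this number, which is why avoiding materializing the full matrix matters so much.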

Sparse and linear attention variants try to keep the asymptotic cost down by pruning or kernelizing the dot product. Performer, Linformer, BigBird, Longformer — many of these work, none have displaced full attention at the frontier. The general lesson seems to be: full attention, executed efficiently, beats clever sparse approximations on quality, and the systems work has caught up enough that quadratic isn't the wall it once was.

The KV cache rewrites inference economics

KV-caching makes inference linear-per-token even though attention is still quadratic-per-prompt. Once you've processed the prompt, you save the keys and values; for each new generated token you only compute one new query against the existing K and V. This is why your chatbot can extend a conversation cheaply — it's not re-attending to the whole history every step, it's appending to a cache. The cache is also why long prompts are expensive (you have to fill it) and why streaming generation is fast (you're amortizing one matmul per token).
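The claim that caching changes the cost but not the result can be checked directly. The sketch below assumes a single head with random Q, K, V and compares token-by-token decoding against full causal attention; a real implementation would preallocate the cache rather than grow it with `np.vstack`.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ V

rng = np.random.default_rng(0)
n, d = 7, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Incremental decoding: append one key/value per step,
# then score a single query against everything cached so far.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for t in range(n):
    K_cache = np.vstack([K_cache, K[t]])
    V_cache = np.vstack([V_cache, V[t]])
    scores = Q[t] @ K_cache.T / np.sqrt(d)    # (t + 1,) scores, not (n, n)
    outputs.append(softmax(scores) @ V_cache)
cached = np.stack(outputs)

full = causal_attention(Q, K, V)
print(np.allclose(cached, full))  # True
```

Each decoding step does work proportional to the current cache length, which is the linear-per-token behavior described above.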

The KV cache also explains a strange thing about modern LLM economics: input tokens are typically priced several times cheaper than output tokens. Input tokens fill the cache once, in parallel, as a few large matmuls that keep the chip busy. Output tokens are generated serially, each one requiring a forward pass through the entire stack against an ever-growing cache. The decode phase is memory-bandwidth bound, not compute-bound: you're streaming the model weights plus gigabytes of K and V through the chip per token. Whole subfields (speculative decoding, multi-query attention, grouped-query attention, quantized KV caches, paged attention) exist largely to make decode cheaper. None of this is in the original paper. All of it is downstream of the cache structure the paper's math implies.

Open problems the architecture left behind

Other things the paper didn't try to solve, but that the field has been chasing since: how to do reasoning with attention rather than just associative recall (chain-of-thought, scratchpads, tool use, search-style inference compute); how to make the residual stream do something more like working memory that persists across long inputs (state-space models like Mamba are the live competition here); how to train with reinforcement signals rather than next-token prediction (RLHF, DPO, RLAIF, RL-from-verifiable-rewards in reasoning models); how to interpret what specific heads and circuits are doing in a 70B-parameter model (Anthropic's circuits work, sparse autoencoders, attribution patching).

And then there's the long-context problem. Even with FlashAttention and clever positional encodings, models still lose the thread on documents of millions of tokens. The middle of long inputs gets ignored ("lost in the middle"). Effective context is much shorter than nominal context. RAG and tool use are partially compensating, but the underlying issue has not been fundamentally solved: every token in the window competes for the same softmax weight, and nothing in the architecture guarantees that the relevant ones win as the window grows.

But the central object — tokens reading from each other through learned similarity — is still here, in every frontier model, basically unchanged. That's a rare thing for a deep learning paper from 2017 to claim. AlexNet's 2012 architecture is essentially gone, replaced by ResNets which were replaced by transformers. The 2017 transformer is still load-bearing in 2026.

And there's the still-open interpretability question, which is partly a research problem and partly a safety concern. We can train a 70-billion-parameter transformer and watch it write code, debug code, summarize a contract, plan a vacation. We cannot, in any deep sense, tell you why it makes the specific choices it makes on a specific input. We have heads we've named (induction, name-mover, successor, copy-suppression). We have circuits we've reverse-engineered (indirect object identification, modular addition). But the gap between understanding a few circuits in toy models and understanding what a frontier model is doing on a real task is still enormous. Closing that gap is one of the deepest unsolved problems the architecture leaves us with.

If you read the original

The paper is short and unusually clear. The two things to study are Figure 1 (the encoder-decoder block diagram) and Section 3.2 (the attention math). Skip Section 3.1 on the first read. You don't need the full encoder-decoder framing to understand attention itself, and most modern uses are decoder-only anyway. Section 5 on training (plus the decoding details in Section 6.1) is worth a look if you want to feel how much of the recipe was just engineering: byte-pair encoding, warmup steps, label smoothing, beam search.

Twenty lines of NumPy, then nanoGPT

The exercise that pays for itself: implement scaled dot-product attention from scratch in NumPy. Twenty lines. Then add multi-head, then add masking, then add positional encoding. By the time you've typed Q @ K.T / np.sqrt(d_k) once, transformer code stops being mystical. You can read any modern model's source and the only surprises are vocabulary choices — what they call layer norm, where they put it, which activation function, which positional scheme. The bones are the same bones.

After NumPy, the next exercise is Andrej Karpathy's nanoGPT — a few hundred lines of PyTorch that train a real (small) GPT on Shakespeare. Read every line. Type it out. Watch it train. The gap between I sort of understand attention and I can debug a transformer at 3am closes fast once you've seen the loss curve come down on a model you wrote yourself. Most of what looks like sophistication in production code (mixed precision, distributed training, fancy attention kernels) is engineering on top of the same hundred lines of math you just typed.

The last piece of advice: don't read the paper for the BLEU number. Read it as the moment a small team decided to delete recurrence from a sequence model and discovered, to almost everyone's surprise, that the resulting architecture was simpler, faster on hardware that already existed, and better at the task. Most papers are increments. This one was a bet. It paid off in a way that reshaped the field, the chips, and the products. Eight authors. Eight pages. Eight years later, every time you talk to an AI, you are talking to a child of this paper.
