Attention Is All You Need
Attention makes sequence modeling a content-addressable memory problem.
Why it matters
The Transformer replaced recurrence with self-attention, which made parallel, scalable sequence modeling the default substrate for modern AI.
Mental model
Imagine every token writing a small query to the rest of the sentence: who has information I need right now? Keys answer that query, values carry the payload, and the weighted sum becomes the token's updated state.
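One way to write this picture down is sketched here. The symbols are notation chosen for illustration: q_i is the query of token i, k_j and v_j are the key and value of token j, d_k is the key dimension, and z_i is the updated state of token i. The division by the square root of d_k follows the paper's scaled dot-product attention.

```latex
\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{m} \exp\left(q_i \cdot k_m / \sqrt{d_k}\right)},
\qquad
z_i = \sum_{j} \alpha_{ij}\, v_j
```

Each weight \alpha_{ij} says how much token i listens to token j, and the weights for a given i sum to 1.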
Core concept
Tokens route information to each other through learned query-key-value matching.
Mechanism
- Embed tokens into vectors that carry token identity and, through positional encodings, position; context from other tokens is mixed in by the attention layers that follow.
- Project each vector into query, key, and value spaces so matching and payload are separated.
- Take query-key dot products, scale them by 1/sqrt(d_k) where d_k is the key dimension, normalize them with softmax, and use the resulting weights to blend values.
- Repeat this in multiple heads so syntax, reference, locality, and long-range dependencies can be represented in parallel.
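The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the helper names (softmax, matvec, attention_head, multi_head) and the toy inputs are hypothetical, the projection matrices are fixed by hand rather than learned, the 1/sqrt(d_k) scaling follows the paper's scaled dot-product attention, and the final output projection that the paper applies after concatenating heads is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_head(X, Wq, Wk, Wv):
    """One self-attention head over token vectors X.

    Wq, Wk, Wv are illustrative projection matrices (hand-set, not learned).
    Returns (outputs, weights): one output vector and one weight row per token.
    """
    Q = [matvec(Wq, x) for x in X]  # what each token is looking for
    K = [matvec(Wk, x) for x in X]  # what each token offers for matching
    V = [matvec(Wv, x) for x in X]  # the payload each token carries
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled query-key dot products against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Blend the values with the attention weights.
        out = [sum(wj * v[i] for wj, v in zip(w, V)) for i in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


def multi_head(X, heads):
    """Run several heads and concatenate their outputs per token.

    Assumption: the output projection after concatenation is left out.
    """
    per_head = [attention_head(X, *h)[0] for h in heads]
    return [sum((head[t] for head in per_head), []) for t in range(len(X))]


# Toy input: three tokens with two-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

outputs, weights = attention_head(X, identity, identity, identity)
combined = multi_head(X, [(identity, identity, identity), (swap, swap, identity)])
```

With identity projections, each token's query matches most strongly against keys that point in a similar direction, so the first token attends more to itself and the third token than to the second. Giving each head its own projections is what lets different heads track different relationships at once.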
Frontier move
Modern frontier work asks how far this content-addressable memory can stretch: longer context, cheaper attention, multimodal tokens, and reasoning traces that use attention over intermediate work.