Attention Is All You Need
TL;DR
The Transformer replaces the recurrence and convolutions of earlier sequence models with attention alone. By letting every position attend directly to every other position, it removes the step-by-step bottleneck of RNNs and trains with far more parallelism. It set new state-of-the-art translation scores (28.4 BLEU on WMT 2014 English→German and 41.8 BLEU on English→French) at a fraction of the previous training cost.
Problem & Motivation
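Before this paper, the best sequence-transduction (e.g. translation) models were recurrent encoder–decoders, often with attention bolted on. An RNN reads a sentence one token at a time, carrying a hidden state forward:
\[ h_t = f(h_{t-1},\, x_t) \]
where \(x_t\) is the input at position \(t\) and \(h_t\) is the hidden state after reading it.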
This sequential dependency is the core problem. To compute the state at position t you must finish positions 1…t−1 first, so:
- Training can't be parallelized within a sequence, a serious cost on long sentences and large datasets.
- Long-range dependencies are hard — information between two distant words must survive many recurrent steps, and the path length between them grows with distance.
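The sequential dependency described above can be made concrete with a toy recurrent pass. This is a minimal sketch, assuming a plain tanh cell; the function name, weight shapes, and the tanh nonlinearity are illustrative choices, not the specific recurrent models the paper compares against.

```python
import numpy as np

def rnn_forward(x, W_x, W_h, h0):
    """Toy recurrent pass: h_t = tanh(x_t W_x + h_{t-1} W_h) (assumed cell)."""
    h = h0
    states = []
    for x_t in x:                       # step t cannot start before step t-1 ends
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 5, 3, 4                  # hypothetical toy sizes
x = rng.normal(size=(n, d_in))
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))
states = rnn_forward(x, W_x, W_h, np.zeros(d_h))
```

The Python loop over positions is the point: information from the first token reaches the last state only by passing through every intermediate step.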
Convolutional models (ByteNet, ConvS2S) parallelize better but still need a number of layers that grows with distance to connect far-apart tokens. The paper asks: what if attention — which already connects any two positions in one step — were the entire model?
Key Concepts
- Attention (Query, Key, Value): a soft, content-based lookup. Each query is compared to every key to produce weights; the output is the weighted sum of the corresponding values. Think of a soft dictionary: the query asks "what am I looking for?", keys advertise "what I contain", values are "what I'll hand back".
- Self-attention: queries, keys, and values all come from the same sequence. Every token builds its new representation by looking at all tokens (including itself) in a single step, so the path between any two positions has length 1.
- Cross-attention: queries come from the decoder; keys and values come from the encoder. This is how the decoder consults the source sentence while generating the translation.
- Multi-head attention: run several attention functions ("heads") in parallel in different learned subspaces, then concatenate. Different heads can specialize; for example, one may track syntax while another tracks coreference.
- Positional encoding: self-attention by itself has no inherent sense of order (permuting the input tokens simply permutes the outputs), so the model injects position information by adding fixed sinusoidal vectors to the token embeddings.
- Masking (causal): in the decoder's self-attention, future positions are masked out so prediction of token i can depend only on tokens < i, preserving the autoregressive property.
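Causal masking can be sketched in a few lines. The example below assumes the usual setup in which decoder inputs are shifted right by one position, so letting position i attend to positions j ≤ i of the shifted input means its prediction depends only on earlier output tokens. The helper names are hypothetical, and the all-zero scores are just a convenient illustration.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask: True where query i may attend to key j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax after setting disallowed positions to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                               # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))               # uniform raw scores for illustration
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, row i spreads its weight evenly over the first i+1 positions and gives exactly zero weight to everything after it.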
The Method
The Transformer is an encoder–decoder stack. The encoder maps the source sequence to a set of continuous representations; the decoder generates the target sequence one token at a time, attending both to its own past outputs and to the encoder's output. The base model uses 6 identical layers in each of the encoder and decoder.
The pieces of each layer
- Scaled dot-product attention: compare queries and keys by dot product, scale, softmax, weight the values. The whole attention map is a few big matrix multiplies, which parallelize extremely well.
- Multi-head (h = 8): project into 8 subspaces of size 64, attend in each, concatenate. Captures several relationship types at once.
- Position-wise FFN: a 2-layer MLP applied identically to each position: d_model=512 → d_ff=2048 → 512 with ReLU. Adds nonlinear capacity.
- Residual + LayerNorm: every sub-layer outputs LayerNorm(x + Sublayer(x)), which stabilizes and speeds up training of the deep stack.
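The feed-forward and residual pieces can be sketched together as one sub-layer. This is a rough illustration under stated assumptions: the layer norm omits the learned gain and bias, the 0.02 initialization scale and the eps value are arbitrary, and all function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance
    (learned gain/bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    """Residual wrapper: LayerNorm(x + fn(x))."""
    return layer_norm(x + fn(x))

rng = np.random.default_rng(0)
n, d_model, d_ff = 10, 512, 2048
x = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # assumed init scale
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
out = sublayer(x, lambda t: ffn(t, W1, b1, W2, b2))
```

Because `ffn` only uses matrix products along the last axis, each of the `n` positions is transformed independently with the same weights, which is what "position-wise" means here.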
Input/output tokens are embedded into \(d_{\text{model}}=512\)-dim vectors; sinusoidal positional encodings are added so the model knows token order. The same weight matrix is shared between the two embedding layers and the pre-softmax projection.
Math Walkthrough
Scaled dot-product attention
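Pack the queries, keys, and values into matrices \(Q\in\mathbb{R}^{n\times d_k}\), \(K\in\mathbb{R}^{m\times d_k}\), \(V\in\mathbb{R}^{m\times d_v}\). Attention is:
\[ \operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]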
- \(QK^{\top}\) — every query dotted with every key → an \(n\times m\) grid of raw similarity scores.
- \(\tfrac{1}{\sqrt{d_k}}\) — the scaling. For large \(d_k\) the dot products grow large in magnitude, pushing softmax into regions with tiny gradients; dividing by \(\sqrt{d_k}\) counteracts this.
- \(\operatorname{softmax}(\cdot)\) — turn each query's row of scores into attention weights that sum to 1.
- Multiplying by \(V\) returns, for each query, the weighted average of the values.
Here \(n\) is the number of queries, \(m\) the number of key/value pairs; \(d_k,d_v\) are the key and value dimensions.
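The four steps listed above (dot products, scaling, softmax, weighting the values) map directly onto a few NumPy lines. This is a minimal, unbatched sketch without masking or dropout; the function names and the random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, m) scaled similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_k, d_v = 3, 5, 64, 64
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is one query's distribution over the `m` keys, and the matching row of `out` is the corresponding weighted average of the value vectors.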
Multi-head attention
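Instead of one attention with full-size vectors, use \(h\) heads on learned linear projections, then concatenate and project back:
\[ \operatorname{MultiHead}(Q,K,V) = \operatorname{Concat}(\text{head}_1,\dots,\text{head}_h)\,W^{O}, \qquad \text{head}_i = \operatorname{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}) \]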
with \(W_i^{Q},W_i^{K}\in\mathbb{R}^{d_{\text{model}}\times d_k}\), \(W_i^{V}\in\mathbb{R}^{d_{\text{model}}\times d_v}\), and \(W^{O}\in\mathbb{R}^{hd_v\times d_{\text{model}}}\). The paper uses \(h=8\) with \(d_k=d_v=d_{\text{model}}/h=64\), so the total cost is similar to a single full-dimension head.
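A loop-based sketch of multi-head self-attention with the dimensions quoted above may help make the shapes concrete. It is written for readability rather than speed (real implementations batch the heads into one tensor operation); the stacked weight layout, the 0.02 initialization scale, and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Xq, Xkv, Wq, Wk, Wv, Wo):
    """Wq, Wk: (h, d_model, d_k); Wv: (h, d_model, d_v); Wo: (h*d_v, d_model)."""
    heads = [attention(Xq @ Wq[i], Xkv @ Wk[i], Xkv @ Wv[i])
             for i in range(Wq.shape[0])]
    return np.concatenate(heads, axis=-1) @ Wo   # concat heads, project back

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_v = d_model // h                         # 64 per head
X = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(h, d_model, d_k)) * 0.02   # assumed init scale
Wk = rng.normal(size=(h, d_model, d_k)) * 0.02
Wv = rng.normal(size=(h, d_model, d_v)) * 0.02
Wo = rng.normal(size=(h * d_v, d_model)) * 0.02
out = multi_head_attention(X, X, Wq, Wk, Wv, Wo)  # self-attention: Xq == Xkv
```

With these sizes the four projection tensors together hold 4 × 512 × 512 weights, the same as a single head with full-width projections, which is the sense in which the total cost stays similar.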
Position-wise feed-forward & positional encoding
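Each layer's feed-forward network is applied to every position separately and identically:
\[ \operatorname{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 \]
The positional encoding for position \(pos\) and dimension index \(i\) is:
\[ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]
Each dimension of the positional encoding is a sinusoid of a different wavelength (geometric progression from \(2\pi\) to \(10000\cdot 2\pi\)). The authors hypothesized that this would let the model learn to attend by relative positions, since for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\).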
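The sinusoidal encoding and its linearity property can be checked numerically. The sketch below uses the sine/cosine form from the paper (sine on even dimensions, cosine on odd ones, frequency 1/10000^(2i/d_model)); the function name and the choice to test only the first frequency pair are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)

# For a fixed offset k, each (sin, cos) pair at pos+k is a rotation of the
# pair at pos. Check this for the first pair, whose frequency is 1.
k, w = 3, 1.0
M = np.array([[np.cos(w * k), -np.sin(w * k)],
              [np.sin(w * k),  np.cos(w * k)]])
shifted = pe[:-k, 0:2] @ M
```

The matrix `M` depends only on the offset `k`, not on the position, which is the linear relationship the text refers to: `shifted` reproduces `pe[k:, 0:2]`.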
Why attention beats recurrence — the complexity argument
The paper compares layer types on three axes. The decisive ones are the maximum path length between any two positions (shorter = easier long-range learning) and how much runs in parallel:
| Layer type | Complexity per layer | Sequential operations | Maximum path length |
|---|---|---|---|
| Self-attention | \(O(n^2\cdot d)\) | \(O(1)\) | \(O(1)\) |
| Recurrent | \(O(n\cdot d^2)\) | \(O(n)\) | \(O(n)\) |
| Convolutional | \(O(k\cdot n\cdot d^2)\) | \(O(1)\) | \(O(\log_k n)\) |
Self-attention connects all positions with a constant number of sequential steps and a path length of 1. When the sequence length \(n\) is smaller than the representation dimension \(d\) (the usual case in translation), it is also cheaper per layer than recurrence.
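A quick arithmetic check of the per-layer comparison, dropping constant factors: the ratio of the two leading terms is simply n/d. The sequence lengths below are arbitrary examples, and the function names are hypothetical.

```python
def self_attention_cost(n, d):
    """Leading term n^2 * d (constants ignored)."""
    return n * n * d

def recurrent_cost(n, d):
    """Leading term n * d^2 (constants ignored)."""
    return n * d * d

d = 512
for n in (50, 512, 4096):               # example sequence lengths
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(f"n={n:5d}  self-attention / recurrent = {ratio:.3f}")
```

For a 50-token sentence with d = 512, self-attention does roughly a tenth of the recurrent layer's work; the two break even at n = d, and beyond that the quadratic term dominates.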
Results
On the WMT 2014 translation benchmarks, the big Transformer beat every previously published model — including ensembles — while training in 3.5 days on 8 NVIDIA P100 GPUs, a small fraction of prior state-of-the-art training cost.
| Model | EN→DE (BLEU) | EN→FR (BLEU) |
|---|---|---|
| GNMT + RL | 24.6 | 39.92 |
| ConvS2S | 25.16 | 40.46 |
| GNMT + RL (ensemble) | 26.30 | 41.16 |
| ConvS2S (ensemble) | 26.36 | 41.29 |
| Transformer (base) | 27.3 | 38.1 |
| Transformer (big) | 28.4 | 41.8 |
The big model improved EN→DE by more than 2.0 BLEU over the best prior result. Ablations in the paper show that quality drops if there are too few or too many heads, and that learned positional embeddings perform nearly identically to the sinusoidal encodings. The architecture also generalized: applied to English constituency parsing, it outperformed all previously reported models except the Recurrent Neural Network Grammar.
Training recipe (base model)
- Adam with \(\beta_1{=}0.9,\ \beta_2{=}0.98,\ \epsilon{=}10^{-9}\); learning rate warms up for 4000 steps then decays \(\propto \text{step}^{-0.5}\).
- Regularization: residual dropout \(P_{\text{drop}}{=}0.1\) (0.3 for the big model) and label smoothing \(\epsilon_{ls}{=}0.1\) — which hurts perplexity but helps BLEU.
- Base: ~65M params, 100k steps (~12h). Big: ~213M params, 300k steps (3.5 days).
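The warmup-then-decay learning rate in the recipe above can be written as a one-line function. This sketch follows the schedule formula given in the paper, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); the function name and default arguments are illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to step^-0.5.
    Expects step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)             # the two branches meet at warmup_steps
halfway_up = transformer_lr(2000)       # linear warmup: half of the peak
later = transformer_lr(16000)           # 4x the steps -> half of the peak
```

For the base model the peak is about 7e-4, reached exactly at step 4000; after that, quadrupling the step count halves the rate.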
Intuition & Analogies
The dinner-party analogy. An RNN is a game of telephone: a message is whispered down a line of guests, degrading with every hop. Self-attention is a room where everyone can talk to everyone at once — any two guests exchange information in a single step, no matter how far apart they were seated.
Queries, keys, values as search. Each word issues a query ("I'm a pronoun — who's my antecedent?"). Every other word offers a key advertising what it is, and a value with its content. Softmax decides how much to listen to each, and the word updates itself with the blend. Multiple heads are like asking several specialists the same question in parallel — a grammar expert, a topic expert, a coreference expert — and combining their answers.
Limitations & Open Questions
- Quadratic cost in sequence length. Self-attention is \(O(n^2)\) in time and memory, which becomes the bottleneck for very long sequences (documents, high-res inputs) — spawning a whole literature on efficient/sparse attention.
- Positional encoding is a patch. Order is injected rather than intrinsic. The paper found learned and sinusoidal encodings to perform nearly identically, and how best to encode position (absolute vs. relative) remained an open question.
- Restricted/local attention not fully explored. The authors note that for long sequences one could restrict attention to a neighborhood of size \(r\); they leave this to future work.
- Scope. The paper demonstrates translation and parsing; the authors' stated plan to extend the architecture to other modalities (images, audio, video) was an aspiration in 2017, later borne out by the field.
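To give the quadratic-cost limitation above a sense of scale, here is a back-of-the-envelope estimate of the memory needed to hold one layer's attention weights for a single sequence. It assumes 8 heads, 4-byte floats, and a fully materialized n × n map per head; these are illustrative assumptions, not figures from the paper.

```python
def attention_map_bytes(n, heads=8, bytes_per_value=4):
    """Bytes for one layer's (heads, n, n) attention-weight tensor,
    one sequence, assuming float32 and a fully materialized map."""
    return heads * n * n * bytes_per_value

for n in (512, 4096, 32768):            # example sequence lengths
    mib = attention_map_bytes(n) / 2**20
    print(f"n={n:6d}  attention weights = {mib:,.0f} MiB")
```

Under these assumptions the map grows from 8 MiB at 512 tokens to 512 MiB at 4,096 tokens and 32 GiB at 32,768 tokens: every doubling of the sequence length quadruples the memory.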
Takeaways
- Attention can be the whole model. Recurrence and convolution are not required for state-of-the-art sequence transduction.
- Parallelism is the superpower. Constant sequential depth and path length make the Transformer train dramatically faster, which is what unlocked scaling.
- Three reusable ideas: scaled dot-product attention, multi-head projections, and sinusoidal positional encodings.
- It generalized far beyond translation. The Transformer became the backbone of modern NLP and, eventually, of large language models and much of deep learning.
Links & Citation
Original paper: arXiv:1706.03762
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS) 30.