NeurIPS 2017 · seq2seq · attention-only architecture

Attention Is All You Need

The Transformer did not just improve translation. It replaced step-by-step recurrence with a model where every token can look everywhere at once — which is why modern language models scale the way they do.

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · Google Brain & Google Research · 2017

Transformer Self-attention Parallel training Machine translation

TL;DR

The short version

The paper asks a brutal question: if attention already lets one token directly reach any other token, why keep recurrence or convolution at all? The answer was the Transformer — an encoder-decoder built from self-attention, cross-attention, feed-forward layers, residual connections, and positional encodings. It set new translation records while training in a much more parallel way than RNNs.

28.4 BLEU on WMT14 EN→DE (Transformer big)

41.8 BLEU on WMT14 EN→FR (Transformer big)

1 step: maximum path length between two positions in self-attention

8 × P100: hardware used to train the big model in 3.5 days

Problem

Why recurrence became the bottleneck

RNNs process token 1, then token 2, then token 3, and so on. That sequential chain creates two problems at once:

Poor parallelism. You cannot compute the hidden state at step t before steps 1…t−1 finish.
Long dependency paths. If word 1 needs to influence word 20, the signal has to survive many hops.

Convolutions improve parallelism, but distant tokens still need multiple layers before they can interact. Self-attention cuts that path to one hop.

RNN

Good sequence bias, but inherently sequential.

CNN seq2seq

More parallel, but information still travels layer by layer.

Self-attention

Any token can immediately reference any other token.

Method

Transformer anatomy, step by step

Think of the model as a pipeline in which each stage plays a distinct role.

Encoder ×6

Input + positional encoding

Multi-head self-attention

Position-wise feed-forward

Residual + LayerNorm around each sub-layer
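
The "Residual + LayerNorm" wrapper can be sketched in a few lines. This is a minimal illustration, assuming the post-norm form LayerNorm(x + Sublayer(x)) that the paper describes. The learned gain and bias of layer normalization are left out, and `toy_sublayer` and the `eps` value are placeholders chosen for this example.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one position vector to zero mean and unit variance.
    The learned gain and bias are omitted (an assumption for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position vector."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

The residual path lets the input pass through unchanged even if the sub-layer contributes little, which is one common explanation for why deep stacks of these blocks stay trainable.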

⇄

Decoder ×6

Masked self-attention

Cross-attention into encoder output

Position-wise feed-forward

Residual + LayerNorm around each sub-layer

Linear + softmax for next-token probabilities (applied once, after the final decoder layer)
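
The masking in the decoder's self-attention can be sketched as follows. This is an illustrative example, not the paper's code: future positions get a score of negative infinity before the softmax, so they end up with zero weight. The `scores` values are made up.

```python
import math

NEG_INF = float("-inf")

def causal_mask(scores):
    """Set scores[i][j] to -inf wherever j > i (a future position)."""
    return [[s if j <= i else NEG_INF for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 3-token target sequence.
scores = [[0.2, 1.5, -0.3],
          [0.7, 0.1, 2.0],
          [1.0, 1.0, 1.0]]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second to the first two, and so on. That is what keeps the decoder from peeking at the tokens it is supposed to predict.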

Math

The core equations

Scaled dot-product attention

\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

Q are queries, K are keys, and V are values. QKᵀ scores how aligned each query is with every key. Dividing by √d_k prevents large dot products from saturating the softmax.
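
As a rough sketch of the formula above, here is scaled dot-product attention in plain Python on lists of row vectors. It is dependency-free for readability rather than speed, and the toy `Q`, `K`, `V` values are invented for illustration.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # One score per key: how aligned this query is with that key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (made up): 2 queries, 3 key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.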

Multi-head attention

\operatorname{MultiHead}(Q,K,V)=\operatorname{Concat}(\text{head}_1,\dots,\text{head}_h)W^O

\text{head}_i=\operatorname{Attention}(QW_i^Q,KW_i^K,VW_i^V)

Instead of one huge lookup, the model performs several smaller ones in parallel. Different heads can specialize in different patterns.
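
A small sketch of the two equations above, written for self-attention where Q, K, and V all come from the same input `X`. The sizes (`n`, `d_model`, `h`) and the random weights are toy assumptions. The split d_k = d_model / h follows the paper's convention, and helper names such as `rand_matrix` are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    # Random stand-ins for learned projection matrices.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs along the feature axis.
    concat = [[value for head in head_outputs for value in head[i]]
              for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, h = 4, 8, 2          # toy sizes, far smaller than the paper's
d_k = d_model // h               # per-head width
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(h)]
W_O = rand_matrix(h * d_k, d_model, rng)

Y = multi_head(X, heads, W_O)
print(len(Y), len(Y[0]))
```

Because each head works in a d_k-dimensional subspace, the total cost stays close to that of one full-width attention while allowing several independent attention patterns.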

Position-wise feed-forward

\operatorname{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2

This is the same two-layer MLP applied independently at every position. Attention mixes information across positions; the FFN transforms each position locally.
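
To make "applied independently at every position" concrete, here is a tiny sketch of the FFN formula. The weights are hand-picked toy values, with d_model = 2 and d_ff = 3 instead of the paper's 512 and 2048.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy parameters (made up): W1 is d_model x d_ff, W2 is d_ff x d_model.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

# The same weights are reused at every position; positions never interact here.
sequence = [[1.0, 2.0], [2.0, -1.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)  # [[1.0, 1.0], [4.0, 2.0]]
```

Reordering `sequence` would simply reorder `outputs`, which is the sense in which the FFN is local to each position.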

Sinusoidal positional encoding

PE_{(pos,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),\quad PE_{(pos,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

Because attention alone has no idea which token came first, position gets added to the embedding using sinusoids at many frequencies.
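
A short sketch of the sinusoidal table defined above. The loop variable `i` steps over the even dimension indices, so it plays the role of 2i in the formula. The table size used here is arbitrary.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):       # i is the even index "2i"
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=8)
print(pe[0])   # position 0: sines are 0, cosines are 1
print([round(v, 3) for v in pe[24]])
```

Low dimensions change quickly from one position to the next, while high dimensions change slowly, so each position gets a distinct pattern across the vector.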

Why it scales

Parallelism vs path length explorer

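The paper compares self-attention to recurrent and convolutional layers along three axes: computational complexity per layer, the number of sequential operations, and the maximum path length between any two positions. Self-attention needs a constant number of sequential operations and connects any two positions in one hop, while a recurrent layer needs a number of sequential steps and hops that grows with sequence length.

Example setting: sequence length 64, model width 512.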
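
The comparison can be sketched numerically. The orders of growth below follow the paper's layer-comparison table as I understand it: O(n²·d) for self-attention, O(n·d²) for recurrence, and O(k·n·d²) for convolution. Constants are dropped, so treat the numbers as relative scale, not FLOP counts. The kernel width `k=3` is an assumption made for this example, and the logarithmic path length for convolution assumes dilated convolutions.

```python
def log_k_ceil(n, k):
    """Smallest number of layers whose receptive field k**layers covers n."""
    hops, reach = 0, 1
    while reach < n:
        reach *= k
        hops += 1
    return hops

def layer_costs(n, d, k=3):
    """Order-of-growth figures for one layer; constants are dropped."""
    return {
        "self-attention": {"ops_per_layer": n * n * d,
                           "sequential_ops": 1,
                           "max_path_length": 1},
        "recurrent": {"ops_per_layer": n * d * d,
                      "sequential_ops": n,
                      "max_path_length": n},
        "convolutional": {"ops_per_layer": k * n * d * d,
                          "sequential_ops": 1,
                          "max_path_length": log_k_ceil(n, k)},
    }

costs = layer_costs(n=64, d=512)
for name, row in costs.items():
    print(f"{name:>15}: {row}")
```

With n = 64 and d = 512, self-attention is cheaper per layer than recurrence because n is smaller than d. The balance flips once sequences grow longer than the model width.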

Order information

Positional encoding explorer

The Transformer adds position information to token embeddings instead of learning sequence order through recurrence. Each pair of embedding dimensions oscillates at its own frequency, with wavelengths forming a geometric progression from 2π to 10000·2π.
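
To see why order has to be injected, here is a small experiment with a stripped-down self-attention layer. It assumes Q = K = V = X, with no learned projections and no positional encoding. Shuffling the input rows just shuffles the output rows in the same way, so the layer by itself cannot tell one ordering from another. The vectors and the permutation are arbitrary toy values.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X and no position information."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, X))
                    for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]                      # new slot i holds old token perm[i]
X_shuffled = [X[p] for p in perm]

Y = self_attention(X)
Y_shuffled = self_attention(X_shuffled)

# Each token's output follows it to its new slot, up to float rounding.
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(Y_shuffled[i], Y[p])
)
print(same_up_to_order)
```

Adding a position-dependent vector to each row of `X` before the call breaks this symmetry, which is the job the positional encoding does.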

Results

What the paper actually achieved

The headline translation numbers reported in the paper are 28.4 BLEU on WMT14 EN→DE and 41.8 BLEU on WMT14 EN→FR, both for the big Transformer. It beat the strongest previously published results on both WMT14 benchmarks.

Training recipe highlights

Base model: 6 encoder + 6 decoder layers, d_model=512, h=8, d_ff=2048.
Optimizer: Adam with β₁=0.9, β₂=0.98, ε=1e-9, plus 4000-step warmup.
Regularization: dropout and label smoothing ε=0.1.
Big model: ~213M parameters, trained for 3.5 days on 8 P100 GPUs.
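
The 4000-step warmup refers to a learning-rate schedule that rises linearly and then decays with the inverse square root of the step count. The sketch below uses the formula as I recall it from the paper, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). This page does not spell the formula out, so treat it as a hedged reconstruction. The function name is hypothetical.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

The rate peaks at the end of warmup (about 7e-4 for d_model = 512) and halves each time the step count quadruples after that.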

Plain-language intuition

Why the idea clicked

RNNs are a game of telephone

A message passes guest to guest. Every handoff risks losing detail, and nobody can skip ahead.

Attention is a room-wide lookup

Every token can directly ask every other token: “Who matters for me right now?”

Multi-head means multiple specialists

Instead of one monolithic view, several heads can focus on different relationships in parallel.

Caveats

Limitations and open questions

Quadratic attention cost. Full self-attention needs work and memory proportional to sequence length squared.
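Order is injected, not innate. Positional encodings are necessary because self-attention on its own ignores token order: shuffle the inputs and the outputs are shuffled the same way (strictly, it is permutation-equivariant).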
The 2017 paper focused on translation and parsing. The later explosion into LLMs, vision, audio, and multimodal models came after this result.
Interpretability is suggestive, not complete. Attention maps can be useful, but they are not a full explanation of model behavior.
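
A back-of-the-envelope look at the quadratic cost in the first caveat. The sketch assumes the full n × n score matrix is materialized in 32-bit floats, counted per head, per layer, and per sequence. Real implementations may avoid storing it in full, so read this as an illustration of the growth rate only.

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    """Memory for one full n x n attention score matrix."""
    return n * n * bytes_per_score

for n in (512, 4096, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {n * n:>13,} scores, {mib:>5,.0f} MiB per head per layer")
```

Making the sequence 8 times longer makes the score matrix 64 times larger, which is why long-context variants look for ways around full attention.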

Takeaways

What to remember

1. Attention can replace recurrence and convolution for sequence transduction.

2. Short path length and parallel training were the strategic win.

3. Scaled dot-product attention, multi-head projections, and positional encodings became core building blocks for modern deep learning.

4. This paper is the bridge between classic machine translation and the modern LLM era.

Links & citation

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.