[Paper Notes] Attention Is All You Need
Attention Is All You Need
Abstract
1 Introduction
2 Background
3 Model Architecture

3.1 Encoder and Decoder Stacks
3.2 Attention
3.2.1 Scaled Dot-Product Attention
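The paper defines scaled dot-product attention as \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\); the \(\sqrt{d_k}\) scaling keeps large dot products from pushing the softmax into regions with vanishing gradients. A minimal NumPy sketch (function name and masking convention are mine, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)           # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                # e.g. mask out future positions in the decoder
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (..., len_q, d_v)
```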

3.2.2 Multi-Head Attention
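Multi-head attention projects the queries, keys, and values \(h\) times with learned linear maps, runs the attention function in parallel, and concatenates the outputs:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]

The base model uses \(h = 8\) heads with \(d_k = d_v = d_\text{model}/h = 64\), so the total cost is similar to single-head attention with full dimensionality.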
3.2.3 Applications of Attention in our Model
3.3 Position-wise Feed-Forward Networks
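Each encoder and decoder layer also contains a feed-forward network applied identically and independently at every position, i.e. a two-layer MLP with a ReLU in between:

\[
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2
\]

with \(d_\text{model} = 512\) and inner dimensionality \(d_\text{ff} = 2048\) in the base model.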
3.4 Embeddings and Softmax
3.5 Positional Encoding
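Because the model has no recurrence or convolution, position information is injected by adding fixed sinusoidal encodings to the embeddings:

\[
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_\text{model}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_\text{model}}\right)
\]

A minimal NumPy sketch of the same table (assumes an even \(d_\text{model}\); the function name is mine):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal encodings: even channels use sin, odd channels use cos."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)     # wavelengths form a geometric progression
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```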
4 Why Self-Attention
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | \(O(n^2 \cdot d)\) | \(O(1)\) | \(O(1)\) |
| Recurrent | \(O(n \cdot d^2)\) | \(O(n)\) | \(O(n)\) |
| Convolutional | \(O(k \cdot n \cdot d^2)\) | \(O(1)\) | \(O(\log_k(n))\) |
| Self-Attention (restricted) | \(O(r \cdot n \cdot d)\) | \(O(1)\) | \(O(n/r)\) |
5 Training
5.1 Training Data and Batching
5.2 Hardware and Schedule
5.3 Optimizer
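Training uses Adam with \(\beta_1 = 0.9\), \(\beta_2 = 0.98\), \(\epsilon = 10^{-9}\), and a learning rate that warms up linearly for the first \(warmup\_steps = 4000\) steps and then decays with the inverse square root of the step number:

\[
lrate = d_\text{model}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)
\]

As a one-line sketch (function name is mine):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps steps, then inverse-sqrt decay; step is 1-indexed."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```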
5.4 Regularization
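Two forms of regularization are applied: residual dropout (\(P_{drop} = 0.1\) for the base model, on each sub-layer output and on the embedding-plus-positional-encoding sums) and label smoothing with \(\epsilon_{ls} = 0.1\). Below is a sketch of one common label-smoothing formulation; the paper only gives the \(\epsilon_{ls}\) value, so the exact variant here is an assumption:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Keep 1 - eps on the true class and spread eps uniformly over the other classes (assumed variant)."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (n_classes - 1)
```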
6 Results
6.1 Machine Translation
| Model | BLEU (EN-DE) | BLEU (EN-FR) | Training Cost FLOPs (EN-DE) | Training Cost FLOPs (EN-FR) |
| --- | --- | --- | --- | --- |
| ByteNet [18] | 23.75 | | | |
| Deep-Att + PosUnk [39] | | 39.2 | | \(1.0 \cdot 10^{20}\) |
| GNMT + RL [38] | 24.6 | 39.92 | \(2.3 \cdot 10^{19}\) | \(1.4 \cdot 10^{20}\) |
| ConvS2S [9] | 25.16 | 40.46 | \(9.6 \cdot 10^{18}\) | \(1.5 \cdot 10^{20}\) |
| MoE [32] | 26.03 | 40.56 | \(2.0 \cdot 10^{19}\) | \(1.2 \cdot 10^{20}\) |
| Deep-Att + PosUnk Ensemble [39] | | 40.4 | | \(8.0 \cdot 10^{20}\) |
| GNMT + RL Ensemble [38] | 26.30 | 41.16 | \(1.8 \cdot 10^{20}\) | \(1.1 \cdot 10^{21}\) |
| ConvS2S Ensemble [9] | 26.36 | 41.29 | \(7.7 \cdot 10^{19}\) | \(1.2 \cdot 10^{21}\) |
| Transformer (base model) | 27.3 | 38.1 | \(3.3 \cdot 10^{18}\) | \(3.3 \cdot 10^{18}\) |
| Transformer (big) | 28.4 | 41.8 | \(2.3 \cdot 10^{19}\) | \(2.3 \cdot 10^{19}\) |
6.2 Model Variations
| | \(N\) | \(d_\text{model}\) | \(d_\text{ff}\) | \(h\) | \(d_k\) | \(d_v\) | \(P_{drop}\) | \(\epsilon_{ls}\) | train steps | PPL (dev) | BLEU (dev) | params ×\(10^6\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
| (A) | | | | 1 | 512 | 512 | | | | 5.29 | 24.9 | |
| | | | | 4 | 128 | 128 | | | | 5.00 | 25.5 | |
| | | | | 16 | 32 | 32 | | | | 4.91 | 25.8 | |
| | | | | 32 | 16 | 16 | | | | 5.01 | 25.4 | |
| (B) | | | | | 16 | | | | | 5.16 | 25.1 | 58 |
| | | | | | 32 | | | | | 5.01 | 25.4 | 60 |
| (C) | 2 | | | | | | | | | 6.11 | 23.7 | 36 |
| | 4 | | | | | | | | | 5.19 | 25.3 | 50 |
| | 8 | | | | | | | | | 4.88 | 25.5 | 80 |
| | | 256 | | | 32 | 32 | | | | 5.75 | 24.5 | 28 |
| | | 1024 | | | 128 | 128 | | | | 4.66 | 26.0 | 168 |
| | | | 1024 | | | | | | | 5.12 | 25.4 | 53 |
| | | | 4096 | | | | | | | 4.75 | 26.2 | 90 |
| (D) | | | | | | | 0.0 | | | 5.77 | 24.6 | |
| | | | | | | | 0.2 | | | 4.95 | 25.5 | |
| | | | | | | | | 0.0 | | 4.67 | 25.3 | |
| | | | | | | | | 0.2 | | 5.47 | 25.7 | |
| (E) | positional embedding instead of sinusoids | | | | | | | | | 4.92 | 25.7 | |
| big | 6 | 1024 | 4096 | 16 | | | 0.3 | | 300K | 4.33 | 26.4 | 213 |

Unlisted values in the variant rows are identical to those of the base model.
6.3 English Constituency Parsing
| Parser | Training | WSJ 23 F1 |
| --- | --- | --- |
| Vinyals & Kaiser et al. (2014) [37] | WSJ only, discriminative | 88.3 |
| Petrov et al. (2006) [29] | WSJ only, discriminative | 90.4 |
| Zhu et al. (2013) [40] | WSJ only, discriminative | 90.4 |
| Dyer et al. (2016) [8] | WSJ only, discriminative | 91.7 |
| Transformer (4 layers) | WSJ only, discriminative | 91.3 |
| Zhu et al. (2013) [40] | semi-supervised | 91.3 |
| Huang & Harper (2009) [14] | semi-supervised | 91.3 |
| McClosky et al. (2006) [26] | semi-supervised | 92.1 |
| Vinyals & Kaiser et al. (2014) [37] | semi-supervised | 92.1 |
| Transformer (4 layers) | semi-supervised | 92.7 |
| Luong et al. (2015) [23] | multi-task | 93.0 |
| Dyer et al. (2016) [8] | generative | 93.3 |
7 Conclusion
Attention Visualizations




