| Model | Bit per Character (BPC) | Number of params | Paper / Source | Code |
| ----- | :---------------------: | :--------------: | -------------- | :--: |
| GPT-2 (48 layers, h=1600) | 0.93 | 1542M | Language Models are Unsupervised Multitask Learners | - |
| Transformer-XL (24 layers, RMS dynamic eval, decay) | 0.940 | 277M | Dynamic Evaluation of Transformer Language Models | |
| Compressive Transformer (24 layers) | 0.97 | 277M | Compressive Transformers for Long-Range Sequence Modelling | |
| Transformer (24 layers, 8k adaptive span) | 0.98 | 209M | Adaptive Attention Span in Transformers | |
| Longformer (12 layers, h=512) | 1.00 | 41M | Longformer: The Long-Document Transformer | |
| Transformer (12 layers, 8k adaptive span) | 1.02 | 39M | Adaptive Attention Span in Transformers | |
| 12-layer Character Transformer Model | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
| SHA-LSTM (4 layers, h=1024, no attention head) | 1.33 | 51M | Single Headed Attention RNN: Stop Thinking With Your Head | |
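
The second column reports bits per character (BPC), i.e. the model's average per-character cross-entropy expressed in base 2. As a quick reference, the snippet below is a minimal sketch (not taken from any of the papers above; the function name and numbers are illustrative) of converting a summed negative log-likelihood in nats into BPC:

```python
import math

def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a test set
    into bits per character (BPC), the metric reported in the table."""
    return total_nll_nats / (num_characters * math.log(2))

# Illustrative example: an average cross-entropy of 0.68 nats/char
# corresponds to 0.68 / ln(2) ≈ 0.98 BPC.
print(round(bits_per_character(0.68 * 1_000_000, 1_000_000), 2))
```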