| Model | Bit per Character (bpc) | Paper / Source | Code |
| ----- | :---------------------: | -------------- | ---- |
| Transformer-XL + RMS dynamic eval + decay | 1.038 | Dynamic Evaluation of Transformer Language Models | - |
| 24L Transformer + 8K adaptive span | 1.07 | Adaptive Attention Span in Transformers | - |
| All-attention network - 36 layers | 1.08 | Augmenting Self-attention with Persistent Memory | - |
| 12L Transformer + 8K adaptive span | 1.11 | Adaptive Attention Span in Transformers | - |
| All-attention network - 18 layers | 1.11 | Augmenting Self-attention with Persistent Memory | - |
| 64-layer Character Transformer Model | 1.13 | Character-Level Language Modeling with Deeper Self-Attention | - |
| 12-layer Character Transformer Model | 1.18 | Character-Level Language Modeling with Deeper Self-Attention | - |
| Large mLSTM +emb +WN +VD | 1.27 | Multiplicative LSTM for sequence modelling | - |
| td-LSTM (Zhang et al., 2016) | 1.63 | Architectural Complexity Measures of Recurrent Neural Networks | - |
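
The bpc column reports bits per character, i.e. the model's average negative log-likelihood per character expressed in base 2. As a minimal sketch of the conversion (an illustration, not code from any of the papers above), a cross-entropy loss measured in nats per character is divided by ln 2:

```python
import math

def bits_per_character(nll_nats_per_char: float) -> float:
    """Convert mean negative log-likelihood in nats per character
    (e.g. a cross-entropy loss) to bits per character: bpc = NLL / ln 2."""
    return nll_nats_per_char / math.log(2)

# Illustrative only: a loss of ~0.72 nats/char corresponds to roughly 1.04 bpc.
print(round(bits_per_character(0.72), 2))
```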