| Model | Bit per Character (BPC) | Paper / Source | Code |
| ----- | :---------------------: | -------------- | ---- |
| All-attention network - 36 layers | 1.08 | Augmenting Self-attention with Persistent Memory | |
| 12L Transformer + 8K adaptive span | 1.11 | Adaptive Attention Span in Transformers | |
| 12-layer Character Transformer Model | 1.18 | Character-Level Language Modeling with Deeper Self-Attention | |
| All-attention network - 18 layers | 1.11 | Augmenting Self-attention with Persistent Memory | |
| td-LSTM (Zhang et al., 2016) | 1.63 | Architectural Complexity Measures of Recurrent Neural Networks | - |
| Large mLSTM +emb +WN +VD | 1.27 | Multiplicative LSTM for sequence modelling | - |
| Transformer-XL + RMS dynamic eval + decay | 1.038 | Dynamic Evaluation of Transformer Language Models | |
| 24L Transformer + 8K adaptive span | 1.07 | Adaptive Attention Span in Transformers | |
| 64-layer Character Transformer Model | 1.13 | Character-Level Language Modeling with Deeper Self-Attention | |
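For reference, BPC is the model's average negative log-likelihood per character expressed in base 2 (lower is better). A minimal sketch of the conversion from a per-character cross-entropy loss reported in nats, as most frameworks compute it; the function name and example value are illustrative, not taken from any of the papers above:

```python
import math

def bits_per_character(nll_nats_per_char: float) -> float:
    """Convert an average per-character negative log-likelihood
    (in nats, as returned by a typical cross-entropy loss) to BPC."""
    return nll_nats_per_char / math.log(2)

# Example: a per-character loss of 0.72 nats corresponds to roughly 1.04 BPC,
# in the range of the best entries in the table above.
print(round(bits_per_character(0.72), 3))
```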