| Model | Params | Test perplexity | Validation perplexity | Paper / Source | Code |
| ----- | ------ | --------------- | ---------------------- | -------------- | ---- |
| Mogrifier LSTM + dynamic eval | 24M | 44.9 | 44.8 | Mogrifier LSTM | |
| adversarial + AWD-LSTM-MoS + dynamic eval | 22M | 46.01 | 46.63 | Improving Neural Language Modeling via Adversarial Training | |
| GL-LWGC + AWD-MoS-LSTM + dynamic eval | 26M | 46.34 | 46.64 | Gradual Learning of Recurrent Neural Networks | |
| FRAGE + AWD-LSTM-MoS + dynamic eval | 22M | 46.54 | 47.38 | FRAGE: Frequency-Agnostic Word Representation | |
| Past Decode Regularization + AWD-LSTM-MoS + dynamic eval | 22M | 47.3 | 48.0 | Improved Language Modeling by Decoding the Past | |
| AWD-LSTM + dynamic eval | 24M | 51.1 | 51.6 | Dynamic Evaluation of Neural Sequence Models | |
| AWD-LSTM-DOC + Partial Shuffle | 23M | 52.0 | 53.79 | Partially Shuffling the Training Data to Improve Language Models | |
| AWD-LSTM + continuous cache pointer | 24M | 52.8 | 53.9 | Regularizing and Optimizing LSTM Language Models | |
| AWD-LSTM-MoS + Partial Shuffle | 22M | 53.92 | 55.89 | Partially Shuffling the Training Data to Improve Language Models | |
| Trellis Network | - | 54.19 | - | Trellis Networks for Sequence Modeling | |
| 2-layer skip-LSTM + dropout tuning | 24M | 55.3 | 57.1 | Pushing the bounds of dropout | |
| Recurrent highway networks | 23M | 65.4 | 67.9 | Recurrent Highway Networks | |
| Inan et al. (2016) - Variational RHN | - | 66.0 | 68.1 | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | |
| Gal & Ghahramani (2016) - Variational LSTM (medium) | - | 79.7 | 81.9 | A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | |
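
Many of the strongest entries above share one ingredient: dynamic evaluation (Krause et al., 2018), which keeps updating the model's parameters on the test stream with gradient steps while perplexity is being measured, so the model adapts to local topic and style. Below is a minimal sketch of the idea in PyTorch. It is not any of the cited codebases: the segment length, the learning rate, and the plain SGD update are illustrative simplifications (the paper uses an RMS-normalized update with decay back toward the trained parameters), and `TinyLM` exists only to make the example self-contained.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in word-level LSTM LM, only here to make the sketch runnable."""
    def __init__(self, vocab: int = 100, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x, state=None):
        h, state = self.rnn(self.emb(x), state)
        return self.out(h), state

def dynamic_eval(model, tokens, bptt=70, lr=1e-4):
    """Score a 1-D token stream while adapting the model online."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_tok, state, n = 0.0, 0, None, tokens.numel()
    for i in range(0, n - 1, bptt):
        length = min(bptt, n - 1 - i)
        seq = tokens[i:i + length].unsqueeze(1)       # (seq_len, batch=1)
        tgt = tokens[i + 1:i + 1 + length]
        logits, state = model(seq, state)
        state = tuple(s.detach() for s in state)      # truncate backprop
        loss = loss_fn(logits.squeeze(1), tgt)
        total_loss += loss.item()                     # score the segment first...
        total_tok += tgt.numel()
        opt.zero_grad()
        loss.backward()                               # ...then adapt to it
        opt.step()
    return float(torch.exp(torch.tensor(total_loss / total_tok)))

stream = torch.randint(100, (500,))                   # fake test stream
print(dynamic_eval(TinyLM(), stream))
```

Note that each segment is scored before the update, so the reported perplexity is still a fair next-token prediction measure.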
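
The table's best entry changes the recurrent cell itself: the Mogrifier (Melis et al., 2020) lets the input and the previous hidden state gate each other for a few alternating rounds before the ordinary LSTM update. A sketch of that gating, assuming illustrative dimensions and the paper's default of five rounds:

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    """Alternating mutual gating of input x and previous state h,
    as described in 'Mogrifier LSTM'. Sizes here are illustrative."""
    def __init__(self, x_dim: int, h_dim: int, rounds: int = 5):
        super().__init__()
        self.rounds = rounds
        # Odd rounds gate x using h (matrices Q); even rounds gate h using x (R).
        self.q = nn.ModuleList([nn.Linear(h_dim, x_dim, bias=False)
                                for _ in range((rounds + 1) // 2)])
        self.r = nn.ModuleList([nn.Linear(x_dim, h_dim, bias=False)
                                for _ in range(rounds // 2)])

    def forward(self, x, h):
        for i in range(1, self.rounds + 1):
            if i % 2:                                  # rounds 1, 3, 5, ...
                x = 2 * torch.sigmoid(self.q[i // 2](h)) * x
            else:                                      # rounds 2, 4, ...
                h = 2 * torch.sigmoid(self.r[i // 2 - 1](x)) * h
        return x, h

# The gated pair then feeds a standard LSTM cell.
mog, cell = Mogrifier(32, 64), nn.LSTMCell(32, 64)
x = torch.randn(8, 32)
h, c = torch.randn(8, 64), torch.randn(8, 64)
x, h = mog(x, h)
h, c = cell(x, (h, c))
```

The factor of 2 keeps the expected scale of the gated vectors roughly unchanged, since a sigmoid averages around 0.5.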
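
Partial Shuffle (Press, 2019), used by two entries, is a data-side trick rather than a model change: after the corpus is batchified into rows, each row is rotated by a random offset at the start of every epoch, so the batches differ across epochs while local word order stays intact. A minimal version, with illustrative names:

```python
import torch

def partial_shuffle(batched: torch.Tensor) -> torch.Tensor:
    """Rotate every row of a (num_rows, row_len) batchified corpus by a
    random offset; call once per epoch before slicing into BPTT chunks."""
    rows, length = batched.shape
    offsets = torch.randint(length, (rows,))
    return torch.stack([torch.roll(row, -int(k))
                        for row, k in zip(batched, offsets)])

corpus = torch.arange(20).reshape(4, 5)   # toy batchified corpus
print(partial_shuffle(corpus))
```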