Model | Parameters | Test perplexity | Validation perplexity | Paper | Code
--- | --- | --- | --- | --- | ---
adversarial + AWD-LSTM-MoS + dynamic eval | 35M | 38.65 | 40.27 | Improving Neural Language Modeling via Adversarial Training | |
GPT-2 (fine-tuned) | 1542M | 15.17 | 15.69 | Hydra: A System for Large Multi-Model Deep Learning | |
AWD-LSTM + dynamic eval | 33M | 44.3 | 46.4 | Dynamic Evaluation of Neural Sequence Models | |
Grave et al. (2016) - LSTM | - | 99.3 | - | Improving Neural Language Models with a Continuous Cache | |
SparseGPT (175B, 50% Sparsity) | - | 8.21 | - | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | |
FRAGE + AWD-LSTM-MoS + dynamic eval | 35M | 39.14 | 40.85 | FRAGE: Frequency-Agnostic Word Representation | |
GL-LWGC + AWD-MoS-LSTM + dynamic eval | 38M | 40.46 | 42.19 | Gradual Learning of Recurrent Neural Networks | |
AWD-FWM Schlag et al. (2020) | 37M | 61.65 | 54.48 | Learning Associative Inference Using Fast Weight Memory | |
AWD-LSTM + continuous cache pointer | 33M | 52.0 | 53.8 | Regularizing and Optimizing LSTM Language Models | |
Melis et al. (2017) - 1-layer LSTM (tied) | 24M | 65.9 | 69.3 | On the State of the Art of Evaluation in Neural Language Models | |
Inan et al. (2016) - Variational LSTM (tied) (h=650) | - | 87.7 | 92.3 | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | |
AWD-LSTM-MoS + dynamic eval | 35M | 40.68 | 42.41 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | |