
WikiText Long Term Dependency Language Modeling Dataset

The WikiText long-term dependency language modeling dataset contains over 100 million English words, drawn from Wikipedia's verified Good and Featured articles.

The dataset is released in two versions: WikiText-2 and WikiText-103. Compared with the Penn Treebank (PTB), it has a much larger vocabulary, and it retains the original full articles rather than isolated sentences, which makes it well suited to models that need to capture long-term dependencies in natural language.
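For reference, below is a minimal loading sketch using the Hugging Face `datasets` library, assuming its hub mirror of WikiText is acceptable instead of the torrent listed further down; the configuration names match the archive names in this page's file list.

```python
# Minimal sketch: load WikiText-2 (or WikiText-103) from the Hugging Face hub mirror.
# Assumes `pip install datasets`; this is not the HyperAI torrent distribution itself.
from datasets import load_dataset

# Swap in "wikitext-103-v1" (or the *-raw-v1 variants) for the larger / raw versions.
wikitext2 = load_dataset("wikitext", "wikitext-2-v1")

print(wikitext2)                       # train / validation / test splits
print(wikitext2["train"][10]["text"])  # one pre-tokenized line of text
```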

The dataset was released by Salesforce Research in 2016; its main authors are Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. The related paper is "Pointer Sentinel Mixture Models".

WikiText Long Term Dependency Language Modeling Dataset.torrent
  • WikiText Long Term Dependency Language Modeling Dataset/
    • README.md (1.46 KB)
    • README.txt (2.92 KB)
    • data/
      • wikitext-103-raw-v1.zip (183.09 MB)
      • wikitext-103-v1.zip (364.51 MB)
      • wikitext-2-raw-v1.zip (369.01 MB)
      • wikitext-2-v1.zip (373.28 MB)
      • 新建文本文档.txt (373.28 MB)
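If the torrent archives above are used directly, the raw text can be read straight from a downloaded zip without unpacking it first. The sketch below assumes the member path `wikitext-2-raw-v1/wiki.train.raw` inside the archive; that layout is not confirmed by this page, so inspect `namelist()` before relying on it.

```python
# Sketch: stream a few lines from the downloaded wikitext-2-raw-v1.zip archive.
# The internal member path is an assumption about the archive layout.
import zipfile

with zipfile.ZipFile("data/wikitext-2-raw-v1.zip") as zf:
    print(zf.namelist())  # check the actual layout of the archive first
    with zf.open("wikitext-2-raw-v1/wiki.train.raw") as f:
        for i, raw_line in enumerate(f):
            print(raw_line.decode("utf-8").rstrip())
            if i >= 4:  # show only the first few lines
                break
```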