Reuters-21578 Text Classification Dataset
The Reuters-21578 dataset is a test collection for text classification research. It is a multi-class, multi-label dataset: each document may be assigned to more than one category. It was expected to be superseded by RCV1. The version described here has 90 classes, 7,769 training files, and 3,019 test files, which corresponds to the ModApte split of the Reuters-21578 benchmark.
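To make the train/test split and the multi-label structure concrete, here is a minimal Python sketch. It assumes the copy of the corpus distributed with NLTK (`nltk.corpus.reuters`), whose file ids start with `training/` or `test/`. The helper names `split_fileids`, `multi_hot`, and `summarize_nltk_reuters` are hypothetical and are not part of the dataset or of NLTK.

```python
def split_fileids(fileids):
    """Partition file ids such as 'training/9865' and 'test/14826' by prefix."""
    train = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return train, test


def multi_hot(labels, classes):
    """Encode one document's label set as a 0/1 vector over `classes`."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in classes]


def summarize_nltk_reuters():
    """Return (number of classes, training files, test files).

    Assumption: nltk is installed and nltk.download("reuters") has been run.
    """
    from nltk.corpus import reuters  # imported lazily so the helpers above work without nltk

    train, test = split_fileids(reuters.fileids())
    return len(reuters.categories()), len(train), len(test)


if __name__ == "__main__":
    # Toy illustration with made-up file ids and a three-class label space.
    ids = ["training/1", "test/2", "training/3"]
    print(split_fileids(ids))
    print(multi_hot(["grain", "wheat"], ["corn", "grain", "wheat"]))
```

If the NLTK corpus is available, calling `summarize_nltk_reuters()` is one way to check the class and file counts quoted above. A multi-hot vector is only one common target encoding for multi-label classifiers; other encodings work equally well.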
The Reuters-21578 dataset was originally collected and labeled by Carnegie Group and Reuters in 1987 during the development of the CONSTRUE text classification system. It was later released by AT&T Labs Research in September 1997, with David D. Lewis as the main publisher. The related papers are:
"Automated Learning of Decision Rules for Text Categorization"
"Toward Language Independent Automated Learning of Text Categorization Models"
"TCS: A Shell for Content-Based Text Categorization"
"CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories"