CNN/DailyMail News Articles Dataset
The dataset contains more than 300,000 unique news articles written by CNN and Daily Mail journalists. The current version supports both extractive and abstractive summarization, although the original version was created for machine reading comprehension and abstractive question answering. The dataset is intended to help develop models that can summarize long passages of text in one or two sentences, which is useful for presenting information from large amounts of text efficiently.
Data Fields
id: A string containing the hexadecimal SHA1 hash of the URL from which the story was retrieved
article: A string containing the body of the news article
highlights: A string containing the article highlights as written by the article author
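To make the three fields concrete, here is a minimal sketch of what one record might look like and how an `id` of this kind can be computed. The URL, the UTF-8 encoding, and the sample text are assumptions for illustration; the exact URL normalization used by the original collection scripts is not stated here.

```python
import hashlib


def story_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL string.

    Assumption: the URL is hashed as UTF-8 bytes without further normalization.
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical record with the three fields described above.
example = {
    "id": story_id("https://example.com/some-story"),  # made-up URL
    "article": "Body text of the news article ...",
    "highlights": "First highlight.\nSecond highlight.",
}

# A SHA1 digest in hexadecimal is always 40 characters long.
print(len(example["id"]))
```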
Data Splits
The CNN/DailyMail dataset has three splits: training, validation, and test. The following statistics are for version 3.0.0 of the dataset.
Dataset Split | Number of Instances in Split |
---|---|
Train | 287,113 |
Validation | 13,368 |
Test | 11,490 |
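As a quick check on the table, the sketch below sums the split sizes and prints each split's share of the total. The counts are copied from the table; the variable names are illustrative.

```python
# Split sizes for version 3.0.0, copied from the table above.
split_sizes = {
    "train": 287_113,
    "validation": 13_368,
    "test": 11_490,
}

total = sum(split_sizes.values())  # 311,971 instances overall

# Fraction of the whole dataset that falls in each split.
fractions = {name: count / total for name, count in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]:,} instances ({frac:.1%})")
```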
Dataset creation
Creation History
Version 1.0.0 was created to support supervised neural methods for machine reading and question answering using a large amount of real natural language training data. It released about 313,000 unique articles and nearly 1 million accompanying cloze-style questions. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization instead of question answering. Version 3.0.0 provides a non-anonymized version of the data, whereas the previous two versions were pre-processed to replace named entities with unique identifier tags.
Source Data
Initial data collection and normalization
The data consists of news articles and highlight sentences. In the question-answering setting, the articles are used as context and the entities in the highlight sentences are hidden one at a time, producing cloze-style questions in which the model must correctly guess which entity from the context has been hidden in the highlight. In the summarization setting, the highlight sentences are concatenated to form a summary of the article. The CNN articles were written between April 2007 and April 2015. The Daily Mail articles were written between June 2010 and April 2015.
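The two settings described above can be sketched as follows. This is a simplified illustration, not the original pipeline: it assumes the entity list is already known, uses plain substring matching, and uses `@placeholder` as an assumed mask token. The function names and the newline separator are also hypothetical.

```python
from typing import List, Tuple

PLACEHOLDER = "@placeholder"  # assumed mask token for the hidden entity


def make_cloze_questions(
    highlights: List[str], entities: List[str]
) -> List[Tuple[str, str]]:
    """Hide one entity at a time in each highlight sentence.

    Returns (question, answer) pairs. Entities are matched by plain
    substring search, which is a simplification.
    """
    questions = []
    for sentence in highlights:
        for entity in entities:
            if entity in sentence:
                # Mask only the first occurrence of this entity.
                masked = sentence.replace(entity, PLACEHOLDER, 1)
                questions.append((masked, entity))
    return questions


def make_summary(highlights: List[str], sep: str = "\n") -> str:
    """Concatenate highlight sentences into one summary string."""
    return sep.join(highlights)


highlights = ["Alice met Bob in Paris.", "Bob flew home."]
entities = ["Alice", "Bob", "Paris"]

for question, answer in make_cloze_questions(highlights, entities):
    print(question, "->", answer)
print(make_summary(highlights))
```

Each highlight yields one question per entity it contains, so a sentence that mentions three entities produces three cloze questions over the same article context.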
The code for the original data collection is available at https://github.com/deepmind/rc-data. The articles were downloaded from Wayback Machine archives of www.cnn.com and www.dailymail.co.uk. Articles exceeding 2,000 tokens were not included in the version 1.0.0 collection.
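The length cutoff mentioned above can be sketched as a simple filter. This assumes the limit counts whitespace-separated tokens, which may differ from the tokenizer actually used in the original collection; the function names are hypothetical.

```python
from typing import List

MAX_TOKENS = 2000  # length limit stated for the version 1.0.0 collection


def within_length_limit(article: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the article does not exceed the token limit.

    Assumption: tokens are approximated by splitting on whitespace.
    """
    return len(article.split()) <= max_tokens


def filter_articles(articles: List[str]) -> List[str]:
    """Keep only the articles that fit within the length limit."""
    return [a for a in articles if within_length_limit(a)]
```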