HyperAI

CNN/DailyMail News Articles Dataset

Date

9 months ago

Size

503.3 MB

Organization

Kaggle

Publish URL

www.kaggle.com

The dataset contains more than 300,000 unique news articles written by CNN and Daily Mail journalists. The current version supports extractive and abstractive summarization, but the original version was created for machine reading and comprehension and abstractive question answering. The purpose of this dataset is to help develop models that can summarize long paragraphs of text in one or two sentences, a task that is very useful for efficiently presenting information from large amounts of text.

Data Fields

  • id: A string containing the hexadecimal SHA1 hash of the URL from which the story was retrieved
  • article: A string containing the body of the news article
  • highlights: A string containing the article highlights written by the article author
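
As an illustration of the id field, the following minimal sketch computes the hexadecimal SHA1 digest of a URL with Python's standard hashlib module. The URL is a made-up placeholder, and encoding the URL as UTF-8 before hashing is an assumption; the dataset description does not say how the original pipeline encoded it.

```python
import hashlib


def story_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL (UTF-8 encoding is assumed)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical URL, used only for illustration.
example_id = story_id("http://www.example.com/2015/04/01/some-story/index.html")
print(example_id, len(example_id))
```

A SHA1 digest is 160 bits long, so the resulting id is always a 40-character hexadecimal string.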

Data Splits

The CNN/DailyMail dataset is divided into three splits: training, validation, and test. The statistics below are for version 3.0.0 of the dataset.

Dataset split    Number of instances in the split
Train            287,113
Validation       13,368
Test             11,490
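
To put the split sizes in proportion, the short sketch below sums the three counts listed above and computes each split's share of the total. The variable names are illustrative only.

```python
# Split sizes for version 3.0.0, as listed above.
split_sizes = {"train": 287_113, "validation": 13_368, "test": 11_490}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three splits add up to 311,971 instances, with roughly 92% of them in the training split.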

Dataset Creation

Creation History

Version 1.0.0 aimed to support supervised neural methods for machine reading and question answering with large amounts of real natural-language training data. It released about 313,000 unique articles and nearly 1 million accompanying cloze questions. Versions 2.0.0 and 3.0.0 restructured the dataset to support summarization instead of question answering. Version 3.0.0 provides a non-anonymized version of the data, whereas the previous two versions were preprocessed to replace named entities with unique identifier tags.
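
The anonymization step mentioned above can be pictured with a small sketch. This is not the original preprocessing code: the `@entityN` tag format, the `anonymize` helper, and the assumption that the entity list is already known are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def anonymize(text: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each known entity with a unique tag; return the text and the mapping."""
    mapping: Dict[str, str] = {}
    # Process longer names first so a short name never splits a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        if entity in text:
            tag = f"@entity{len(mapping)}"  # hypothetical tag format
            mapping[entity] = tag
            text = text.replace(entity, tag)
    return text, mapping


sample = "Alice Smith met Bob in Paris. Bob thanked Alice Smith."
anonymized, mapping = anonymize(sample, ["Bob", "Alice Smith", "Paris"])
print(anonymized)
print(mapping)
```

Because every mention of the same entity maps to the same tag, a model can still track who did what without relying on knowledge about the names themselves.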

Source Data

Initial Data Collection and Normalization

The data consists of news articles and highlight sentences. In the question-answering setting, the articles are used as context and the entities in the highlight sentences are hidden one at a time, producing cloze-style questions in which the model must correctly guess which entity from the context has been hidden in the highlight. In the summarization setting, the highlight sentences are concatenated to form a summary of the article. The CNN articles were written between April 2007 and April 2015. The DailyMail articles were written between June 2010 and April 2015.
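
The two settings described above can be sketched in a few lines of Python. This is a simplified illustration, not the original generation code: the `@placeholder` mask token, the helper names, the toy sentences, and joining the summary with a single space are all assumptions.

```python
from typing import List, Tuple

PLACEHOLDER = "@placeholder"  # hypothetical mask token, chosen for illustration


def make_cloze_questions(highlights: List[str], entities: List[str]) -> List[Tuple[str, str]]:
    """Hide one entity per question and return (question, answer) pairs."""
    questions = []
    for highlight in highlights:
        for entity in entities:
            if entity in highlight:
                # Mask only the first occurrence of this entity in the highlight.
                questions.append((highlight.replace(entity, PLACEHOLDER, 1), entity))
    return questions


def make_summary(highlights: List[str]) -> str:
    """Concatenate the highlight sentences into one summary (space separator assumed)."""
    return " ".join(h.strip() for h in highlights)


highlights = [
    "Alice Smith wins the city marathon.",
    "Bob Jones finishes second behind Alice Smith.",
]
entities = ["Alice Smith", "Bob Jones"]

questions = make_cloze_questions(highlights, entities)
for question, answer in questions:
    print(question, "->", answer)
print(make_summary(highlights))
```

In this toy example the second highlight mentions two entities, so it yields two separate questions, one per hidden entity.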

The code for the original data collection is available at https://github.com/deepmind/rc-data. The articles were downloaded from the Wayback Machine archives of www.cnn.com and www.dailymail.co.uk. Articles exceeding 2,000 tokens were not included in the version 1.0.0 collection.
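
The length cutoff mentioned above can be sketched as a simple filter. How the original pipeline measured article length is not specified here, so this sketch assumes whitespace-separated tokens; the helper names are hypothetical.

```python
from typing import Iterable, List

MAX_TOKENS = 2000  # cutoff stated in the text for version 1.0.0


def count_tokens(article: str) -> int:
    """Count whitespace-separated tokens (an assumed stand-in for the real tokenizer)."""
    return len(article.split())


def filter_articles(articles: Iterable[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep only the articles that do not exceed the token cutoff."""
    return [a for a in articles if count_tokens(a) <= max_tokens]


short_article = "word " * 10
long_article = "word " * 2001
kept = filter_articles([short_article, long_article])
print(len(kept))
```

An article of exactly 2,000 tokens is kept in this sketch, since the text only excludes articles that exceed the limit.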

CNN-DailyMail-newspaper.torrent
  • CNN-DailyMail-newspaper/
    • README.md
      2.79 KB
    • README.txt
      5.57 KB
    • data/
      • newspaper.zip
        503.3 MB