Dataset creation

Creation History

Version 1.0.0 aimed to support supervised neural methods for machine reading and question answering with large amounts of real natural-language training data. It released about 313,000 unique articles and nearly 1 million cloze questions accompanying those articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization instead of question answering. Version 3.0.0 provides a non-anonymized version of the data, whereas the previous two versions were pre-processed to replace named entities with unique identifier tags.

Source Data

Initial data collection and normalization

The data consists of news articles and highlight sentences. In the question-answering setting, the articles are used as context and the entities in the highlight sentences are hidden one at a time, producing cloze-style questions in which the model must guess which entity from the context has been hidden in the highlight. In the summarization setting, the highlight sentences are concatenated to form a summary of the article. The CNN articles were written between April 2007 and April 2015. The DailyMail articles were written between June 2010 and April 2015.
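The two settings described above can be sketched as follows. This is a minimal illustration, not the original collection code: the `@placeholder` mask token, the newline separator, and the function names are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed mask token, used here for illustration only.
PLACEHOLDER = "@placeholder"


def make_cloze_questions(highlight: str, entities: List[str]) -> List[Tuple[str, str]]:
    """Hide one entity at a time in a highlight sentence.

    Returns (question, answer) pairs; entities that do not occur
    in the highlight are skipped.
    """
    questions = []
    for entity in entities:
        if entity in highlight:
            # Mask only the first occurrence of the entity.
            question = highlight.replace(entity, PLACEHOLDER, 1)
            questions.append((question, entity))
    return questions


def make_summary(highlights: List[str], separator: str = "\n") -> str:
    """Concatenate highlight sentences into one summary string."""
    return separator.join(h.strip() for h in highlights)


qa_pairs = make_cloze_questions("Alice met Bob in Paris", ["Alice", "Bob", "Berlin"])
summary = make_summary(["First highlight.", "Second highlight."])
```

Each highlight yields as many questions as it has entities, which is consistent with the number of cloze questions being larger than the number of articles.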

The code for the original data collection is available at https://github.com/deepmind/rc-data. The articles were downloaded from archives of www.cnn.com and www.dailymail.co.uk on the Wayback Machine. Articles exceeding 2,000 tokens were not included in the version 1.0.0 collection.
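The length filter mentioned above might look like the sketch below. The text does not say how article length was measured, so whitespace splitting is an assumption here, and `count_tokens` and `filter_articles` are hypothetical helper names.

```python
from typing import Iterable, List

MAX_TOKENS = 2000  # threshold stated in the text


def count_tokens(text: str) -> int:
    """Count tokens by whitespace splitting (an assumption, not the original method)."""
    return len(text.split())


def filter_articles(articles: Iterable[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep only articles that do not exceed the token limit."""
    return [a for a in articles if count_tokens(a) <= max_tokens]


short_article = "word " * 2000  # exactly at the limit, kept
long_article = "word " * 2001   # one token over the limit, dropped
kept = filter_articles([short_article, long_article])
```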

Date: 2 years ago
Size: 503.3 MB
Publish URL: www.kaggle.com

Data Fields

id: A string containing the hexadecimal SHA1 hash of the URL from which the story was retrieved
article: A string containing the body of the news article
highlights: A string containing the article highlights written by the article author
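As an illustration of the three fields, the sketch below models one record and derives an `id` from a URL with Python's `hashlib`. The `Story` class, the helper names, and the example URL are hypothetical, and the exact URL normalization used when the dataset was built is not stated here, so ids computed this way are not guaranteed to match the published ones.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Story:
    id: str          # hexadecimal SHA1 hash of the source URL
    article: str     # body of the news article
    highlights: str  # article highlights


def url_to_id(url: str) -> str:
    """Return the SHA1 hash of a URL as a 40-character hex string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def make_story(url: str, article: str, highlights: str) -> Story:
    """Build a record with the same three fields as the dataset."""
    return Story(id=url_to_id(url), article=article, highlights=highlights)


story = make_story(
    "http://example.com/some-story",  # placeholder URL, not from the dataset
    "Body of the article.",
    "First highlight.\nSecond highlight.",
)
```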

Data splits

The CNN/DailyMail dataset has three splits: training, validation, and test. The table below gives the statistics for version 3.0.0 of the dataset.

Dataset split	Number of instances
Train	287,113
Validation	13,368
Test	11,490
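The relative size of each split follows directly from the counts in the table. The short sketch below totals them and computes each split's share; the variable and function names are illustrative.

```python
# Instance counts for version 3.0.0, taken from the table above.
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}


def split_fractions(sizes: dict) -> dict:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total: {total:,}")
for name, fraction in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]:,} ({fraction:.1%})")
```

The three splits together contain 311,971 instances, with roughly 92% of them in the training split.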
