Wikipedia Dataset
Date
Size
Publish URL
License
CC BY-NC-SA 3.0
Categories
Dataset Summary
The Wikipedia dataset contains cleaned articles in all languages.
This dataset is built from the Wikipedia dumps, with one subset per language, each subset containing a single split.
Each example contains the content of one complete Wikipedia article, cleaned to remove markup and unwanted sections (such as references).
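As a rough illustration of the one-subset-per-language layout, the sketch below builds a subset name in the `20231101.en` style and passes it to the Hugging Face `datasets` library. Several things here are assumptions rather than facts stated on this page: that the dataset is hosted on the Hugging Face Hub, the repository id `wikimedia/wikipedia`, the split name `train`, and the helper names `subset_name` and `load_subset`, which are hypothetical.

```python
def subset_name(dump_date: str, language: str) -> str:
    """Build a subset name such as '20231101.en' from a dump date and a language code."""
    # The dump date is expected as eight digits (YYYYMMDD), e.g. "20231101".
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"expected a YYYYMMDD dump date, got {dump_date!r}")
    if not language:
        raise ValueError("language code must not be empty")
    return f"{dump_date}.{language}"


def load_subset(dump_date: str, language: str,
                repo_id: str = "wikimedia/wikipedia",  # assumed repository id
                split: str = "train"):                 # assumed split name
    """Stream one language subset instead of downloading it in full."""
    # Imported lazily so subset_name() works without the third-party package.
    from datasets import load_dataset

    return load_dataset(repo_id, subset_name(dump_date, language),
                        split=split, streaming=True)
```

With the `datasets` package installed and network access available, `next(iter(load_subset("20231101", "en")))` would return the first article of the English subset, provided the assumed repository id and split name are correct.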
Data Visualization
See the Nomic Atlas map, which visualizes 6.4 million samples of the 20231101.en split.
Licensing Information
Copyright and licensing information: https://dumps.wikimedia.org/legal.html
All original text content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share Alike 3.0 License. Some text may be available only under the Creative Commons license; see their terms of use. Text written by some authors may be released under additional licenses or into the public domain.