
Wikipedia Dataset

Date

a year ago

Size

57.98 GB

Organization

Publish URL

huggingface.co

License

CC BY-SA 3.0

Categories

Dataset Summary

The Wikipedia dataset contains cleaned articles in all languages.

The dataset is built from the Wikipedia dumps, with one subset per language, each consisting of a single train split.

Each example contains the content of a complete Wikipedia article, cleaned up to remove markup and unwanted parts (like "references", etc.).
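
For reference, the articles can be read with the Hugging Face datasets library. The following is a minimal sketch, assuming this dataset corresponds to the wikimedia/wikipedia repository on the Hugging Face Hub and that the 20231101.en configuration named below is available:

```python
from datasets import load_dataset

# One configuration per language; each configuration has a single "train" split.
# "wikimedia/wikipedia" and "20231101.en" are assumptions based on the
# publish URL (huggingface.co) and the subset named in this page.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# Each example is one full, cleaned Wikipedia article.
article = wiki[0]
print(article.keys())      # typically: id, url, title, text
print(article["title"])

# For a subset this large, streaming avoids downloading everything up front:
# wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
#                     split="train", streaming=True)
```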

Data Visualization

See the Nomic Atlas map, which visualizes the 6.4 million samples of the 20231101.en subset.

Licensing Information

Copyright and license information: https://dumps.wikimedia.org/legal.html

All original text content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike 3.0 License. Some text may be available only under a Creative Commons license; see the Wikimedia terms of use. Some texts written by authors may be released under additional licenses or be in the public domain.

wikipedia.torrent
Seeding 1 · Downloading 2 · Completed 181 · Total Downloads 453
  • wikipedia/
    • README.md
      1.54 KB
    • README.txt
      3.09 KB
    • data/
      • wikipedia.zip
        57.98 GB