HyperAI

Proof-Pile-2 Mathematical Dataset

Date: a year ago
Size: 47.57 GB
Organization: Princeton University
Publish URL: huggingface.co

Proof-Pile-2 is a dataset of mathematical and scientific documents totaling 55 billion tokens. It blends scientific papers, math-related web content, and mathematical code, with a knowledge cutoff of April 2023 (excluding a specific subset of Lean proof steps). The dataset was created to train the Llemma 7B and Llemma 34B models.

It consists of three subsets:

  • arxiv (29B tokens): RedPajama's ArXiv subset
  • open-web-math (15B tokens): OpenWebMath, a dataset of high-quality mathematical text from the Internet.
  • algebraic-stack (11B tokens): A new dataset of mathematical code covering numerical computing, computer algebra, and formal mathematics.
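
As a quick sanity check on the figures above, the following minimal sketch sums the three subset token counts and computes each subset's share of the total. The names `SUBSET_TOKENS_B` and `token_shares` are hypothetical and introduced here only for illustration; the token counts are the rounded values from the list above.

```python
# Token counts (in billions) as listed above for each subset.
# These are the rounded figures from the listing, not exact counts.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def token_shares(tokens_by_subset):
    """Return each subset's share of the total token count, as a fraction."""
    total = sum(tokens_by_subset.values())
    return {name: count / total for name, count in tokens_by_subset.items()}


# Total across all subsets, in billions of tokens.
total_b = sum(SUBSET_TOKENS_B.values())
shares = token_shares(SUBSET_TOKENS_B)

print(f"total: {total_b}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three subsets add up to the stated 55B tokens, with the arxiv subset contributing a little over half of the total.
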
proof-pile-2.torrent
  • proof-pile-2/
    • README.md (1.37 KB)
    • README.txt (2.73 KB)
    • data/
      • proof-pile-2.zip (47.57 GB)