Proof-Pile-2 Mathematical Dataset

Date: 2 years ago

Size: 47.57 GB

Organization: Princeton University

Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents. It blends scientific papers, math-related web content, and mathematical code, with a knowledge cutoff of April 2023 (excluding a specific subset of Lean proof steps). The dataset was created to train the Llemma 7B and Llemma 34B models.

It consists of three subsets:

  • arxiv (29B tokens): RedPajama's ArXiv subset.
  • open-web-math (15B tokens): OpenWebMath, a dataset of high-quality mathematical text from the internet.
  • algebraic-stack (11B tokens): A new dataset of mathematical code covering numerical computing, computer algebra, and formal mathematics.
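
The subset sizes listed above can be checked against the 55B-token total with a few lines of arithmetic. The following is a minimal sketch in Python: the token counts are copied from the list, while the dictionary and function names are illustrative choices and are not part of the dataset or of any official tooling.

```python
# Token counts per subset, in billions, copied from the list above.
# The names below are illustrative, not an official API.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def subset_shares(tokens_b):
    """Return each subset's fraction of the total token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


total_b = sum(SUBSET_TOKENS_B.values())
shares = subset_shares(SUBSET_TOKENS_B)

print(f"total: {total_b}B tokens")  # total: 55B tokens
for name, share in shares.items():
    # Prints the share of each subset as a percentage with one decimal.
    print(f"{name}: {share:.1%}")
```

By this count the arxiv subset makes up a little over half of the tokens, with open-web-math at roughly 27% and algebraic-stack at 20%.
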
proof-pile-2.torrent
Seeding: 2 | Downloading: 0 | Completed: 151 | Total downloads: 277
  • proof-pile-2/
    • README.md (1.37 KB)
    • README.txt (2.73 KB)
    • data/
      • proof-pile-2.zip (47.57 GB)
