Proof-Pile-2 Mathematical Dataset
Date: a year ago
Size: 47.57 GB

Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents. It blends scientific papers, math-related web content, and mathematical code, and is current as of April 2023 (with the exception of a specific subset of Lean proof steps). The dataset was created to train the Llemma 7B and Llemma 34B models.
It consists of three subsets:
arxiv (29B tokens): RedPajama's ArXiv subset.
open-web-math (15B tokens): OpenWebMath, a dataset containing many high-quality mathematical texts from the Internet.
algebraic-stack (11B tokens): a new dataset of mathematical code covering numerical computing, computer algebra, and formal mathematics.
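
As a quick sanity check on the figures above, the following minimal sketch sums the three subset sizes and reports each subset's share of the total. The dictionary and function names are illustrative only and are not part of the dataset or any library; the counts are the rounded values listed above.

```python
# Approximate token counts per subset, in billions, as listed above.
# The names below are illustrative, not part of the dataset's tooling.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def subset_shares(tokens_b):
    """Return each subset's fraction of the total token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


total_b = sum(SUBSET_TOKENS_B.values())
shares = subset_shares(SUBSET_TOKENS_B)

print(f"total: {total_b}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these rounded counts the three subsets add up to 55B tokens, with arxiv contributing roughly half of the total.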
proof-pile-2.torrent
Seeding: 1, Downloading: 2, Completed: 82, Total downloads: 151