Sutra 10B Pretraining Teaching and Training Dataset
Sutra 10B Pretraining is a high-quality teaching dataset for pretraining large language models. Generated by the Sutra framework, it consists of structured educational content designed to make language-model pretraining more effective. As the largest dataset in the Sutra series, it is intended to demonstrate how dense, well-curated data can deliver strong pretraining performance for small language models.
The dataset contains 10,193,029 teaching records totaling over 10 billion tokens, covering nine major areas: interdisciplinary studies, technology, science, social studies, mathematics, life skills, arts and creativity, language arts, and philosophy and ethics. The data follows a well-established teaching paradigm and is organized into ten difficulty levels from basic to advanced, giving the corpus a clear, systematic hierarchy.
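As a minimal sketch of how a corpus at this scale might be inspected, the snippet below streams a few records with the Hugging Face datasets library. The repository path and the subject and difficulty_level field names are illustrative assumptions, not the dataset's documented schema.

```python
from datasets import load_dataset

# Hypothetical Hub path -- verify against the dataset's actual
# repository before use.
ds = load_dataset("sutra/sutra-10b-pretraining", split="train", streaming=True)

# Stream a handful of records; streaming avoids downloading the
# full 10B-token corpus up front. "subject" and "difficulty_level"
# are assumed field names for the nine areas and ten levels above.
for record in ds.take(5):
    print(record.get("subject"), record.get("difficulty_level"))
```

Streaming mode is a deliberate choice here: with over 10 billion tokens, materializing the full dataset locally is rarely practical for a quick schema check.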