
Sutra 10B Pretraining Teaching and Training Dataset

Date

4 hours ago

License

Apache 2.0

Sutra 10B Pretraining is a high-quality teaching dataset for pretraining large language models. Generated with the Sutra framework, it consists of structured educational content intended to improve pretraining efficiency. It is the largest dataset in the Sutra series and is designed to demonstrate how dense, well-curated data can deliver strong pretraining performance for small language models.

The dataset contains 10,193,029 teaching records totaling over 10 billion tokens, covering nine major domains: interdisciplinary studies, technology, science, social studies, mathematics, life skills, arts and creativity, language arts, and philosophy and ethics. The data follows an established teaching paradigm, organized into 10 difficulty levels from basic to advanced, giving it a clear hierarchy and systematic structure.
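Given the domain and difficulty structure described above, a common first step is to slice the records into a curriculum subset. The sketch below is illustrative only: the field names (`domain`, `difficulty`, `text`) are assumptions, since the page does not specify the actual record schema.

```python
# Minimal sketch: filtering Sutra-style teaching records by domain and
# difficulty band. Field names are hypothetical; the real dataset schema
# may differ.

records = [
    {"domain": "mathematics", "difficulty": 3, "text": "Intro to fractions"},
    {"domain": "science", "difficulty": 9, "text": "Quantum tunneling"},
    {"domain": "mathematics", "difficulty": 8, "text": "Abstract algebra"},
]

def select(records, domain=None, min_difficulty=1, max_difficulty=10):
    """Return records matching an optional domain and a difficulty band (1-10)."""
    return [
        r for r in records
        if (domain is None or r["domain"] == domain)
        and min_difficulty <= r["difficulty"] <= max_difficulty
    ]

# Example: advanced mathematics records only (difficulty 5 and up).
advanced_math = select(records, domain="mathematics", min_difficulty=5)
```

Filtering by difficulty level like this makes it straightforward to assemble staged (easy-to-hard) pretraining mixes from the 10-level hierarchy.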
