Sutra 10B Pretraining Teaching and Training Dataset
Sutra 10B Pretraining is a high-quality teaching dataset for pretraining large language models. Generated by the Sutra framework, it consists of structured educational content designed to make language-model pretraining more effective. As the largest dataset in the Sutra series, it is intended to demonstrate how dense, well-curated data can deliver strong pretraining performance for small language models.
The dataset contains 10,193,029 teaching records totaling over 10 billion tokens, covering nine major areas: interdisciplinary studies, technology, science, social studies, mathematics, life skills, arts and creativity, language arts, and philosophy and ethics. The data follows a well-established teaching paradigm and is organized into ten difficulty levels from basic to advanced, giving the corpus a clear, systematic hierarchy.
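As a minimal sketch of how a corpus at this scale might be inspected, the snippet below streams a few records with the Hugging Face datasets library. The repository path and the subject and difficulty_level field names are illustrative assumptions, not the dataset's documented schema.

```python
from datasets import load_dataset

# Hypothetical Hub path -- verify against the dataset's actual
# repository before use.
ds = load_dataset("sutra/sutra-10b-pretraining", split="train", streaming=True)

# Stream a handful of records; streaming avoids downloading the
# full 10B-token corpus up front. "subject" and "difficulty_level"
# are assumed field names for the nine areas and ten levels above.
for record in ds.take(5):
    print(record.get("subject"), record.get("difficulty_level"))
```

Streaming mode is a deliberate choice here: with over 10 billion tokens, materializing the full dataset locally is rarely practical for a quick schema check.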