Domain-Adapted LLM Accelerates Battery Research with Advanced Reasoning Capabilities
Scientific research, especially in complex fields like battery innovation, often moves slowly because evaluating candidate materials by hand is labor-intensive. To accelerate this process, SES AI has developed the Molecular Universe LLM, a 70B-parameter scientific model enhanced with reasoning capabilities. Built on NVIDIA's ecosystem, the model shows how domain-adapted large language models (LLMs) can drive breakthroughs in scientific innovation.

The Challenge and the Solution

General-purpose LLMs, while powerful, lack the specialized knowledge needed to excel in niche scientific domains. Domain-adaptive pretraining (DAPT) addresses this by extending an existing foundation model with custom, domain-relevant datasets. SES AI's Molecular Universe LLM, derived from Llama 3.1 70B, demonstrates how this approach can transform battery research.

Training Pipeline

Infrastructure Setup

The Molecular Universe LLM was trained on NVIDIA DGX Cloud, a fully managed AI training platform, using 128 NVIDIA H100 GPUs. DGX Cloud simplifies the setup and management of large-scale training, sustaining high GPU utilization and minimizing time to value. The NVIDIA NeMo Framework optimized training with 4D parallelism, mixed-precision techniques, and flash attention, making the run computationally efficient and capable of handling long sequences.

Step 1: Continuous Pretraining

To give the model deep domain expertise, continuous pretraining was conducted on a corpus of 19 million scientific papers. NeMo Curator handled data curation, converting PDFs to plain text and applying advanced filtering to reduce redundancy and maintain data quality, which yielded 17 million unique, high-quality records (a minimal curation sketch appears after the pipeline walkthrough below). The model was trained with an input sequence length of 8,192 tokens and a global batch of 524,288 tokens per step, i.e., 64 sequences. Training took 144 hours in bfloat16 precision, and domain-adaptive pretraining proved highly efficient, requiring only about 1.5% of the compute of full pretraining.

Step 2: Supervised Fine-Tuning (SFT)

To improve the model's ability to follow instructions and generate task-specific responses, supervised fine-tuning (SFT) was employed. SES AI used the NVIDIA Llama 3.1 70B NIM to generate a high-quality SFT dataset comprising 200,000 instruction samples and 90,000 general chat samples (the second sketch below shows what NIM-driven generation can look like). The dataset was tokenized and the model fine-tuned with the NeMo Framework on NVIDIA DGX Cloud, completing in just 32 hours. The training and validation loss declined rapidly at first and then stabilized, indicating effective learning without overfitting.

Step 3: High-Quality Reasoning Post-Training

Even with domain-specific pretraining and SFT, models struggle with complex scientific problems that require multi-step reasoning. To strengthen this capability, the Molecular Universe Chat model was further fine-tuned on a curated set of 25,000 samples based on s1K Reasoning Data, filtered for high-quality, difficult questions, with the context length extended to 16,000 tokens to accommodate long reasoning traces (see the third sketch below). Post-training took approximately 12 hours on 64 H100 GPUs and significantly improved the model's reasoning and factual accuracy.
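To make the Step 1 curation concrete, here is a minimal, illustrative NeMo Curator pipeline. The file paths, field names, and filter thresholds are assumptions for this sketch, not SES AI's production settings; PDF extraction and deduplication, which the actual pipeline also performed, are not shown.

```python
# Illustrative NeMo Curator pipeline: load extracted plain-text records,
# drop very short documents, and filter text dominated by repeated n-grams.
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import RepeatingTopNGramsFilter, WordCountFilter

# One JSON object per line, e.g. {"id": ..., "text": ...}, produced upstream
# by PDF-to-text extraction (not shown here).
dataset = DocumentDataset.read_json("extracted_papers/")

pipeline = Sequential([
    # Keep only documents with a reasonable amount of body text.
    ScoreFilter(WordCountFilter(min_words=100), text_field="text"),
    # Drop documents whose top bigram repeats excessively (OCR noise, boilerplate).
    ScoreFilter(
        RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
        text_field="text",
    ),
])

curated = pipeline(dataset)
curated.to_json("curated_papers/")
```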
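For Step 2, NIM microservices expose an OpenAI-compatible API, so SFT data generation can be driven by a standard client. The endpoint URL, seed questions, and record schema below are illustrative assumptions, not SES AI's actual setup.

```python
# Sketch: generate instruction/response pairs from a locally deployed
# Llama 3.1 70B NIM via its OpenAI-compatible endpoint.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # local NIM

seed_questions = [
    "Why does fluoroethylene carbonate improve SEI stability?",
    "Summarize the trade-offs of high-concentration electrolytes.",
]

with open("sft_samples.jsonl", "w") as f:
    for question in seed_questions:
        resp = client.chat.completions.create(
            model="meta/llama-3.1-70b-instruct",
            messages=[{"role": "user", "content": question}],
            temperature=0.7,
            max_tokens=512,
        )
        record = {"input": question, "output": resp.choices[0].message.content}
        f.write(json.dumps(record) + "\n")
```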
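For Step 3, a common pattern for reasoning post-training is to fold each reasoning trace into the assistant turn and enforce the extended context budget. The record fields, the `<think>` delimiters, and the helper below are hypothetical conventions for this sketch, not SES AI's exact recipe.

```python
# Sketch: pack s1K-style records (question, reasoning trace, final answer)
# into chat-format SFT samples within a 16K-token context budget.
from typing import Optional
from transformers import AutoTokenizer

# Assumes access to the Llama 3.1 tokenizer (the repo is gated on Hugging Face).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
MAX_TOKENS = 16_000  # extended context length used during reasoning post-training

def to_sft_sample(record: dict) -> Optional[dict]:
    """Fold the reasoning trace into the assistant turn; drop oversized samples.

    `record` is assumed to carry 'question', 'thinking', and 'answer' fields;
    the <think> delimiters are one common convention, not a confirmed format.
    """
    answer = f"<think>\n{record['thinking']}\n</think>\n{record['answer']}"
    messages = [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": answer},
    ]
    # apply_chat_template returns token IDs by default, so len() gives the length.
    n_tokens = len(tokenizer.apply_chat_template(messages))
    return {"messages": messages} if n_tokens <= MAX_TOKENS else None
```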
Results

The Molecular Universe LLM was evaluated on both public and custom battery-specific benchmarks. On GPQA Diamond it scored 0.72, outperforming most similarly sized models and even larger models such as DeepSeek-R1. The model also excelled at battery-specific tasks, including question answering (Q&A), multiple-choice questions (MCQ), reading comprehension, summarization, and reasoning. Despite minor performance gaps in some areas, it delivered competitive results with far fewer parameters and a lower training cost than OpenAI o1.

Impact and Future Work

SES AI's Molecular Universe LLM transforms battery research by automating the evaluation of electrolyte solvents and additives. This automation reduces the manual labor required from scientists, allowing them to evaluate thousands of candidates each day instead of dozens. The model will be integrated into SES AI's materials discovery platform, Molecular Universe (MU-0), giving researchers a consolidated search interface for exploring vast databases of candidate small molecules. Future efforts will focus on refining the model's reasoning through specialized battery-focused datasets and on exploring reinforcement learning from human feedback (RLHF) to further improve performance. This work underscores the potential of domain-adapted, mid-sized LLMs with reasoning capabilities, paving the way for more efficient, specialized models across scientific domains.

Industry Insights and Company Profile

Industry insiders commend SES AI's approach for its pragmatic, cost-effective use of existing technology to achieve significant gains in scientific productivity. SES AI, a leader in battery innovation, continues to push the boundaries of AI in materials science, demonstrating the transformative impact of domain-specific LLMs. NVIDIA's NeMo Framework and DGX Cloud were instrumental in this work, offering scalable solutions for training and deploying advanced models. Detailed documentation and resources for the NeMo Framework and NVIDIA's suite of tools are available on the official NVIDIA website.