MiniMax-M1: World's First Open-Weight Hybrid-Attention Model Outshines Competitors in Long-Context Tasks
Model Overview

We are proud to introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model builds on our previous MiniMax-Text-01, which has a total of 456 billion parameters, of which 45.9 billion are activated per token. MiniMax-M1 keeps the same context length of 1 million tokens, eight times that of DeepSeek R1, making it highly effective for tasks that require processing extensive inputs.

The lightning attention mechanism substantially reduces compute at inference time: at a generation length of 100,000 tokens, MiniMax-M1 consumes only about 25% of the FLOPs that DeepSeek R1 requires. This efficiency is crucial for tasks that demand long generations, such as software engineering, tool use, and long-context reasoning.

Training Methodology

MiniMax-M1 is trained with large-scale reinforcement learning (RL) on a wide range of problems, from traditional mathematics to real-world software engineering scenarios. To make this training efficient, we developed an innovative RL scaling framework with two key components:

- CISPO (Clipped IS-weight Policy Optimization): unlike conventional methods that clip token updates, CISPO clips the importance sampling weights. This approach outperforms competing RL variants and scales better; a minimal sketch of the clipping idea follows at the end of this section.
- Hybrid-attention design: the hybrid architecture naturally improves RL efficiency, and our framework addresses the unique challenges that arise when scaling RL with this architecture.

We trained two versions of MiniMax-M1 with thinking budgets of 40K and 80K tokens, respectively. Both versions perform strongly on standard benchmarks and excel in particular at complex software engineering, tool use, and long-context tasks.
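As an illustration of the clipping idea behind CISPO, here is a minimal, hypothetical PyTorch sketch. It is not the training code used for MiniMax-M1: the exact objective, the clipping thresholds `eps_low`/`eps_high`, and the advantage estimation are assumptions for illustration. The key contrast with PPO-style objectives is that the importance sampling weight itself is clipped and detached, so no token's gradient contribution is dropped outright.

```python
import torch

def cispo_style_loss(logp_new, logp_old, advantages,
                     eps_low=0.2, eps_high=0.2):
    """Illustrative CISPO-style policy loss (hypothetical sketch).

    logp_new:   log-probs of sampled tokens under the current policy
    logp_old:   log-probs of the same tokens under the behavior policy
    advantages: per-token advantage estimates
    """
    # Importance sampling weight for each token.
    ratio = torch.exp(logp_new - logp_old)

    # Clip the IS weight itself and stop its gradient; unlike PPO-style
    # token clipping, every token still contributes a policy-gradient term.
    clipped_weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()

    # REINFORCE-style term weighted by the clipped IS weight.
    return -(clipped_weight * advantages * logp_new).mean()
```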
Benchmark Performance Comparison

Below is a comparison of MiniMax-M1's performance against leading commercial and open-weight models across various task categories.

| Category | Benchmark | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro | OpenAI-o3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 76.0 | 92.0 | 91.6 |
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 75.5 | 88.0 | 88.9 |
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.8 | 98.1 |
| General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 56.6 | 77.1 | 75.8 |
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 70.3 | 69.3 | — |
| Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 79.6 | 86.4 | 83.3 |
| | HLE (no tools) | 8.4 | 7.2 | 7.6 | 17.7 | 8.6* | 10.7 | 21.6 | 20.3 |
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 95.1 | 91.6 | 95.8 |
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 86.0 | 85.0 |
| Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 72.5 | 67.2 | 69.1 |
| Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 76.8 | 56.5 |
| | OpenAI-MRCR (1M) | 56.2 | 58.6 | — | — | — | — | 58.8 | — |
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 65.0 | 58.8 |
| Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | — | 59.6 | 50.0 | 52.0 |
| | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | — | 81.4 | 67.0 | 73.9 |
| Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | — | 54.0 | 49.4 |
| General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 45.8 | 51.8 | 56.5 |

* Evaluated on the text-only subset of HLE.

SWE-bench Methodology

Our results for SWE-bench are derived from an agentless scaffold.
Our methodology uses a two-stage localization process: initial coarse-grained file localization followed by fine-grained localization to specific files and code elements. Scores are calculated on a subset of 486 verified tasks that are compatible with our infrastructure; tasks excluded because of incompatibility are listed in our documentation.

TAU-bench Methodology

We evaluate TAU-bench with GPT-4.1 as the user model and without any custom tools. The maximum number of interaction steps is limited to 40. Our general system prompt is included in the evaluation documentation.

Deployment Guide

To access MiniMax-M1, download the model from the HuggingFace repository. For production serving we recommend vLLM, which offers high throughput and efficient memory management for serving large language models; detailed instructions are available in our vLLM Deployment Guide, and an illustrative Python sketch appears at the end of this document. Alternatively, you can deploy the model with Transformers; instructions for this method can be found in our MiniMax-M1 Transformers Deployment Guide.

Function Calling

MiniMax-M1 supports function calling, allowing the model to recognize when external functions should be invoked and to output the call parameters in a structured format. The MiniMax-M1 Function Call Guide provides comprehensive instructions for using this feature; a sketch of the general request pattern also appears at the end of this document.

Chatbot & API

For general use and evaluation, we offer a chatbot with online search capabilities and an online API for developers. In addition, the MiniMax MCP (Model Context Protocol) Server supports advanced functionalities such as video generation, image generation, speech synthesis, and voice cloning, extending the versatility and application scope of MiniMax-M1.

Contact Us

For further inquiries or assistance, please contact us at model@minimax.io.
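The following is a minimal offline-inference sketch for the vLLM route referenced in the Deployment Guide above. It is illustrative rather than official: the HuggingFace repo id, tensor-parallel degree, context cap, and sampling settings are assumptions to be replaced with the values given in the vLLM Deployment Guide.

```python
# Minimal vLLM offline-inference sketch (illustrative; settings are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M1-80k",   # assumed HuggingFace repo id
    trust_remote_code=True,             # load the custom MiniMax model code
    tensor_parallel_size=8,             # assumed: shard weights across 8 GPUs
    max_model_len=128_000,              # assumed context cap to fit GPU memory
)

params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=4096)
outputs = llm.generate(
    ["Summarize the MiniMax-M1 architecture in two sentences."], params
)
print(outputs[0].outputs[0].text)
```

For an HTTP service, vLLM can also expose an OpenAI-compatible endpoint, which is the setup the function-calling sketch below assumes.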
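For function calling, the authoritative prompt and output format are specified in the MiniMax-M1 Function Call Guide. The sketch below only shows the generic request pattern against an OpenAI-compatible endpoint (for example, one served by vLLM); the base URL, model id, and the `get_weather` tool are hypothetical placeholders, not part of the official guide.

```python
# Illustrative function-calling request against an assumed OpenAI-compatible
# endpoint; the URL, model id, and tool schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1-80k",  # assumed model id
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    tools=tools,
)

# If the model decides a call is needed, the structured arguments appear here.
print(resp.choices[0].message.tool_calls)
```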