With a Training Cost of $294,000, DeepSeek-R1 Was Featured on the Cover of Nature, Becoming the First Mainstream Large Model to Pass Peer Review in an Authoritative Journal and Drawing Positive Reviews

On September 17, research results related to DeepSeek-R1 appeared on the cover of Nature, and the news quickly sparked heated discussion in the global academic community. In fact, the underlying research had already been published as a preprint on arXiv in January this year. The significance of the Nature publication is that the work has now been peer-reviewed by an authoritative journal. In other words, outside experts were not merely passive recipients of one-way information; they could ask questions and request additional material from the author team through a collaborative process supervised and managed by an independent third party (the editors), a first in the industry.
More importantly, unlike the January preprint, which outlined the research methods and DeepSeek-R1's performance on a series of evaluation benchmarks, the formally published paper further discloses the model's training cost. According to a report from Nature News, training DeepSeek-R1 cost only the equivalent of US$294,000. Although DeepSeek invested approximately US$6 million in the underlying LLM on which R1 is built, the total is still far below the tens of millions of dollars the industry generally assumes is required to train a frontier model.
* Preprint address:
https://hyper.ai/cn/papers/2504.07128

DeepSeek stated that training DeepSeek-R1-Zero used 648 H800 GPUs for approximately 198 hours, and that training DeepSeek-R1 likewise used 648 H800 GPUs for approximately 80 hours (around four days). Building the SFT dataset consumed roughly another 5,000 GPU-hours. The specific costs are broken down in the accompanying figure.
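To put those figures in perspective, the back-of-the-envelope sketch below simply adds up the GPU-hours quoted above and derives the hourly rental rate that the reported US$294,000 total would imply. The per-stage grouping, and the assumption that all three items are covered by that total, are ours rather than the paper's.

```python
# Back-of-the-envelope GPU-hour accounting based on the figures quoted above.
# The hourly rate is not stated in this article; it is derived here only to show
# what rate the reported US$294,000 total would imply if it covers all three items.
stages = {
    "R1-Zero RL training":   648 * 198,  # 648 H800 GPUs x ~198 hours
    "R1 RL training":        648 * 80,   # 648 H800 GPUs x ~80 hours
    "SFT data construction": 5_000,      # ~5,000 GPU-hours
}

total_gpu_hours = sum(stages.values())
reported_cost_usd = 294_000              # figure reported by Nature News

for name, hours in stages.items():
    print(f"{name}: {hours:,} GPU-hours")
print(f"Total: {total_gpu_hours:,} GPU-hours")
print(f"Implied blended rate: ${reported_cost_usd / total_gpu_hours:.2f}/GPU-hour")
```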
Large-scale reinforcement learning improves reasoning capabilities
The importance of reasoning capability in large models is self-evident, and it has become a key research direction across the industry. However, acquiring reasoning capability during pre-training typically requires enormous computing resources. Prior work has shown that LLM reasoning can be enhanced through chain-of-thought (CoT) prompting, or further improved by learning from high-quality multi-step reasoning trajectories in the post-training phase. Although effective, these methods have clear limitations. For example, reasoning processes that rely on manual annotation limit scalability and introduce cognitive biases. In addition, because the model is restricted to imitating how humans think, its performance is essentially bounded by the human-provided examples, and it cannot explore better reasoning paths that go beyond human thinking patterns.
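For readers unfamiliar with the technique, chain-of-thought prompting simply instructs the model to write out intermediate steps before answering. A minimal illustration follows; the question and prompt wording are ours, not from the paper.

```python
# Minimal illustration of chain-of-thought (CoT) prompting: the instruction asks
# the model to spell out intermediate steps before committing to a final answer.
question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step, then state the final answer on the last line."
)
print(cot_prompt)
# A capable LLM is expected to produce intermediate steps such as "120 / 1.5 = 80"
# before the final answer "80 km/h".
```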
To address this, DeepSeek started from DeepSeek-V3 Base, adopted Group Relative Policy Optimization (GRPO) as its RL framework, and skipped the conventional supervised fine-tuning (SFT) stage that usually precedes RL training. This design choice stemmed from the team's hypothesis: artificially defined reasoning patterns may limit a model's exploration, whereas unrestricted RL training can promote the emergence of new reasoning capabilities in LLMs.
On this basis, the team developed DeepSeek-R1-Zero, which exhibits diverse and complex reasoning behaviors. To solve reasoning problems, the model tends to generate longer responses, incorporating verification, reflection, and the exploration of alternative solutions into each answer. Although the team never explicitly taught the model how to reason, it still learned better reasoning strategies through RL. GRPO, the algorithm the team used, was originally proposed to simplify training and reduce the resource consumption of Proximal Policy Optimization (PPO): it requires no critic model of the same size as the policy model, instead estimating the baseline directly from the scores of a group of sampled outputs.
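A minimal sketch of that group-relative baseline, assuming a group of sampled answers scored by some reward function; the clipped policy-gradient update and KL term of the full algorithm are omitted for brevity.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's core trick: instead of a critic network the size of the policy,
    the baseline is the mean reward of a group of sampled answers, and each
    answer's advantage is its reward normalized against that group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 8 sampled answers to the same question, scored by a rule-based reward.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]   # 1 = correct, 0 = incorrect
advantages = group_relative_advantages(rewards)
# Correct answers get positive advantages, incorrect ones negative; these advantages
# then weight a PPO-style clipped policy-gradient update (not shown here).
print(advantages)
```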
Furthermore, the team employed a rule-based reward system to compute accuracy and format rewards. Building on GRPO and this reward design, the team devised a template that requires DeepSeek-R1-Zero to first generate a reasoning process and then produce a final answer. During training, the prompts consisted of specific reasoning problems rather than hand-crafted instructions.

Specifically, after receiving a user's question, the model first outputs its reasoning process inside a "think" tag and then gives the final answer inside an "answer" tag, allowing it to autonomously explore effective reasoning paths during reinforcement learning. The research team used the rule-based reward system to evaluate the answers DeepSeek-R1-Zero produced in the experiments, thereby ensuring the stability and scalability of the training process.
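A minimal sketch of what such a template and rule-based reward could look like; the exact template wording, tag handling, and reward weights used by DeepSeek are not reproduced in this article, so everything below is an illustrative assumption.

```python
import re

# Illustrative training template: the model must reason inside <think> tags and
# answer inside <answer> tags (the exact wording used by DeepSeek is not quoted here).
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks through "
    "the reasoning process in <think></think> tags and then gives the final answer "
    "in <answer></answer> tags.\nUser: {question}\nAssistant:"
)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: a format reward for using the required tags
    plus an accuracy reward for matching the reference answer. Weights are assumed."""
    format_ok = bool(
        re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL)
    )
    answer_match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = answer_match.group(1).strip() if answer_match else ""
    accuracy = 1.0 if answer == reference_answer.strip() else 0.0
    return 0.1 * float(format_ok) + 1.0 * accuracy

completion = "<think>120 / 1.5 = 80</think> <answer>80</answer>"
print(rule_based_reward(completion, "80"))   # 1.1 -> well-formatted and correct
```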
Evaluation results show that DeepSeek-R1-Zero's pass@1 score on the AIME 2024 mathematics competition improved significantly, from an initial 15.6% to 77.9%; with a self-consistency decoding strategy, accuracy rises further to 86.7%, exceeding the average level of human contestants.
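The two numbers correspond to different decoding strategies: pass@1 averages the correctness of individual samples, while self-consistency samples several reasoning paths and majority-votes on the final answer. A minimal sketch under those assumptions, with made-up sample data for illustration:

```python
from collections import Counter

def pass_at_1(per_sample_correct):
    """pass@1 as commonly used for AIME-style scoring: average correctness
    over independent samples (1 = correct, 0 = incorrect)."""
    return sum(per_sample_correct) / len(per_sample_correct)

def self_consistency(sampled_answers):
    """Self-consistency decoding: sample several reasoning paths for one question
    and return the most common final answer (majority vote)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Illustrative data: five sampled final answers to one question, and per-sample
# correctness flags for pass@1.
print(self_consistency(["70", "64", "70", "70", "56"]))   # -> "70"
print(pass_at_1([1, 0, 1, 1, 0]))                          # -> 0.6
```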
In addition to mathematical tasks, the model also performed well in programming competitions and graduate-level biology, physics, and chemistry problems, fully verifying the effectiveness of reinforcement learning in improving the reasoning capabilities of large language models.

Furthermore, during reinforcement learning, DeepSeek-R1-Zero not only grew progressively stronger at reasoning as training proceeded, but also exhibited clear signs of self-evolution. Experimental data showed that, driven by its intrinsic adaptation to the training signal, the model's average reasoning length kept increasing during training and its reasoning paths were continuously revised: it learned to proactively pause, review, and correct earlier reasoning steps, enabling reflective reasoning and the systematic exploration of alternative solutions.

Furthermore, to address DeepSeek-R1-Zero's problems of poor readability and language mixing, the research team developed DeepSeek-R1. Its workflow is as follows (a schematic sketch appears after the list):

* Based on DeepSeek-V3, conversational cold-start data aligned with human thinking is collected and used to produce DeepSeek-R1 Dev1;
* DeepSeek-R1 Dev1 undergoes reinforcement learning and sampling on this data, and DeepSeek-R1 Dev2 incorporates both reasoning and non-reasoning datasets into its SFT process;
* DeepSeek-R1 Dev3 advances to a second reinforcement-learning stage to improve the model's helpfulness and harmlessness, finally yielding DeepSeek-R1.
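A runnable schematic of that staged pipeline, where every function is a stub that merely labels the artefact a real training stage would produce; none of the names below come from DeepSeek's actual code.

```python
# Schematic of the staged pipeline described above. Each function is a stub that
# labels the artefact a real training stage would produce; nothing here is DeepSeek code.

def supervised_finetune(model: str, data: str) -> str:
    return f"SFT({model}, {data})"

def reinforcement_learning(model: str, stage: int) -> str:
    return f"RL_stage{stage}({model})"

def build_deepseek_r1(base: str = "DeepSeek-V3-Base") -> str:
    dev1 = supervised_finetune(base, "cold_start_CoT_data")           # DeepSeek-R1 Dev1
    dev1 = reinforcement_learning(dev1, stage=1)                      # reasoning-focused RL + sampling
    dev2 = supervised_finetune(dev1, "reasoning_plus_non_reasoning")  # DeepSeek-R1 Dev2
    dev3 = reinforcement_learning(dev2, stage=2)                      # helpfulness/harmlessness RL
    return dev3                                                       # released as DeepSeek-R1

print(build_deepseek_r1())
```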

The experimental results show that, compared with DeepSeek-R1-Zero and DeepSeek-R1 Dev1, DeepSeek-R1 significantly improves instruction-following performance at each development stage, scoring higher on the IF-Eval and Arena-Hard benchmarks.

The first large-scale model to pass peer review in a prestigious journal
As the first LLM to undergo peer review, DeepSeek-R1 graced the cover of Nature upon publication. In the editorial "Bring us your LLMs: why peer review is good for AI models," Nature noted that peer review is an effective countermeasure to marketing hype in the AI industry. Almost all mainstream large AI models have yet to undergo independent peer review, a gap that "has finally been filled by DeepSeek."

Commenting on this, Subbarao Kambhampati, a researcher at Arizona State University and former AAAI president, said that he took part in the peer review and sees it as a welcome trend. He hopes more developers of frontier models will follow suit and submit their models' technical details to peer review.

Wind Info reported that, compared with the initial version released in January, the paper reveals more details about the model's training process and directly addresses the earlier questions about distillation. DeepSeek-R1 can thus be seen as a template for more transparent and standardized AI research practices in the future.

References:
1. https://www.nature.com/articles/d41586-025-03015-6
2. https://www.nature.com/articles/d41586-025-02979-9
3. https://www.nature.com/articles/s41586-025-09422