
The first peer-reviewed paper on DeepSeek’s AI model shows how the Chinese start-up built the game-changing LLM for roughly $300,000.
DeepSeek’s potent artificial intelligence (AI) model R1, whose release in January sent the US stock market plunging, did not owe its success to training on the outputs of its competitors, according to researchers at the Chinese company. The statement appears in documents published today in Nature alongside a peer-reviewed version of the R1 model.
Designed to excel at “reasoning” tasks such as coding and mathematics, R1 is a less expensive alternative to tools created by US technology companies. As an “open-weight” model, it can be downloaded by anyone, and with 10.9 million downloads it is the most downloaded model on the Hugging Face AI community platform to date.
The study updates a January preprint describing how DeepSeek adapted a standard large language model (LLM) to handle reasoning tasks. Its supplementary material discloses for the first time how much R1 cost to train: just $294,000. That figure comes on top of the $6 million or so that the Hangzhou-based company spent to create the base LLM on which R1 is built, but the total is still far below the tens of millions of dollars that rival models are thought to have cost. DeepSeek says R1 was trained mainly on Nvidia’s H800 chips, which have been barred from export to China under US export controls since 2023.
Thorough evaluation
R1 is thought to be the first major LLM to undergo the peer-review process. “This is a very welcome precedent,” says Lewis Tunstall, a machine-learning engineer at Hugging Face. Without a norm of sharing a large part of this process publicly, he says, it becomes very hard to evaluate whether these systems pose risks.
In response to peer-review feedback, the DeepSeek team clarified technical details, such as the kinds of data the model was trained on and its safety, and reduced the anthropomorphizing in their descriptions. “Going through a rigorous peer-review process certainly helps verify the validity and usefulness of the model,” says Huan Sun, an AI researcher at Ohio State University in Columbus. “More companies ought to follow suit.”
DeepSeek’s key innovation in creating R1 was to use an automated trial-and-error approach known as pure reinforcement learning. Rather than teaching the model to mimic human-selected reasoning examples, the process rewarded it for arriving at correct answers. In this way, the company says, the model learned its own reasoning-like strategies, such as how to verify its working without following human-prescribed procedures. To boost efficiency, the model scored its own attempts using estimates, through a technique called group relative policy optimization (GRPO), rather than relying on a separate algorithm to do so.
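The group-based scoring idea described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek’s actual code: it assumes we already have a scalar reward for each of several answers sampled for the same prompt, and shows how each answer’s score is normalized against its own group’s statistics instead of against a separately trained estimator. The function name is hypothetical.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    Instead of training a separate "critic" network to estimate a
    baseline, the group's mean reward serves as the baseline and its
    standard deviation as the scale (eps avoids division by zero).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four answers sampled for one maths problem,
# rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, so a policy-gradient update would push the model towards the behaviour that produced the correct answers, with no extra value network to train.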
AI researchers have found the model to be “quite influential,” according to Sun. “Almost every piece of 2025 research that uses reinforcement learning in LLMs may have been influenced in some way by R1.”
Training technique
In January, media reports suggested that researchers at OpenAI, the San Francisco, California-based firm behind ChatGPT and the ‘o’ series of reasoning models, believed DeepSeek had trained R1 on outputs from OpenAI’s models, a technique that could boost a model’s capabilities while using fewer resources.
The publication does not include DeepSeek’s training data. However, in discussions with referees, the company’s researchers stated that R1 did not learn by copying reasoning examples generated by OpenAI models. They did acknowledge that, like most other LLMs, R1’s base model was trained on the web, so it will have ingested whatever AI-generated content was already there.
R1 is very competitive for researchers, says Sun. Although it did not rank first in accuracy, Sun and colleagues found that it was among the best models at balancing ability and cost on ScienceAgentBench, a benchmark of scientific tasks including data analysis and visualization.
Other researchers are now applying the techniques used to build R1 to improve the reasoning-like abilities of existing LLMs and to extend them to domains beyond coding and mathematics, says Tunstall. In that sense, R1 has “kick-started a revolution”, he adds.






