How DeepSeek ripped up the AI playbook—and why everyone’s going to follow its lead
And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half of its innovation comes from straight engineering, says Zeiler: “They definitely have some really, really good GPU engineers on that team.”
Nvidia provides software called CUDA that engineers use to tweak the settings of their chips. But DeepSeek bypassed CUDA, writing in assembler, a low-level programming language that talks to the hardware directly, to go far beyond what Nvidia offers out of the box. “That’s as hardcore as it gets in optimizing these things,” says Zeiler. “You can do it, but basically it’s so difficult that nobody does.”
DeepSeek’s string of innovations on multiple models is impressive. But it also shows that the firm’s claim to have spent less than $6 million to train V3 is not the whole story. R1 and V3 were built on a stack of existing tech. “Maybe the very last step—the last click of the button—cost them $6 million, but the research that led up to that probably cost 10 times as much, if not more,” says Friedman. And in a blog post that cut through a lot of the hype, Anthropic cofounder and CEO Dario Amodei pointed out that DeepSeek probably has around $1 billion worth of chips, an estimate based on reports that the firm in fact used 50,000 Nvidia H100 GPUs.
A new paradigm
But why now? There are hundreds of startups around the world trying to build the next big thing. Why have we seen a string of reasoning models like OpenAI’s o1 and o3, Google DeepMind’s Gemini 2.0 Flash Thinking, and now R1 appear within weeks of each other?
The answer is that the base models—GPT-4o, Gemini 2.0, V3—are all now good enough to have reasoning-like behavior coaxed out of them. “What R1 shows is that with a strong enough base model, reinforcement learning is sufficient to elicit reasoning from a language model without any human supervision,” says Lewis Tunstall, a scientist at Hugging Face.
In other words, top US firms may have figured out how to do it but were keeping quiet. “It seems that there’s a clever way of taking your base model, your pretrained model, and turning it into a much more capable reasoning model,” says Zeiler. “And up to this point, the procedure that was required for converting a pretrained model into a reasoning model wasn’t well known. It wasn’t public.”
What’s different about R1 is that DeepSeek published how they did it. “And it turns out that it’s not that expensive a process,” says Zeiler. “The hard part is getting that pretrained model in the first place.” As Karpathy revealed at Microsoft Build last year, pretraining a model represents 99% of the work and most of the cost.
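The recipe Tunstall and Zeiler are describing, reinforcement learning driven only by a checkable reward, can be caricatured in a few lines. What follows is a minimal sketch, not DeepSeek's actual method: R1 applied a policy-gradient algorithm (GRPO) to a full language model, whereas this toy uses plain REINFORCE over three hypothetical candidate answers to an arithmetic problem. The candidate list, reward rule, and learning rate are all invented for illustration. The one thing the sketch shares with the real thing is the key property in Tunstall's quote: the policy is never shown a worked solution, only a reward for being verifiably right.

```python
import math
import random

random.seed(0)

# Toy setup: the "model" picks one of three candidate answers to 2 + 3 * 4.
# Only the reward signal (is the answer correct?) guides learning --
# no human-written reasoning trace is ever shown to the policy.
CANDIDATES = [14, 20, 11]   # 14 is correct; 20 is the precedence mistake (2+3)*4
CORRECT = 14

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    # Draw an index according to the probability distribution.
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [0.0, 0.0, 0.0]   # start with no preference among candidates
lr = 0.5

for step in range(500):
    probs = softmax(logits)
    i = sample(probs)
    reward = 1.0 if CANDIDATES[i] == CORRECT else 0.0
    # REINFORCE update: shift probability mass toward rewarded choices.
    # Wrong answers earn zero reward, so they trigger no update at all.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * reward * grad

probs = softmax(logits)
print(f"P(correct answer) after training: {probs[0]:.3f}")
```

Run to completion, the policy concentrates nearly all its probability on the correct answer despite never seeing a demonstration. Scaled up, with a language model as the policy and automated checkers (math verifiers, unit tests) supplying the reward, this is the flavor of post-training that DeepSeek made public, and it is cheap precisely because, as Zeiler notes, the expensive pretrained model already exists.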
If building reasoning models is not as hard as people thought, we can expect a proliferation of free models that are far more capable than we’ve yet seen. With the know-how out in the open, Friedman thinks, there will be more collaboration between small companies, blunting the edge that the biggest companies have enjoyed. “I think this could be a monumental moment,” he says.