A professional investor with a background as both an analyst and a software engineer wrote a bearish article on Nvidia that was widely shared by prominent Twitter accounts and became a major "culprit" in the plunge of Nvidia's stock. Nearly $600 billion of Nvidia's market value evaporated, the largest single-day loss for any individual listed company to date.
The investor, Jeffrey Emanuel, argues in essence that DeepSeek punctures the hype built up by Wall Street, the large technology companies, and Nvidia itself, and that Nvidia is overvalued: "Every investment bank recommends buying Nvidia, like the blind leading the blind, with no idea what they are talking about."
Jeffrey Emanuel said that Nvidia faces a much bumpier road than its valuation suggests in order to maintain its current growth trajectory and profit margins. There are five different directions to attack Nvidia - architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs and manufacturing democratization - and the probability that at least one of them will succeed and have a significant impact on Nvidia's profit margins or growth rates seems high. At the current valuation, the market has not taken these risks into account.
According to industry investors, this report made Emanuel an overnight celebrity on Wall Street, with many hedge funds paying him $1,000 per hour to hear his views on Nvidia and AI. He has reportedly been busy enough to talk himself hoarse, even as the money rolls in.
The following is the full report, reproduced for reference. As someone who has worked as an investment analyst for about 10 years at various long/short hedge funds (including stints at Millennium and Balyasny), and a math and computer geek who has been working on deep learning since 2010 (back when Geoff Hinton was still talking about restricted Boltzmann machines, everything was still programmed in MATLAB, and researchers were still trying to prove they could get better results in classifying handwritten digits than with support vector machines), I think I have a pretty unique perspective on the evolution of AI techniques and how they relate to stock market equity valuations. Over the past few years, I have been working more as a developer and have several popular open source projects that deal with various forms of AI models/services (e.g. see LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt, and Pastel Inference Layer to name a few recent examples). Basically, I work intensively with these cutting-edge models on a daily basis. I have 3 Claude accounts so I don't run out of requests, and I signed up for ChatGPT Pro within minutes of it going live.
I also try to stay up to date on the latest research and read through all the major technical report papers published by the major AI labs. As a result, I think I have a pretty good understanding of the field and how things are developing. At the same time, I have shorted a lot of stocks in my life and have twice won the Best Idea Award from the Value Investors Club (long TMS and short PDH, if you've been paying attention).
I say this not to show off, but to establish that I can weigh in on this issue without coming across as hopelessly naive to either technical people or professional investors. Of course, there are plenty of people who know more math and science than I do, and plenty who are better at long/short equity investing than I am, but I think few people sit as squarely in the middle of that Venn diagram as I do.
Nevertheless, whenever I meet up with friends and former colleagues in the hedge fund world, the topic quickly turns to Nvidia. It's not every day that a company goes from obscurity to being worth more than the entire UK, French, or German stock market! These friends naturally want to know my thoughts on the subject. Because I'm such a firm believer in the long-term transformative impact of this technology (I truly believe it will revolutionize every aspect of our economy and society over the next 5-10 years, in a way that is basically unprecedented), it's hard for me to argue that Nvidia's momentum will slow or stop anytime soon.
But even though I've thought valuations were too rich for my taste for the past year or so, a series of recent developments has tilted me toward my usual instinct: to be more cautious about the outlook and to question the consensus when it looks more than priced in. There is a reason the saying "what the wise man does in the beginning, the fool does in the end" is famous.
Bull Case
Before we discuss the developments that have given me pause, let's briefly recap the bull case for NVDA stock, which by now basically everyone knows. Deep learning and AI are the most transformative technologies since the internet and are expected to fundamentally change everything in our society. Nvidia has something close to a monopoly on the share of the industry's total capital expenditure that goes to training and inference infrastructure.
Some of the world’s largest and most profitable companies, such as Microsoft, Apple, Amazon, Meta, Google, Oracle, etc., have decided to do whatever it takes to stay competitive in this space because they simply cannot afford to fall behind. The amount of capital expenditures, electricity usage, square footage of new data centers built, and of course the number of GPUs, have all exploded and there seems to be no sign of slowing down. Nvidia is able to earn astonishing gross margins of up to 90%+ on its high-end products for data centers.
We’ve only scratched the surface of this bull market. There’s more going on right now that will make even the already very optimistic people even more optimistic. Besides the rise of humanoid robots (which I suspect will surprise most people when they can quickly do a lot of tasks that currently require unskilled (or even skilled) workers, like laundry, cleaning, tidying, and cooking; working in teams of workers to do construction work like remodeling bathrooms or building houses; managing warehouses and driving forklifts, etc.), there are other factors that most people haven’t even considered yet.
One of the main topics that smart people are talking about is the rise of the "new scaling laws," which provide a new paradigm for thinking about how computing needs will grow over time. The original scaling law that has driven progress in AI since the advent of AlexNet in 2012 and the invention of the Transformer architecture in 2017 is the pre-training scaling law: the more tokens we use as training data (now counted in the trillions), the larger the parameter count of the models we train, and the more compute (FLOPS) we expend training those models on those tokens, the better the final models perform on a wide variety of very useful downstream tasks.
Not only that, this improvement is somewhat predictable, to the point that leading AI labs like OpenAI and Anthropic have a pretty good idea of how good their latest models will be before they even start the actual training run: in some cases, they can predict the final model's benchmark scores to within a few percentage points. This "original scaling law" is very important, but it has always raised doubts among those who use it to extrapolate the future.
First, we seem to have exhausted the world's accumulated stock of high-quality training data. Of course, that is not entirely true: there are still many old books and journals that have not been properly digitized, or that have been digitized but not properly licensed for use as training data. The problem is that even if you count all of this material (say, the sum of "professionally" produced written English content from 1500 to 2000), it is not a huge amount in percentage terms when measured against a training corpus of nearly 15 trillion tokens, the scale at which current frontier models are trained.
To quickly check the reality of these numbers: Google Books has digitized about 40 million books so far; if an average book has 50,000 to 100,000 words, or 65,000 to 130,000 tokens, then books alone account for 2.6T to 5.2T of tokens, of course a large portion of which is already included in the training corpora used by large labs, whether strictly legal or not. There are also a lot of academic papers, with over 2 million papers on the arXiv site alone. The Library of Congress has over 3 billion pages of digitized newspapers. Added together, the total could be as high as 7T tokens, but since most of this is actually included in the training corpus, the remaining "incremental" training data may not be that important in the overall scheme of things.
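To make the arithmetic explicit, here is a quick back-of-the-envelope check in Python using the figures assumed above (40 million digitized books, 50,000 to 100,000 words per book, and a rough conversion of about 1.3 tokens per word):

```python
# Rough sanity check of the book-token estimate quoted above.
# Assumptions: 40M digitized books, 50k-100k words each, ~1.3 tokens per word.
BOOKS = 40_000_000
TOKENS_PER_WORD = 1.3

for words_per_book in (50_000, 100_000):
    total_tokens = BOOKS * words_per_book * TOKENS_PER_WORD
    print(f"{words_per_book:,} words/book -> ~{total_tokens / 1e12:.1f}T tokens")

# Prints ~2.6T and ~5.2T tokens, versus the ~15T-token corpora used to train
# current frontier models.
```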
Of course, there are other ways to collect more training data. For example, you could automatically transcribe every YouTube video and use that text. While this might help, it would certainly be of much lower quality than, say, a well-regarded organic chemistry textbook, which is a useful source of knowledge for understanding the world. So, in terms of the original scaling law, we are constantly facing the threat of a "data wall": while we know that we can keep throwing more capital expenditure at GPUs and building more data centers, it is much harder to mass-produce useful new human knowledge that properly complements what already exists. Now, an interesting response is the rise of "synthetic data", where the training text is itself the output of an LLM. While "getting high on your own supply" sounds a bit absurd, improving model quality this way does work very well in practice, at least in mathematics, logic, and computer programming.
Of course, the reason is that these are areas where we can mechanically check and prove the correctness of things. So we can sample enormous numbers of candidate mathematical proofs or Python scripts, actually check whether they are correct, and include only the correct ones in our training data. In this way, we can greatly expand the set of high-quality training data, at least in these areas.
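As a minimal illustration of this verify-then-keep idea (purely a sketch of the general principle, not any lab's actual pipeline; the `solve` convention for generated snippets is a hypothetical choice for the example):

```python
# Sketch: keep only synthetic code samples that pass a mechanical check.
# In a real pipeline the candidates would come from an LLM and be run in a
# proper sandbox; here they are hard-coded strings for illustration.

def passes_check(code: str, test_input, expected) -> bool:
    """Execute a candidate solution in an isolated namespace and test it."""
    namespace = {}
    try:
        exec(code, namespace)                       # caution: sandbox in practice
        return namespace["solve"](test_input) == expected
    except Exception:
        return False

candidates = [
    "def solve(x):\n    return x * 2",              # correct
    "def solve(x):\n    return x + 'oops'",         # raises TypeError
    "def solve(x):\n    return x ** 2",             # wrong answer
]

verified = [c for c in candidates if passes_check(c, 3, 6)]
print(f"kept {len(verified)} of {len(candidates)} synthetic samples")  # kept 1 of 3
```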
In addition to text, we can also use various other data to train AI. For example, what if we took the entire genome sequencing data of 100 million people (the uncompressed data size of one person is about 200GB to 300GB) and used it to train AI? This is obviously a lot of data, even though most of the data is almost exactly the same between two people. Of course, comparisons with text data from books and the internet can be misleading for a variety of reasons:
The raw size of a genome is not directly comparable to a token count
The information content of genomic data is very different from text
The training value of highly redundant data is unclear
The computational requirements for processing genomic data are also different
But it’s still another huge source of information that we can train on in the future, which is why I included it.
So while we can hopefully get more and more additional training data, if you look at the rate at which training corpora have grown in recent years, it seems that we’ll soon hit a bottleneck in the availability of “generally useful” knowledge data, the kind that can help us get closer to our ultimate goal of obtaining an artificial superintelligence that is 10 times smarter than John von Neumann, a world-class expert in every specialized field known to man.
Besides the limited amount of data available, there are other concerns lurking in the minds of proponents of the pre-training scaling law. One of them is: what do you do with all that computing infrastructure once you've finished training your model? Train the next model? Sure, you can do that, but given the rapid improvement in GPU speed and capacity, and the importance of electricity and other operating costs in the economics of compute, does it really make sense to use a 2-year-old cluster to train a new model? Surely you'd rather use the brand new data center you just built, which costs 10 times as much as the old one and has 20 times the performance thanks to more advanced technology. The thing is, at some point you do need to amortize the upfront cost of these investments and recoup it through a (hopefully positive) stream of operating profit, right?
The market has been so excited about AI that it has ignored this, allowing companies like OpenAI to accumulate operating losses from the beginning while earning higher and higher valuations on subsequent investments (of course, to their credit, they also show very rapidly growing revenues). But ultimately, to sustain this over a full market cycle, those datacenter costs need to eventually be recouped, and hopefully profitable, so that over time they can compete on a risk-adjusted basis with other investment opportunities.
New Paradigm
Okay, so that’s the law of pre-training scaling. So what is this “new” scaling law? Well, it’s something that people have only started to pay attention to in the past year: inference-time compute scaling. Before that, the vast majority of compute you spent in the process was the upfront training compute to create the model. Once you have a trained model, running inference on that model (i.e. asking a question or having the LLM do some kind of task for you) only uses a certain amount of compute.
Importantly, the total amount of inference compute (measured in various ways, such as FLOPS, GPU memory usage, etc.) is much less than the amount of compute required during the pre-training phase. Of course, the amount of inference compute does increase as you increase the size of the model’s context window and the amount of output it generates at once (although researchers have made amazing algorithmic improvements in this regard, and originally people expected it to scale quadratically). But basically, until recently, inference compute was typically much less intensive than training compute, and scaled roughly linearly with the number of requests being processed — for example, the more requests for ChatGPT text completion, the more inference compute it consumed.
With the advent of the revolutionary chain-of-thought (COT) models introduced last year, most notably in OpenAI's flagship O1 model (and more recently in DeepSeek's new R1 model, which we'll discuss in more detail later), everything changed. Instead of the inference compute scaling directly with the length of the output text generated by the model (growing with larger context windows, model sizes, etc.), these new COT models generate intermediate "reasoning tokens"; think of them as a kind of scratchpad or "inner monologue" while the model is trying to solve your problem or complete its assigned task.
This represents a real revolution in the way reasoning is done: now, the more tokens you use in this internal thought process, the better the final output quality you provide to the user. In effect, it's like giving a worker more time and resources to complete a task, so that they can check their work over and over again, complete the same basic task in multiple different ways and verify that the results are the same; "plug" the result into the formula to check if it actually solves the equation, etc.
It turns out that this approach works amazingly well; it leverages the long-awaited power of reinforcement learning together with the power of the Transformer architecture. It directly addresses one of the biggest weaknesses of Transformer models: the tendency to "hallucinate".
Basically, the way Transformers work when predicting the next token at each step is that if they start going down a wrong "path" in their initial response, they become almost like a prevaricating child, trying to make up a story to explain why they are actually correct, even though they should use common sense along the way to realize that what they said can't possibly be correct.
Because the model is always trying to be internally consistent and make each successive generated token flow naturally from the previous tokens and context, it is difficult for them to course correct and backtrack. By breaking the reasoning process into many intermediate stages, they can try many different approaches, see which ones work, and keep trying course corrections and trying other approaches until they can reach a fairly high level of confidence that they are not talking nonsense.
What is special about this approach, besides the fact that it works at all, is that the more reasoning/COT tokens you use, the better it works. Suddenly, you have an extra dial to turn: as the number of COT inference tokens increases (which requires more inference compute, in both floating point operations and memory), the higher the probability that you get the right answer, i.e., code that runs without errors the first time, or a solution to a logic problem with no obviously wrong deductive steps.
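One very simple way to see the shape of this trade-off (an illustrative toy, not how O1 or R1 actually allocate test-time compute) is to simulate sampling many independent reasoning chains and majority-voting their final answers; accuracy climbs as you spend more inference compute per question:

```python
import random
from collections import Counter

# Toy simulation: assume one sampled reasoning chain reaches the correct
# answer with probability p. Majority-voting over N independent chains
# ("self-consistency") pushes accuracy toward 1 as N grows, at the cost of
# roughly N times more inference compute per question.

def sample_chain(p: float = 0.6, correct: str = "42") -> str:
    """Stand-in for one chain-of-thought sample from a model."""
    return correct if random.random() < p else str(random.randint(0, 9))

def majority_vote(n_chains: int, p: float = 0.6) -> str:
    answers = [sample_chain(p) for _ in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]

for n in (1, 4, 16, 64):          # spend more inference compute per question
    trials = 2000
    accuracy = sum(majority_vote(n) == "42" for _ in range(trials)) / trials
    print(f"{n:>3} chains -> accuracy ~ {accuracy:.2f}")
```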
I can tell you from a lot of first-hand experience that, although Anthropic's Claude 3.5 Sonnet model is very good at Python programming (it really is very good), it will almost always make one or more stupid mistakes whenever you ask it to generate long, complex code. These mistakes are usually easy to fix; in fact, they can often be fixed simply by feeding back the errors produced by the Python interpreter as a follow-up prompt (or, more practically, by pasting in the complete list of "problems" that a so-called linter in your code editor flags) without any further explanation. When the code gets very long or very complex, fixing it sometimes takes longer and may even require some manual debugging.
The first time I tried OpenAI’s O1 model, it was like a revelation: I was amazed at how well the code worked the first time. That’s because the COT process automatically finds and fixes problems before the final response token in the answer the model gives.
In fact, the O1 model available in OpenAI's ChatGPT Plus subscription ($20 per month) is essentially the same underlying model as the O1-Pro model in the new ChatGPT Pro subscription (which costs 10 times as much, $200 per month, and has caused an uproar in the developer community); the main differences are that O1-Pro thinks for longer before responding, generates far more COT reasoning tokens, and consumes a great deal of inference compute for each response.
This is quite remarkable, because even for Claude 3.5 Sonnet or GPT-4o, even given ~400kb+ of context, a very long and complex prompt typically takes less than 10 seconds to start being answered, and often less than 5 seconds. The same prompt to O1-Pro might take more than 5 minutes to get a response (although OpenAI does show you some of the "reasoning steps" it generates along the way while you wait; importantly, OpenAI has decided, for trade-secret-related reasons, to hide the exact reasoning tokens from you and instead show a highly simplified summary).
As you might imagine, there are many situations where accuracy is critical - you'd rather give up and tell the user you simply can't do it than give an answer that could be easily proven wrong, or one that involves hallucinated facts or other specious reasoning. Anything involving money/transactions, medical, and legal, to name a few.
Basically, in cases where the cost of inference is negligible relative to the full hourly salary of the human knowledge worker interacting with the AI system, invoking the COT calculation becomes a complete no-brainer (the main drawback is that it will greatly increase the latency of the response, so in some cases you may prefer to iterate faster by getting a response with less latency and less accuracy or correctness).
A few weeks ago, there was some exciting news in the AI field involving OpenAI's unreleased O3 model, which was able to solve a range of problems previously thought to be out of reach for existing AI methods in the near term. OpenAI was able to solve these hardest problems (including extremely difficult "foundational" math problems that even very skilled professional mathematicians struggle with) because it threw enormous computing resources at them: in some cases, spending more than $3,000 worth of compute to solve a single task (by comparison, with a regular Transformer model and no chain-of-thought, the inference cost for a single task is unlikely to exceed a few dollars).
It doesn't take an AI genius to realize that this progress creates a whole new scaling law, completely distinct from the original pre-training scaling law. You still want to train the best model you can by cleverly deploying as much compute and as many trillions of high-quality training tokens as possible, but that is just the beginning of the story in this new world; now you can throw an incredible amount of compute at inference alone, either to reach a very high level of confidence in an answer, or to attempt extremely difficult problems that require "genius-level" reasoning to avoid all the potential pitfalls that would lead an ordinary LLM astray.
But why should Nvidia have all the benefits to itself?
Even if you believe, as I do, that the future prospects of AI are almost unimaginably bright, the question remains: "Why should one company capture most of the profits from this technology?" History is full of important new technologies that changed the world, yet the main winners were not the companies that looked most promising in the early stages. The Wright brothers' aircraft company invented and perfected the technology, yet today the businesses it morphed into are worth less than $10 billion combined. And while Ford has a respectable market cap of $40 billion today, that is only about 1.1% of Nvidia's current market cap.
To understand this, you have to really understand why Nvidia has such a large market share. After all, they’re not the only company making GPUs. AMD makes decent GPUs that are comparable in terms of transistor count, process node, etc. to Nvidia’s. Sure, AMD GPUs aren’t as fast or advanced as Nvidia GPUs, but Nvidia GPUs aren’t 10x faster or anything like that. In fact, in terms of raw cost per FLOP, AMD GPUs are only half as expensive as Nvidia GPUs.
Looking at other semiconductor markets, such as the DRAM market, gross margins in the DRAM market are negative at the bottom of the cycle, around 60% at the top of the cycle, and average around 20%, despite the market being highly concentrated with only three global companies (Samsung, Micron, SK-Hynix). Nvidia, by contrast, has had overall gross margins of around 75% in recent quarters, dragged down primarily by the lower-margin and more commoditized consumer 3D graphics category.
So how is this possible? Well, the main reason has to do with software - well-tested and highly reliable drivers that "just work" on Linux (unlike AMD, whose Linux drivers are notorious for being low quality and unstable), and highly optimized open source code, such as PyTorch, that has been tweaked to run well on Nvidia GPUs.
Not only that, but CUDA, the programming framework that programmers use to write low-level code optimized for GPUs, is entirely owned by Nvidia and has become the de facto standard. If you wanted to hire a bunch of extremely talented programmers who knew how to accelerate their work with GPUs, and were willing to pay them $650,000/year, or whatever the going rate is for people with that particular skill set, then chances are they would “think” and work with CUDA.
Besides the software advantage, Nvidia’s other major advantage is what’s called the interconnect — essentially, the bandwidth that efficiently connects thousands of GPUs together so they can be collectively harnessed to train today’s most advanced models. In short, the key to efficient training is keeping all the GPUs fully utilized at all times, rather than sitting idle waiting for the next batch of data needed for the next step of training.
The bandwidth requirements are very high, far higher than the typical bandwidth required for traditional data center applications. This interconnect can’t use traditional networking equipment or fiber optics because they introduce too much latency and can’t deliver the multi-terabytes per second of bandwidth needed to keep all the GPUs constantly busy.
Nvidia’s acquisition of the Israeli company Mellanox for $6.9 billion in 2019 was a very smart decision, and it was this acquisition that provided them with industry-leading interconnect technology. Note that interconnect speed is more closely related to the training process (which must utilize the output of thousands of GPUs simultaneously) than the inference process (including COT inference), which only requires a small number of GPUs - all you need is enough VRAM to store the quantized (compressed) model weights of the trained model.
These are arguably the main components of Nvidia’s “moat” and the reason it has been able to maintain such high profit margins for so long (there is also a “flywheel effect” where they actively reinvest their extraordinary profits into a lot of R&D, which in turn helps them improve their technology faster than their competitors, so they always lead in terms of raw performance).
But as pointed out earlier, all other things being equal, what customers really care about is often performance per dollar (including the upfront capital expenditure cost of the equipment and energy usage, i.e. performance per watt), and while Nvidia's GPUs are indeed the fastest, they are not the most cost-effective when measured purely in FLOPS.
But the problem is that other factors are not equal, AMD's drivers suck, popular AI software libraries don't run well on AMD GPUs, you can't find GPU experts who are really good at AMD GPUs outside of gaming (why would they bother, there's more demand for CUDA experts in the market?), and you can't effectively connect thousands of GPUs together due to AMD's terrible interconnect technology - all of which means that AMD is basically uncompetitive in the high-end data center space and doesn't seem to have a very good future in the short term.
Okay, so Nvidia's future sounds great, right? Now you can see why its stock is valued so highly! But what are the other concerns? Well, I think there are a few that deserve serious attention. Some have been lurking in the background for the last few years, but with growth this fast their impact has so far been minimal; they are poised to matter more. Other issues have emerged only very recently (as in, within the last two weeks) and could dramatically change the near-term trajectory of GPU demand growth.
Major Threats
At a macro level, you can think about it this way: Nvidia has been operating in a very niche space for quite some time; they have very limited competitors, and those competitors are not very profitable or growing fast enough to be a real threat because they don't have enough capital to really put pressure on a market leader like Nvidia. The gaming market is large and growing, but it's not generating amazing profits or particularly amazing year-over-year growth rates.
Around 2016-2017, some large tech companies started increasing their hiring and spending on machine learning and AI, but in the grand scheme of things, it was never a really important project for them - more like "moonshot" R&D spending. But the race to AI really began in 2022 with the release of ChatGPT, and while it’s only been a little over two years, it seems like ages ago in terms of the pace of development.
Suddenly, big companies were ready to invest billions of dollars at an alarming rate. The number of researchers attending large research conferences like Neurips and ICML surged. Smart students who might have previously worked on financial derivatives switched to Transformers, and million-dollar-plus compensation packages for non-executive engineering positions (i.e., independent contributors who don’t manage a team) became the norm at leading AI labs.
It takes a while to change the direction of a large cruise ship; even if you move really fast and spend billions of dollars, it takes a year or more to build a brand new data center, order all the equipment (with extended delivery times), and get everything set up and debugged. Even the smartest programmers take a long time to really get into the swing of things and get familiar with the existing code base and infrastructure.
But you can imagine that the amount of money, manpower, and energy being poured into this field is absolutely astronomical. And Nvidia is the biggest target of all, because it is the player capturing the lion's share of the profits today, not in some hypothetical future where AI runs our lives.
So the bottom line is that "the market finds a way": it will find alternative, radically innovative approaches to building hardware, using completely new ideas to get around the barriers that prop up Nvidia's moat.
Threats at the Hardware Level
For example, Cerebras’ so-called “wafer-scale” AI training chips use an entire 300mm silicon wafer for an absolutely massive chip that contains orders of magnitude more transistors and cores on a single die (see their recent blog post to learn how they solved the yield issues that prevented this approach from being economically practical in the past).
To put this into context, if you compare Cerebras’ latest WSE-3 chip to Nvidia’s flagship datacenter GPU, the H100, the Cerebras chip has a total die area of 46,225 square millimeters, while the H100 is only 814 square millimeters (the H100 itself is a huge chip by industry standards); that’s a multiple of 57 times! Instead of having 132 “streaming multiprocessor” cores enabled on the chip like the H100, the Cerebras chip has about 900,000 cores (granted, each core is smaller and has fewer features, but that’s still a very large number by comparison). Specifically in the field of artificial intelligence, the Cerebras chip has about 32 times the FLOPS of a single H100 chip. Since the H100 chip costs nearly $40,000, it’s no wonder that the WSE-3 chip is not cheap either.
So, what’s the point? Rather than trying to take on Nvidia head-on with a similar approach, or trying to match Mellanox’s interconnect technology, Cerebras is taking a completely new approach to getting around the interconnect problem: when everything runs on the same very large chip, the bandwidth issue between processors becomes less important. You don’t even need the same level of interconnect, because one giant chip can replace tons of H100s.
And the Cerebras chip performs very well on AI inference tasks, too. In fact, you can try it for free today with Meta's well-known Llama-3.3-70B model. It responds essentially instantly, at about 1,500 tokens per second. For comparison, with ChatGPT and Claude, anything above 30 tokens per second feels relatively fast to users, and even 10 tokens per second is fast enough that you can basically read the response as it is being generated.
Cerebras isn't the only such company; there are others, like Groq (not to be confused with the Grok family of models trained by Elon Musk's xAI). Groq takes a different, innovative approach to solving the same basic problem. Rather than trying to compete directly with Nvidia's CUDA software stack, they have developed what they call "tensor streaming processors" (TSPs) that are purpose-built for the precise math that deep learning models require. Their chips are designed around the concept of "deterministic computing," meaning that, unlike traditional GPUs, they perform operations in a completely predictable order and timing every time.
This may sound like a minor technical detail, but it actually has huge implications for both chip design and software development. Because timing is completely deterministic, Groq can optimize its chips in ways that traditional GPU architectures can’t. As a result, for the past 6+ months, they’ve been demonstrating inference speeds of over 500 tokens per second for the Llama family of models and other open source models, far exceeding what can be achieved with traditional GPU setups. Like Cerebras, this is available now, and you can try it for free here.
Using the Llama3 model with "speculative decoding", Groq was able to generate 1,320 tokens per second, comparable to Cerebras and far beyond what regular GPUs achieve. Now, you might ask what the point of reaching 1,000+ tokens per second is when users seem quite happy with ChatGPT's speed (well under 100 tokens per second). In fact, it does matter. With instant feedback, human knowledge workers iterate faster and don't lose focus. And if you use the model programmatically through the API, it enables entirely new categories of applications that require multi-stage reasoning (where the outputs of earlier stages feed the prompts of later stages) or low-latency responses, such as content moderation, fraud detection, dynamic pricing, and so on.
But more fundamentally, the faster you can respond to requests, the faster you can cycle and the busier your hardware can be. While Groq's hardware is very expensive, costing as much as $2-3 million for a single server, if demand is high enough to keep the hardware busy, the cost per completed request can be significantly reduced.
Like Nvidia's CUDA, a large part of Groq's advantage comes from its proprietary software stack. They are able to take open source models that other companies like Meta, DeepSeek, and Mistral have developed and released for free, and break them down in special ways to make them run faster on their specific hardware.
Like Cerebras, they made different technical decisions to optimize certain specific aspects of the process, which allows them to do things in a completely different way. In Groq's case, they focus entirely on computation at the inference level, not training: all of their special hardware and software only delivers huge speed and efficiency advantages when doing inference on models that have already been trained.
But if the next big scaling law everyone is anticipating is inference-time compute, and the biggest drawback of COT models is that all the intermediate reasoning tokens must be generated before responding, producing very high latency, then even a company that only does inference compute, provided its speed and efficiency far exceed Nvidia's, will pose a serious competitive threat over the next few years. At the very least, Cerebras and Groq can eat into the lofty revenue-growth expectations baked into Nvidia's current valuation for the next 2-3 years.
In addition to these particularly innovative but relatively unknown startup competitors, some of Nvidia's largest customers themselves have also brought serious competition, and they have been making custom chips specifically for AI training and inference workloads. The most famous of these is Google, which has been developing its own proprietary TPU since 2016. Interestingly, although Google briefly sold TPUs to external customers, Google has been using all of its TPUs internally for the past few years, and it has launched the sixth generation of TPU hardware.
Amazon is also developing its own custom chips, called Trainium2 and Inferentia2. Amazon is building data centers equipped with billions of dollars of Nvidia GPUs, and at the same time, they are investing billions of dollars in other data centers that use these internal chips. They have a cluster that is coming online for Anthropic that has over 400,000 chips.
Amazon has been criticized for completely screwing up internal AI model development, wasting a lot of internal computing resources on models that ultimately weren't competitive, but custom chips are another matter. Again, they don't necessarily need their chips to be better or faster than Nvidia's. They just need them to be good enough, but to be made at a break-even gross margin, not the ~90%+ gross margin that Nvidia makes on its H100 business.
OpenAI also announced their plans to make custom chips, and they (along with Microsoft) are apparently the largest user of Nvidia's data center hardware. As if that wasn't enough, Microsoft announced its own custom chip!
And Apple, the world’s most valuable technology company, has been subverting expectations for years with a highly innovative and disruptive custom chip business that now thoroughly beats Intel and AMD’s CPUs in terms of performance per watt, which is the most important factor in mobile (phone/tablet/laptop) applications. They have been producing their own in-house designed GPUs and “neural processors” for years, although they have yet to truly prove the usefulness of these chips outside of their custom applications, such as the advanced software-based image processing used in the iPhone camera.
While Apple’s focus seems to be somewhat different from these other players, with its focus on mobile-first, consumer-oriented, and “edge computing”, if Apple ends up investing enough money in its new contract with OpenAI to provide AI services to iPhone users, then you have to imagine that they have teams working on how to make their own custom chips for inference/training (although given their secrecy, you may never know this directly!).
It's no secret now that Nvidia's hyperscaler customer base exhibits a strong power-law distribution, with a handful of top customers accounting for the vast majority of its high-margin revenue. How should we view the future of this business when every one of these VIP customers is building its own custom chips specifically for AI training and inference?
As you ponder these questions, you should remember a very important fact: Nvidia is very much an IP-based company. They don't make their own chips. The truly special secret sauce behind these incredible devices arguably lies more with TSMC, which actually fabricates them, and ASML, which makes the EUV lithography machines TSMC uses for its leading-edge process nodes. This matters because TSMC will sell its most advanced capacity to any customer willing to put up enough upfront investment and guarantee a certain volume. They don't care whether the chips are for Bitcoin mining ASICs, GPUs, TPUs, mobile phone SoCs, or anything else.
Whatever Nvidia's senior chip designers earn per year, these tech giants can certainly offer enough cash and stock to lure some of the best of them away. Once they have the team and the resources, they can design innovative chips within 2-3 years (perhaps not even half as capable as an H100, but with Nvidia's gross margins there is plenty of room to work with), and thanks to TSMC, they can turn those designs into actual silicon on exactly the same process node technology that Nvidia uses.
Software Threats
As if these looming hardware threats weren't bad enough, there have been some developments in the software space over the past few years that, while slow to start, are now gaining momentum and could pose a serious threat to Nvidia's CUDA software dominance. First up are the terrible Linux drivers for AMD GPUs. Remember when we discussed how AMD unwisely allowed these drivers to be so terrible for years while sitting back and losing a ton of money? Interestingly, notorious hacker George Hotz (famous for jailbreaking the original iPhone as a teenager and currently CEO of self-driving startup Comma.ai and AI computer company Tiny Corp, which also developed the open source TinyGrad AI software framework) recently announced that he was tired of dealing with AMD's poor drivers and was eager to be able to use lower-cost AMD GPUs in his TinyBox AI computers (there are multiple models, some of which use Nvidia GPUs and others use AMD GPUs).
In fact, he built his own custom driver and software stack for AMD GPUs without AMD's help; on January 15, 2025, he tweeted from the company's X account: "We are one step away from a fully self-owned AMD RDNA3 stack with our own assembler. We have our own driver, runtime, libraries and simulator. (All in about 12,000 lines!)" Given his track record and skills, they will likely have it all working in the coming months, which would open up many exciting possibilities for using AMD GPUs in applications for which companies currently have to pay up for Nvidia GPUs.
Okay, that’s just one driver for AMD, and it’s not finished yet. What else is there? Well, there are other areas on the software side that have a bigger impact. First, there is now a joint effort by many large tech companies and the open source software community to develop more general AI software frameworks, of which CUDA is just one of many “compile targets”.
That is, you write your software using higher-level abstractions, and the system itself automatically converts these high-level constructs into super-optimized low-level code that runs extremely well on CUDA. But because it's done at this higher-level abstraction layer, it can easily be compiled down to low-level code that runs well on many other GPUs and TPUs from a variety of vendors, such as the large number of custom chips being developed by major tech companies.
The most notable examples of these frameworks are MLX (primarily sponsored by Apple), Triton (primarily sponsored by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that runs efficiently on Apple Silicon, demonstrating how these abstraction layers can enable AI workloads to run on completely different architectures. Meanwhile, Triton is gaining in popularity because it allows developers to write high-performance code that can be compiled to run on a variety of hardware targets without having to understand the low-level details of each platform.
These frameworks allow developers to write code once using powerful abstractions and then have it automatically compiled for a large number of platforms. Sound like a more efficient way to work? It is, and it offers far more flexibility in where the code actually runs.
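To make this concrete, here is what a tiny kernel looks like in OpenAI's Triton, adapted from the style of Triton's introductory tutorials (it needs a supported GPU to actually run). The programmer writes Python-level block logic, and the Triton compiler handles the low-level, hardware-specific code generation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # guard against out-of-bounds reads
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)              # number of program instances
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Nothing in this code is tied to one vendor's instruction set; which hardware it runs on is a question for the compiler backend rather than for the person writing the kernel.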
In the 1980s, all of the most popular, best-selling software was written in hand-tweaked assembly language. For example, the PKZIP compression utility was handcrafted to maximize speed, to the point that a version of the code written in the standard C programming language and compiled with the best optimizing compilers of the time might run only half as fast as the hand-tweaked assembly code. The same was true for other popular software packages like WordStar, VisiCalc, and others.
Over time, compilers became more powerful, and whenever CPU architecture changed (e.g., from Intel releasing the 486 to the Pentium, and so on), handwritten assembly programs usually had to be thrown away and rewritten, and only the smartest programmers could do the job (just like CUDA experts have an edge in the job market over "normal" software developers). Eventually, things converged, and the speed advantage of hand-crafted assembly was greatly outweighed by the flexibility of writing code in a higher-level language like C or C++, which relies on a compiler to make the code run optimally on a given CPU.
These days, few people write new code in assembly. I believe a similar shift will eventually happen to AI training and inference code, for much the same reasons: computers are good at optimization, and flexibility and speed of development become increasingly important factors - especially if it also saves a lot of money on hardware because you don't have to keep paying the "CUDA tax" that generates more than 90% of Nvidia's profits.
However, another area where big changes could happen is that CUDA itself may end up as just a high-level abstraction: a "specification language" similar to Verilog (the industry standard for describing hardware designs) that skilled developers use to describe high-level algorithms involving massive parallelism (because they are already familiar with it, it is well-structured, and it is a general-purpose language), but instead of being compiled for Nvidia GPUs as is standard practice today, the code could be fed as source into an LLM that converts it into whatever low-level code the new Cerebras chip, the new Amazon Trainium2, or the new Google TPUv6 understands. This is not as far away as you might think; with OpenAI's latest O3 model, it may already be within reach, and it will certainly be broadly possible within a year or two.
Theoretical Threats
Perhaps the most shocking development of all happened in the last few weeks. This news rocked the AI world to its core, and while the mainstream media didn’t mention it at all, it became a trending topic among intellectuals on Twitter: a Chinese startup called DeepSeek released two new models that roughly match the performance levels of the best models from OpenAI and Anthropic (surpassing the Meta Llama3 model and other smaller open source models like Mistral). The models are named DeepSeek-V3 (basically a response to GPT-4o and Claude3.5 Sonnet) and DeepSeek-R1 (basically a response to OpenAI’s O1 model).
Why is all this so shocking? First of all, DeepSeek is a small company, reportedly with fewer than 200 employees. The story is that they started out as a quantitative trading hedge fund similar to Two Sigma or RenTec, but after China stepped up regulation of quant trading, they used their math and engineering talent to pivot into AI research. But the fact is that they released two extremely detailed technical reports, for DeepSeek-V3 and DeepSeek-R1.
These are highly technical reports, and if you don't know much linear algebra, they may be hard to follow. But the better thing to try is to download the free DeepSeek app from the App Store, install it, and log in with your Google account (you can also install it on Android), or simply try it in a browser on your desktop. Make sure to select the "DeepThink" option to enable chain-of-thought (the R1 model) and ask it to explain parts of the technical reports in simple language.
This will also tell you a few important things:
First, this model is absolutely legit. There is a lot of fake stuff in AI benchmarks, which are often manipulated to make models perform well in the benchmark but perform poorly in real-world tests. Google is undoubtedly the biggest culprit in this regard, they always brag about how amazing their LLMs are, but in fact, these models perform terribly in real-world tests and cannot reliably complete even the simplest tasks, let alone challenging coding tasks. The DeepSeek model, on the other hand, responds coherently and robustly, putting it squarely on the same level as the OpenAI and Anthropic models.
Second, DeepSeek has made significant progress not only in model quality, but more importantly in model training and inference efficiency. By being very close to the hardware, and by combining some unique and very clever optimizations together, DeepSeek is able to use GPUs to train these incredible models in a way that is dramatically more efficient. By some measurements, DeepSeek is about 45 times more efficient than other cutting-edge models.
DeepSeek claims that the full cost of training DeepSeek-V3 was just over $5 million. This is nothing by the standards of companies like OpenAI, Anthropic, etc., which have reached a level of training costs of over $100 million for a single model as early as 2024.
How is this possible? How is it possible that this small Chinese company could completely outsmart all the smartest people in our leading AI labs that have 100x more resources, headcount, salaries, capital, GPUs, etc? Shouldn’t China be crippled by Biden’s restrictions on GPU exports? OK, the details are pretty technical, but we can at least describe it in general terms. Perhaps it turns out that DeepSeek’s relatively weak GPU processing power was the key factor in their creativity and ingenuity, because “necessity is the mother of invention.”
One major innovation is their advanced mixed-precision training framework, which lets them use 8-bit floating point (FP8) throughout the training process. Most Western AI labs train using "full-precision" 32-bit numbers (this basically specifies the number of gradations possible when describing the output of an artificial neuron; the 8 bits of FP8 can cover a wider range of numbers than you might expect: they are not limited to 256 equally spaced values like a regular 8-bit integer, but use clever mathematical tricks to represent both very small and very large numbers, just with less natural precision than 32 bits). The main trade-off is that while FP32 can store numbers with amazing precision across an enormous range, FP8 sacrifices some of that precision to save memory and boost performance, while still retaining enough precision for many AI workloads.
DeepSeek solved this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision computations at key points in the network. Unlike other labs that train at high precision first and then compress afterwards (losing some quality in the process), DeepSeek's FP8-native approach means they get the memory savings without hurting performance. When you're training across thousands of GPUs, the dramatic reduction in memory required per GPU means you need far fewer GPUs overall.
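As a rough illustration of the per-block scaling idea (a toy sketch of the general technique, not DeepSeek's actual kernels; it assumes a PyTorch build with float8 dtypes, version 2.1 or later):

```python
import torch

# Illustrative per-block FP8 (E4M3) quantization with one scale per block:
# the general idea behind fine-grained low-precision storage. Shows the
# memory/precision trade-off, nothing more.

FP8_MAX = 448.0                      # largest finite value of float8_e4m3fn
BLOCK = 128

def quantize_blockwise(x: torch.Tensor):
    blocks = x.reshape(-1, BLOCK)
    scale = blocks.abs().amax(dim=1, keepdim=True) / FP8_MAX   # per-block scale
    scale = scale.clamp(min=1e-12)
    q = (blocks / scale).to(torch.float8_e4m3fn)               # 1 byte per value
    return q, scale

def dequantize_blockwise(q, scale, shape):
    return (q.to(torch.float32) * scale).reshape(shape)

x = torch.randn(4096) * 3
q, s = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s, x.shape)
print("bytes: fp32", x.numel() * 4, "-> fp8 + scales", q.numel() + s.numel() * 4)
print("max abs error:", (x - x_hat).abs().max().item())
```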
Another big breakthrough is their multi-token prediction system. Most Transformer-based LLMs do inference by predicting the next token, one token at a time.
DeepSeek figured out how to predict multiple tokens ahead while maintaining the quality you get from single-token prediction. Their approach achieves about 85-90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The neat thing is that they preserve the full causal chain of predictions, so the model isn't just guessing; it is making structured, contextually grounded predictions.
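In sketch form, the training-time idea is to add an auxiliary head that predicts a token further ahead, alongside the normal next-token head (the dimensions, the 0.3 loss weight, and the single extra head below are illustrative assumptions, not DeepSeek's published configuration):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a multi-token prediction objective: besides the usual
# next-token head, an extra head is trained to predict the token two positions
# ahead. At inference, the extra head's guesses can be accepted when they agree
# with the main model, giving speculative-decoding-style speedups.

vocab, seq, hidden = 1000, 16, 64
h = torch.randn(seq, hidden)                    # hidden states from a toy "trunk"
head_next = torch.nn.Linear(hidden, vocab)      # predicts token t+1
head_next2 = torch.nn.Linear(hidden, vocab)     # predicts token t+2
targets = torch.randint(0, vocab, (seq + 2,))   # toy token stream

loss_next = F.cross_entropy(head_next(h), targets[1:seq + 1])
loss_next2 = F.cross_entropy(head_next2(h), targets[2:seq + 2])
loss = loss_next + 0.3 * loss_next2             # weighted auxiliary MTP loss
loss.backward()
print(float(loss))
```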
One of their most innovative developments is what they call Multi-Head Latent Attention (MLA). This is their breakthrough in handling the key-value (KV) cache, which is basically how individual tokens are represented inside the attention mechanism of the Transformer architecture. Without getting too technical, suffice it to say that the KV cache is one of the main consumers of VRAM during training and inference, and part of the reason you need thousands of GPUs at once to train these models: each GPU has at most 96GB of VRAM, and the KV cache eats that memory up quickly.
Their MLA system found a way to store a compressed version of the KV cache that uses far less memory while still capturing the essential information. The best part is that this compression is built directly into how the model learns; it is not some separate step, but part of the end-to-end training pipeline. This means the whole mechanism is "differentiable" and can be trained directly with standard optimizers. It works because the underlying representations these models end up learning lie far below the so-called "ambient dimension". Storing the full KV cache is therefore wasteful, even though that is essentially what everyone else does.
Not only does this avoid wasting huge amounts of space storing more data than you actually need, which leads to a big improvement in training memory footprint and efficiency (again, drastically cutting the number of GPUs required to train a world-class model), but it can actually improve model quality, because it acts as a "regularizer" that forces the model to focus on what really matters instead of spending capacity fitting noise in the training data. So not only do you save a lot of memory, the model may even perform better. At the very least, you don't pay a severe performance penalty for the huge memory savings, which is usually the trade-off you face in AI training.
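A rough sketch of the compression idea (illustrative dimensions and layers only; DeepSeek's actual MLA design, including how it handles positional encodings, is more involved): cache one low-rank latent vector per token instead of full per-head keys and values, and expand it on the fly with learned up-projections.

```python
import torch

# Toy KV-cache compression via a low-rank latent, in the spirit of MLA.

d_model, n_heads, d_head, d_latent, seq = 4096, 32, 128, 512, 2048

down = torch.nn.Linear(d_model, d_latent, bias=False)           # compress
up_k = torch.nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
up_v = torch.nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values

h = torch.randn(seq, d_model)
latent_cache = down(h)                        # this is all that gets cached

full_kv_floats = 2 * seq * n_heads * d_head   # what a standard KV cache stores
mla_floats = seq * d_latent
print(f"cache size ratio: {mla_floats / full_kv_floats:.2f}x")  # ~0.06x here

k = up_k(latent_cache).view(seq, n_heads, d_head)  # reconstructed when needed
v = up_v(latent_cache).view(seq, n_heads, d_head)
```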
They also made major advances in GPU communication efficiency with the DualPipe algorithm and custom communication kernels. The system intelligently overlaps computation and communication, carefully balancing GPU resources between tasks. They only need about 20 of the GPU's streaming multiprocessors (SMs) for communication, and the rest are used for computation. The result is GPU utilization that is much higher than a typical training setup.
Another very smart thing they did was use what’s called a mixture of experts (MOE) Transformer architecture, but with key innovations around load balancing. As you may know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some property of the model; for example, the “weight” or importance of a particular artificial neuron relative to another neuron, or the importance of a particular token based on its context (in an “attention mechanism”), etc.
Meta’s latest Llama3 model comes in several sizes, such as a 1B parameter version (the smallest), a 70B parameter model (the most commonly used), and even a large model with 405B parameters. For most users, this largest model has limited practicality, as your computer would need to be equipped with a GPU worth tens of thousands of dollars to run inference at an acceptable speed, at least if you deploy the original full-precision version. As a result, most of the real-world use and excitement about these open source models is at the 8B parameter or highly quantized 70B parameter level, because that’s what a consumer-grade Nvidia 4090 GPU can accommodate, which you can buy for less than $1,000 these days.
So, what’s the point? In a sense, the number and precision of parameters tells you how much raw information or data is stored inside the model. Note that I’m not talking about reasoning ability, or the “IQ” of the model: it turns out that even models with a small number of parameters can demonstrate remarkable cognitive abilities in solving complex logic problems, proving theorems in plane geometry, SAT math problems, and so on.
But those small models won’t necessarily be able to tell you every aspect of every plot twist in every Stendhal novel, while the really big models will likely be able to do that. The “price” of this extreme level of knowledge is that the model becomes very cumbersome and difficult to train and reason about, because in order to reason about the model, you always need to store every one of the 405B parameters (or whatever the number of parameters is) in the GPU’s VRAM at the same time.
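The arithmetic that makes the biggest models so unwieldy is simple (a rough weights-only estimate; the KV cache and activations come on top of this):

```python
# Rough VRAM needed just to hold the weights, ignoring KV cache and activations.
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name}: {weight_gb(params, 2):>6.0f} GB at 16-bit, "
          f"{weight_gb(params, 0.5):>5.0f} GB at 4-bit")

# An 8B model fits comfortably on a 24GB RTX 4090 even at 16-bit; a 70B model
# needs aggressive quantization and is still a squeeze; a 405B model is out of
# reach for consumer hardware either way.
```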
The advantage of the MOE approach is that you can decompose one large model into a collection of smaller expert models, each holding different, non-overlapping (or at least not fully overlapping) knowledge. DeepSeek's innovation here is a load-balancing strategy they call "auxiliary-loss-free" load balancing, which keeps the experts efficiently utilized without the performance degradation that load balancing usually introduces. Then, based on the nature of an inference request, you can intelligently route it to whichever "expert" in the collection is best able to answer the question or solve the task.
You can think of it as a committee of experts who have their own areas of expertise: one might be a legal expert, another might be a computer science expert, and another might be a business strategy expert. So if someone asks a linear algebra question, you don’t give it to the legal expert. Of course, this is just a very rough analogy, it doesn’t really work like this.
The real advantage of this approach is that it allows the model to contain a lot of knowledge without being very unwieldy, because even if the total number of parameters across all experts is high, only a small fraction of the parameters are “active” at any given time, which means you only need to store a small subset of the weights in VRAM for inference. Take DeepSeek-V3, for example, which has an absolutely massive MOE model with 671B parameters, much larger than the largest Llama3 model, but only 37B of them are active at any given time—enough to fit in the VRAM of two consumer-grade Nvidia 4090 GPUs (which cost less than $2,000 in total), rather than one or more H100 GPUs, which cost around $40,000 each.
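Here is a toy version of top-k expert routing to make the "active parameters" idea concrete (the router, expert sizes, and top-k value below are illustrative; DeepSeek-V3's actual architecture and its auxiliary-loss-free balancing scheme are considerably more sophisticated):

```python
import torch

# Toy top-k MoE layer: only the selected experts' parameters do work for a
# given token, which is why active parameters can be a small fraction of
# total parameters.

d_model, n_experts, top_k = 256, 16, 2
router = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

def moe_forward(x):                              # x: (tokens, d_model)
    scores = router(x).softmax(dim=-1)
    weights, idx = scores.topk(top_k, dim=-1)    # pick 2 of 16 experts per token
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out

y = moe_forward(torch.randn(8, d_model))
print(y.shape, f"active experts per token: {top_k}/{n_experts}")
```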
ChatGPT and Claude are both rumored to use the MoE architecture, and it has been revealed that GPT-4 has a total of 1.8 trillion parameters, spread across 8 models, each containing 220 billion parameters. While this is much easier than fitting all 1.8 trillion parameters into VRAM, the sheer amount of memory used requires multiple H100-class GPUs just to run the model.
In addition to the above, the technical paper mentions several other key optimizations. These include its extremely memory-efficient training framework, which avoids tensor parallelism, recomputes certain operations during backpropagation instead of storing them, and shares parameters between the main model and auxiliary prediction modules. The sum of all these innovations, when layered together, leads to the ~45x efficiency improvement numbers floating around online, which I’m fully willing to believe are correct.
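One of those ideas, recomputing activations during backpropagation instead of storing them, can be illustrated with PyTorch's generic checkpointing utility (DeepSeek's training framework implements its own version of this and the other tricks; this only shows the underlying memory-for-compute trade):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Activation checkpointing: the intermediate activations inside `layer` are
# not kept for the backward pass; they are recomputed during backprop,
# trading extra FLOPs for a smaller memory footprint.

layer = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(32, 1024, requires_grad=True)

y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)
```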
DeepSeek’s API pricing makes the point vividly: despite the model’s performance being nearly best-in-class, inference requests through its API cost about 95% less than comparable models from OpenAI and Anthropic. In a sense, this is a bit like comparing Nvidia’s GPUs to competitors’ new custom chips: even if they’re not quite as good, they’re far more cost-effective, so as long as you can benchmark the performance and show that it’s good enough for your requirements, and the API availability and latency hold up, the cheaper option wins. And so far, people have been pleasantly surprised by how well DeepSeek’s infrastructure has held up despite the incredible surge in demand driven by these new models’ performance.
But unlike Nvidia, whose cost differential comes from the 90%+ monopoly gross margins it earns on its data center products, the DeepSeek API’s cost differential relative to the OpenAI and Anthropic APIs is probably explained simply by the fact that DeepSeek is something like 45x more computationally efficient on the training side (and probably even more than that on the inference side). In fact, it’s not clear that OpenAI and Anthropic make much money from their API services at all; they’re probably more focused on growing revenue and collecting more data from analyzing all the API requests they receive.
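To see why a 95% price cut need not imply selling below cost, here is a toy back-of-the-envelope calculation. Every number in it is hypothetical, chosen only to illustrate the shape of the argument, not an actual vendor price or cost:

```python
# All numbers are hypothetical, for illustration only.
incumbent_price = 10.00                     # $ per million tokens, hypothetical list price
incumbent_compute_cost = 2.00               # hypothetical compute cost per million tokens

deepseek_price = incumbent_price * 0.05     # "95% cheaper" API pricing
deepseek_compute_cost = incumbent_compute_cost / 45   # ~45x compute efficiency

print(f"Incumbent margin per Mtok:        ${incumbent_price - incumbent_compute_cost:.2f}")
print(f"Efficient vendor margin per Mtok: ${deepseek_price - deepseek_compute_cost:.2f}")
# Even at one-twentieth of the price, a ~45x efficiency edge can leave the unit
# economics positive, so the price gap can be explained by efficiency rather
# than by deliberately subsidized pricing.
```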
Before I go on, I must point out that many people have speculated that DeepSeek is lying about the number of GPUs and the amount of GPU time spent training these models, the theory being that they actually have far more H100s than they admit but can't say so because of the export restrictions on those cards, and because they don't want to get themselves in trouble or hurt their chances of obtaining more of them. While that's certainly possible, I think it's more likely that they're telling the truth and achieved these incredible results through sheer ingenuity and creativity in their training and inference methods. They explained how they did it, and I suspect it's only a matter of time before their results are widely replicated and confirmed by researchers at other labs.
Models That Really Think
The newer R1 model and its technical report may be even more astounding: DeepSeek beat Anthropic to chain-of-thought reasoning and is now basically the only lab besides OpenAI that has made this technology work at scale. But note that OpenAI only released its o1-preview model in mid-September 2024. That was only about four months ago! One thing you have to keep in mind is that OpenAI is very secretive about how these models actually work at a low level and won't share the actual model weights with anyone except partners like Microsoft who have signed strict NDAs. DeepSeek's models are completely different: they are fully open source and permissively licensed. DeepSeek publishes very detailed technical reports explaining how the models work and provides code that anyone can review and try to replicate.
With R1, DeepSeek basically cracked one of the hard problems in AI: getting models to reason step by step without relying on large supervised datasets. Their DeepSeek-R1-Zero experiment shows this: using pure reinforcement learning with a carefully designed reward function, they got the model to develop sophisticated reasoning capabilities completely autonomously. It's not just problem solving: the model organically learned to generate long chains of thought, verify its own work, and allocate more computational effort to harder problems.
The technical breakthrough here is their novel approach to reward modeling. Rather than using a complex neural reward model, which can lead to “reward hacking” (where the model finds spurious ways to increase its reward without actually improving its true performance), they developed a clever rule-based system that combines a reward for accuracy (which validates the final answer) with a reward for format (which encourages structured thinking). This simpler approach proved to be more powerful and scalable than the process-based reward models that others had tried.
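A minimal sketch of what such a rule-based reward might look like is below. The tag format, the answer-matching logic, and the weights are my own illustrative assumptions, not DeepSeek's published implementation:

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: an accuracy term (does the final answer match a
    known reference?) plus a format term (did the model wrap its reasoning and
    answer in the expected tags?). Weights and tags are illustrative."""
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                               completion, flags=re.DOTALL))
    format_reward = 0.5 if format_ok else 0.0

    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if predicted == reference_answer.strip() else 0.0

    return accuracy_reward + format_reward

sample = "<think>3 * 4 = 12, minus 5 is 7</think> <answer>7</answer>"
print(rule_based_reward(sample, "7"))   # 1.5: correct answer, correct format
```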
Particularly fascinating is that, during training, they observed so-called “aha moments,” where the model spontaneously learned to revise its chain of thought mid-stream when it hit uncertainty. This emergent behavior wasn’t pre-programmed; it arose naturally from the model’s interaction with the reinforcement learning environment. The model would literally stop, flag a potential problem in its reasoning, and start over with a different approach, all without having been explicitly trained to do so.
The full R1 model builds on these insights by introducing what they call “cold start” data, a small set of high-quality examples, before applying their reinforcement learning techniques. They also tackled one of the big problems with reasoning models: language consistency. Earlier attempts at chain-of-thought reasoning often produced models that mixed multiple languages or generated incoherent output. DeepSeek addresses this by rewarding language consistency during RL training, accepting a small performance penalty in exchange for far more readable and consistent output.
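Extending the toy reward above, one simple way to operationalize a language-consistency term is to measure how much of the reasoning trace stays in the target language and fold that in with a small weight. The ASCII-ratio heuristic and the weight below are purely illustrative stand-ins, not what DeepSeek actually does:

```python
def language_consistency_reward(thought: str, weight: float = 0.2) -> float:
    """Crude proxy: the fraction of characters in the reasoning trace that are
    plain ASCII, used here as a stand-in for 'stayed in English'. A real system
    would use a proper language-identification model instead."""
    if not thought:
        return 0.0
    ascii_ratio = sum(ch.isascii() for ch in thought) / len(thought)
    return weight * ascii_ratio

# Folded into the total reward alongside the accuracy and format terms, e.g.:
# total = accuracy_reward + format_reward + language_consistency_reward(thought)
```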
The results are incredible: on AIME 2024 (one of the most challenging high school math competitions), R1 achieved 79.8% accuracy, comparable to OpenAI’s o1 model. On MATH-500 it reached 97.3%, and on Codeforces programming problems it performed at roughly the 96.3rd percentile of human competitors. But perhaps most impressive is that they managed to distill these capabilities into much smaller models: their 14B parameter version outperforms many models several times its size, showing that reasoning ability is not just about raw parameter count but also about how you train the model to process information.
Aftermath
The recent gossip circulating on Twitter and Blind (an anonymous corporate gossip app) is that these models caught Meta completely off guard, and that they outperform even the new Llama4 model that is still being trained. Apparently the Llama effort inside Meta has drawn so much attention from top executives that roughly 13 senior people work on it, each of whose total annual compensation exceeds the entire training cost of the DeepSeek-V3 model, which performs better than Llama. How do you explain that to Zuckerberg with a straight face? How does he keep smiling while shoveling billions of dollars at Nvidia for 100,000 H100s, when a better model was trained with roughly 2,000 GPUs for less than $5 million?
But you had better believe that Meta and every other large AI lab is taking these DeepSeek models apart, studying every word of the technical reports and every line of the open source code they released, desperately trying to fold the same tricks and optimizations into their own training and inference pipelines. So, what is the impact of all this? Well, it doesn't seem crazy to think that total training and inference compute requirements should be divided by some large number. Maybe not by 45, but perhaps by 25 or even 30? Because whatever you thought you needed before, it's now a lot less.
An optimist might say, “You’re just talking about a simple constant of proportionality, a single multiple. When you’re dealing with an exponential growth curve, this stuff will vanish very quickly and won’t matter that much in the end.” And there is some truth to that: if AI is as transformative as I expect it to be, if the actual utility of this technology is measured in trillions, if inference time computation is the new scaling law, and if we’re going to have a lot of humanoid robots that are constantly doing a lot of inference, then maybe the growth curve is still very steep and extreme, and Nvidia is still so far ahead that it will still succeed.
But Nvidia needs a lot of good news over the next few years to sustain its valuation, and when you factor all of this in, I start to feel very uncomfortable paying at least 20 times estimated 2025 sales for the stock. What happens if sales growth slows a bit? What if growth comes in at 85% instead of 100%+? What if gross margins slip from 75% to 70%, which would still be remarkably high for a semiconductor company?
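To put rough numbers on that discomfort, here is a toy sensitivity check. The revenue index, growth rates, and margins are simply the hypothetical figures from the questions above, not forecasts:

```python
# Toy sensitivity check (hypothetical round numbers, not forecasts).
revenue_now = 100.0   # index current-year sales to 100

bull_gross_profit = revenue_now * 2.00 * 0.75   # 100% growth, 75% gross margin
bear_gross_profit = revenue_now * 1.85 * 0.70   # 85% growth, 70% gross margin

shortfall = 1 - bear_gross_profit / bull_gross_profit
print(f"Bull-case gross profit: {bull_gross_profit:.0f}")   # 150
print(f"Bear-case gross profit: {bear_gross_profit:.0f}")   # ~130
print(f"Shortfall vs. the priced-in scenario: {shortfall:.0%}")  # ~14%
```

Even in the "softer" scenario the business is still growing explosively; the point is that a stock priced for perfection has little room for even a ~14% miss on gross profit.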
Summary
At a macro level, Nvidia faces unprecedented competitive threats, making it increasingly difficult to justify its high valuation at 20x forward sales and 75% gross margins. The company’s strengths in hardware, software, and efficiency are all showing worrying cracks. The world—tens of thousands of the smartest people on the planet, backed by untold billions of dollars in capital resources—is trying to attack them from every angle.
On the hardware side, Cerebras and Groq’s innovative architectures show that Nvidia’s interconnect advantage (the cornerstone of its data center dominance) can be circumvented through a complete redesign. Cerebras’ wafer-scale chips and Groq’s deterministic computing approach deliver compelling performance without Nvidia’s complex interconnect solutions. More traditionally, each of Nvidia’s major customers (Google, Amazon, Microsoft, Meta, Apple) is developing custom chips that could cannibalize high-margin data center revenues. These are no longer experimental projects—Amazon alone is building a massive infrastructure for Anthropic with over 400,000 custom chips.
Software moats appear equally fragile. New high-level frameworks such as MLX, Triton, and JAX are eroding CUDA’s importance, while efforts to improve AMD’s drivers could unlock much cheaper hardware alternatives. The trend toward higher-level abstractions mirrors the way assembly language gave way to C/C++, suggesting that CUDA’s dominance may be more fleeting than assumed. On top of that, we’re seeing the rise of LLM-based code translation that can automatically port CUDA code to run on any hardware target, potentially dissolving one of Nvidia’s strongest lock-in effects.
Perhaps most disruptive of all is DeepSeek’s recent breakthrough in efficiency, which achieved comparable model performance at about 1/45 the computational cost. This suggests that the industry as a whole has been massively over-provisioning compute resources. Coupled with the emergence of more efficient