The NVIDIA AI Moat That Isn’t

Jul 01, 2026

What LLMs (Large Language Models) Actually Need

The core component of the AI that is setting the world on fire at the moment is the LLM.

Strip away the mystique and it is a simple program. It multiplies matrices against an astronomical amount of data, then does it again, billions of times.

The chip in your phone is running an operating system, a cellular modem, a camera pipeline, a GPU rendering a 3D interface, audio codecs, secure enclaves, neural accelerators for face ID, and a dozen background services, all while juggling interrupts and power states across them in real time. Measured by variety of tasks and control flow rather than raw arithmetic, what your phone does every second is orders of magnitude more complicated than what an LLM does.

What an LLM does is bigger, not more complicated. The workload is embarrassingly parallel matrix math on a scale nothing else in computing has ever required. That’s it. The magic isn’t in the operation. It’s in the amount.
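To make "bigger, not more complicated" concrete, here is a toy sketch in plain Python. It is an illustration only, not how any production system is written (real stacks use heavily tuned kernels), and the function names are my own. The thing to notice is that the loop bounds depend only on the shapes of the matrices, never on the values inside them.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    # Three nested loops with fixed bounds: no branch depends on the data.
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out


def matmul_flops(rows, inner, cols):
    """Count arithmetic: one multiply and one add per innermost step."""
    return 2 * rows * inner * cols


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, 0.0], [0.0, 0.5]]
y = matmul(x, w)
print(y)                      # [[0.5, 1.0], [1.5, 2.0]]
print(matmul_flops(2, 2, 2))  # 16
```

Scale the three dimensions into the thousands and repeat the call billions of times, and the program has not become any more complicated. Only the operation count has grown.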

This matters because a moat only protects you if it is hard to step over or go around. A workload this simple, this uniform, this well-understood, does not naturally support a monopoly on the silicon that runs it. You don’t need NVIDIA to multiply matrices. You need someone’s silicon to multiply matrices, at scale, with acceptable software support.

The question of who that someone has to be — and for how long — is the actual question.

The Architecture LLMs Actually Need

Because the workload is simple and uniform, the hardware to run it can be simple and uniform. A statically scheduled machine — one where the compiler decides in advance what happens on every cycle, and the hardware just executes — is enough. No branch prediction, no out-of-order execution, no speculative anything. The compiler knows the shape of the computation. It lays it out. The chip runs it.
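A rough sketch of that split, in plain Python. Everything here is hypothetical: `compile_matmul`, `execute`, and the two-lane "chip" are names and numbers I made up for illustration, and no real TPU or LPU toolchain looks like this. The "compiler" sees only the shapes and emits a cycle-by-cycle plan. The "hardware" replays the plan and makes no decisions of its own.

```python
def compile_matmul(rows, inner, cols, lanes=2):
    """Hypothetical compiler: place every multiply-accumulate on a cycle and lane.

    Only the shapes are inspected, never the input values.
    """
    outputs = [(i, j) for i in range(rows) for j in range(cols)]
    schedule = []
    cycle = 0
    for start in range(0, len(outputs), lanes):
        # Each lane owns one output element, so lanes never share an accumulator.
        group = outputs[start:start + lanes]
        for k in range(inner):
            for lane, (i, j) in enumerate(group):
                schedule.append((cycle, lane, i, j, k))
            cycle += 1
    return schedule


def execute(schedule, a, b, rows, cols):
    """Hypothetical hardware: replay the schedule in order, with no branching on data."""
    out = [[0.0] * cols for _ in range(rows)]
    for _cycle, _lane, i, j, k in schedule:
        out[i][j] += a[i][k] * b[k][j]
    return out


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, 0.0], [0.0, 0.5]]
schedule = compile_matmul(rows=2, inner=2, cols=2)
total_cycles = schedule[-1][0] + 1  # known before any data arrives
result = execute(schedule, x, w, rows=2, cols=2)
print(total_cycles, result)
```

The cycle count is fixed at compile time, and compiling the same shapes twice yields the same plan. That determinism is what lets the silicon drop the machinery a GPU carries for deciding things at run time.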

For a workload this predictable, a statically scheduled chip is dramatically more efficient in power and performance than a GPU. NVIDIA’s chips are general-purpose parallel machines carrying decades of accumulated flexibility: graphics pipelines, gaming, scientific computing, sparse workloads, dynamic control flow. All of that costs silicon and watts. For matrix multiplication at scale, most of it is dead weight.

There’s a piece of folklore that static scheduling works for inference but not for training. It isn’t true. Google’s TPU is a statically scheduled machine. Google trained Gemini 3 end-to-end on TPUs and runs Search, Photos, Maps, and everything else on the same hardware. If static scheduling couldn’t handle training, Google would have discovered that at some point across a decade of shipping the world’s largest models on it.

The architecture the workload demands is simpler than what NVIDIA sells. It’s also cheaper to build, cheaper to run, and easy to design around. The general-purpose GPU is not the natural fit for this problem. It’s the incumbent that happened to be there when the wave arrived.

What NVIDIA Just Paid to Admit

In December 2025, NVIDIA paid $20 billion for Groq — its largest deal ever. Groq is a statically scheduled machine designed by Jonathan Ross, an ex-Google engineer who was on the original TPU team. The chip is built on the same principle as the TPU: the compiler lays out every operation in advance, and the hardware executes deterministically. No branch prediction. No dynamic scheduling. Silicon spent on math, not on hedging.

The deal was structured as a non-exclusive license plus a mass hire — a “reverse acqui-hire” designed to slip under Hart-Scott-Rodino review. Senators Warren and Blumenthal opened a Senate inquiry into the structure in March. Jensen Huang, at GTC 2026, framed Groq as completing NVIDIA’s GPU rather than competing with it — the same story he told about Mellanox. NVIDIA claims the combined system delivers up to 35× better tokens per watt than Blackwell alone.

Read that again. Thirty-five times better tokens per watt. That is not a companion. That is the admission.

The general-purpose GPU is not what the workload wants. The industry knows it. NVIDIA knows it — which is why they paid $20 billion to own the architecture that isn’t theirs.

They Don’t Know How to Build One

NVIDIA has no institutional experience with statically scheduled machines. Their entire history is dynamically scheduled parallel hardware — the GPU. Compiler-and-hardware co-design is a different engineering discipline, and it is not one you hire in a quarter.

Google has been shipping statically scheduled machines for roughly a decade, across seven TPU generations, with hundreds to perhaps a thousand engineers steeped in the discipline. That is only the public history — production silicon of that scope rarely appears without earlier internal experiments that never see daylight, and the TPU is almost certainly no exception. NVIDIA bought Groq. Groq is one branch of the TPU family tree, staffed by people who came out of Google. The rest of that tree is still at Google, or spread across the other statically scheduled programs in the industry. NVIDIA acquired one offshoot, not the discipline.

The Field NVIDIA Is Trying to Catch

Statically scheduled machines are the future of AI silicon. In Google’s case they are already the present, and have been for a decade. NVIDIA has no moat here. They came into this architecture late, and they are playing catch-up by buying someone else’s technology.

The field they are catching:

Hyperscaler in-house silicon. Google TPU (statically scheduled, powers Gemini and all of Google’s AI-serving infrastructure). Amazon Trainium and Inferentia. Meta MTIA. Microsoft Maia (architectural details less public, but designed in the same systolic-array family). Alibaba Hanguang, Baidu Kunlun, and Huawei Ascend on the Chinese side. Every one of these is a statically scheduled machine shipping at scale. NVIDIA has no moat in any of them.

Independent chip companies. Groq — now NVIDIA’s, but only after paying $20 billion for what NVIDIA didn’t build. MatX, compiler-first for LLMs. SiMa.ai at the edge. Intel’s Habana Gaudi, VLIW tensor cores, compiler-scheduled. Etched, transformer pipeline baked into silicon.

NVIDIA’s response to this field is not a superior general-purpose GPU. It is a Groq LPU sitting next to a Vera Rubin GPU in the same rack, with Huang selling the pairing as a completion story. The completion story is what companies tell when they have to buy the thing they should have built. It also rests on a false premise — that a GPU is still required for training, or for some class of workloads a statically scheduled machine can’t handle. The TPU has been doing all of it for a decade, without a GPU in sight.

The market is pricing NVIDIA as if the GPU wins the workload. The industry is spending tens of billions of dollars to make sure it doesn’t.

Four Trillion Dollars Buys You Competitors

A $4.66 trillion valuation is not a fortress. It is a signal flare. It tells every competent chip team in the world that the incumbent is charging monopoly prices for a workload that does not require monopoly technology. Capital, talent, and hyperscaler purchase orders follow that signal.

The switching cost story — the CUDA moat — is real for the long tail of legacy scientific computing, graphics, and mixed workloads. For LLMs it is thinner than NVIDIA’s price implies. Anthropic already runs across NVIDIA, Google TPU, and Amazon Trainium. OpenAI has a Cerebras deal for 750 megawatts. Every serious AI shop is multi-vendor by policy, because single-vendor dependency at these dollar amounts is malpractice.

The technical barrier to adopting a new chip is not what NVIDIA’s bulls think it is. You do not need a full PyTorch stack. You do not need decades of CUDA libraries. You need enough software to run your specific models — a compiler that lowers your graph to the target, a runtime, and a handful of kernels for the operations you actually use. That is a real engineering effort. It is not a decade-long moon shot. Everyone on the list above has done it, or is doing it, right now.
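Here is the shape of that minimum stack as a toy sketch in plain Python. It assumes a made-up three-operation model and a made-up kernel table. `GRAPH`, `KERNELS`, `lower`, and `run` are hypothetical names of mine, not any vendor’s API, and a real compiler does vastly more (tiling, memory planning, numerics). The point is only the division of labor: a handful of kernels, a compiler that flattens the graph, and a runtime that walks the result.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def relu(a):
    return [[max(0.0, x) for x in row] for row in a]

# The "handful of kernels": only the operations this model actually uses.
KERNELS = {"matmul": matmul, "add": add, "relu": relu}

# A made-up model graph: value name -> (operation, input names).
GRAPH = {
    "h": ("matmul", ["x", "w"]),
    "z": ("add", ["h", "bias"]),
    "y": ("relu", ["z"]),
}

def lower(graph, output):
    """Toy compiler: flatten the graph into an ordered instruction list."""
    program, done = [], set()

    def visit(name):
        if name in done or name not in graph:
            return  # already emitted, or a graph input supplied at run time
        op, inputs = graph[name]
        if op not in KERNELS:
            raise NotImplementedError(f"no kernel for {op}")
        for dep in inputs:
            visit(dep)
        program.append((name, op, tuple(inputs)))
        done.add(name)

    visit(output)
    return program

def run(program, feeds):
    """Toy runtime: execute instructions in order against the kernel table."""
    values = dict(feeds)
    for name, op, inputs in program:
        values[name] = KERNELS[op](*(values[i] for i in inputs))
    return values

program = lower(GRAPH, "y")
result = run(program, {"x": [[1.0, -2.0]],
                       "w": [[1.0, 0.0], [0.0, 1.0]],
                       "bias": [[0.5, 0.5]]})
print(result["y"])  # [[1.5, 0.0]]
```

Supporting a new model means adding whatever kernels it needs to the table, not reimplementing an entire framework.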

The workload is matrix multiplication. The bar for competing is: can your chip do the matrix multiplies at competitive tokens per watt, and can your compiler get the customer’s models onto it. That is the bar. That is not a $4.66 trillion bar.

Other Mousetraps

Statically scheduled machines are the sharpest bet against the GPU, but not the only one. There are more dynamic non-GPU architectures — dataflow, reconfigurable fabrics, wafer-scale integration — and a growing bench of special-purpose designs targeting specific model families. Beyond silicon there are photonics, in-memory compute, and other physical substrates being funded and taped out. The Korean programs, Rebellions and FuriosaAI, sit in this second front. The moat is being attacked from more than one direction.

Good Enough Beats Best

Hyperscalers do not buy silicon the way a gamer buys a graphics card. They buy on total cost of ownership per useful token, at the scale of a data center full of racks, powered by contracted megawatts, cooled by water they had to negotiate rights to.

At that scale, “best chip” is not the metric. “Cheapest way to serve the workload” is the metric. A hyperscaler chip that runs at a quarter of an NVIDIA GPU’s raw performance can still win if its all-in cost per token is lower: if it costs less than a quarter as much to buy and to power, or ships without a two-year lead time and an 80% gross margin markup going to Santa Clara.
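The arithmetic behind that claim can be sketched in a few lines of Python. Every number below is invented for illustration (prices, wattages, throughput, electricity rate, service life), and the model ignores cooling, networking, utilization, and financing, so treat it as a back-of-envelope sketch and not a real TCO model.

```python
HOURS_PER_YEAR = 8760


def cost_per_million_tokens(price_usd, watts, tokens_per_second,
                            usd_per_kwh=0.08, lifetime_years=4):
    """Purchase price plus lifetime energy, divided by lifetime tokens served."""
    hours = lifetime_years * HOURS_PER_YEAR
    energy_cost = watts / 1000 * hours * usd_per_kwh
    total_tokens = tokens_per_second * 3600 * hours
    return (price_usd + energy_cost) / total_tokens * 1_000_000


# Hypothetical incumbent GPU.
incumbent = cost_per_million_tokens(price_usd=40_000, watts=1_000,
                                    tokens_per_second=10_000)

# Hypothetical chip at a quarter of the speed, price, and power: break-even.
parity = cost_per_million_tokens(price_usd=10_000, watts=250,
                                 tokens_per_second=2_500)

# Hypothetical chip at a quarter of the speed but under a quarter of the
# purchase price: slower, and still cheaper per token.
challenger = cost_per_million_tokens(price_usd=6_000, watts=300,
                                     tokens_per_second=2_500)

print(f"incumbent:  ${incumbent:.4f} per million tokens")
print(f"parity:     ${parity:.4f} per million tokens")
print(f"challenger: ${challenger:.4f} per million tokens")
```

Under these made-up inputs the slower chip wins on cost per token, because the margin baked into the purchase price matters more than peak throughput.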

The competitive threat is not a chip that beats the GPU on peak performance. It is a chip that is worse on peak performance and still cheaper to run the workload. It has to win the invoice, not the benchmark. And the customers writing those invoices are the same four companies that account for most of NVIDIA’s revenue.

The Software Moat Is Getting Cheaper by the Month

The one place NVIDIA’s bulls plant their flag is CUDA: the ecosystem, the libraries, the twenty years of accumulated software that runs on NVIDIA and nothing else. This is real. It is also shrinking as a moat, and it is shrinking fast. AI coding tools are collapsing the cost of building software stacks. Compilers, runtimes, kernel libraries, framework integrations: the work that used to require hundreds of specialist engineers over years is now being done by teams a fraction of the size, in a fraction of the time. Ports that used to be quoted in engineer-decades are quoted in engineer-months.

NVIDIA’s CUDA moat was built against expensive human software. That cost is falling. When the cost of the moat falls, the moat falls with it.

The Moat Won’t Last

The LLM workload is matrix multiplication at scale. The architecture the workload wants is a statically scheduled machine, and Google has been shipping one for a decade. Every hyperscaler and every serious startup is now shipping the same. NVIDIA came late, paid $20 billion for a branch of the TPU family tree, and is trying to integrate a discipline it has no history in. Its CUDA moat, meanwhile, is being drained by AI-assisted coding that collapses the cost of building competing software stacks.

The moat is real today. It seems unlikely to last for more than a few years. The math of what happens to NVIDIA’s valuation when it doesn’t is the subject of a follow-on piece.

NVIDIA may have invented the vacuum tube, but the transistor is coming.

Cranky Old Guy
