Will Frontier AI Become a Commodity—or an Oligopoly?
A central question for investors today is how to value frontier AI companies. Are they building the equivalent of commodity infrastructure—destined for margin compression and intense competition? Or are they constructing something closer to Boeing, Airbus, or Synopsys: industries where only a handful of firms dominate for decades?
The answer is not obvious. In fact, AI may be the rare technology where both outcomes are happening at once.
The Illusion of Commoditization
At first glance, the case for commoditization is compelling.
AI capabilities are spreading rapidly. Models that once seemed extraordinary—writing code, summarizing documents, reasoning through problems—are now widely accessible. APIs price intelligence per token. Open-weight models continue to improve. Costs are falling. From the perspective of many users, AI already behaves like a commodity.
This is not an illusion. But it is incomplete.
The New Search
The clearest example of genuine commoditization is also the most visible: AI is replacing search.
When someone asks a question that used to go to Google — a recipe, a definition, a product comparison, a travel recommendation — a commoditized model handles it fine. Google, Amazon, and others have already built this tier. It does not require the latest frontier model. It requires something fast, cheap, and good enough. That bar is being cleared by a widening range of models, including open-weight ones.
The same shift is coming to reference material. Searching a technical manual is not just keyword matching — it requires understanding context and intent. Books and documentation will likely come with question-and-answer capabilities built in. That tier of AI, too, will commoditize.
This is a real and large market. The economics favor it, the use case demands it, and users benefit from it.
The new search is commodity AI. Still, expectations may rise over time, just as any camera was good enough for a mobile phone fifteen years ago, while buyers now demand near-professional image quality.
These Are Not Your Father’s Models
Training runs now cost on the order of hundreds of millions of dollars. They require massive compute clusters, specialized infrastructure, and tightly coordinated teams. More importantly, success depends on a growing body of tacit knowledge: how to curate data, stabilize training at scale, and design systems that improve reliably with size.
This begins to resemble industries that have historically resisted commoditization.
Consider commercial aircraft manufacturing. The physics is well understood. Many countries have the talent and capital to attempt it. Yet only a handful of firms—Airbus and Boeing—can reliably produce modern aircraft. The barrier is not just cost, but decades of accumulated experience, integrated systems, and unforgiving validation cycles. China, with its massive manufacturing base, deep engineering talent, and state resources, has spent decades trying to build a competitive commercial aircraft industry. It still relies on Airbus and Boeing for the aircraft its airlines fly.
Frontier AI may be closer to these industries than to traditional software.
The Open Source Counterargument
There is, however, a powerful counterargument grounded in history.
Open source has successfully commoditized some of the most complex software systems ever built. The Linux kernel runs much of the world. Modern compilers like LLVM rival or exceed proprietary alternatives. These systems are maintained by thousands of contributors and represent decades of cumulative expertise.
If such systems can be commoditized through distributed collaboration, why not frontier AI?
This is the right question—but it misses a key difference.
Compilers and operating systems are fundamentally design problems. Once the architecture is understood, improvements can be developed incrementally, tested locally, and integrated by a distributed community. The cost of verifying a contribution is relatively low.
Frontier AI is increasingly a training problem at scale.
Improvements often require running large, expensive experiments. Validation is not a unit test; it is a training run that may cost millions and take weeks. The system is tightly integrated, and small changes can have unpredictable effects. This makes it difficult to distribute innovation across a broad base of contributors.
Open source scales code. It struggles to scale expensive experimentation.
The EDA Analogy
Consider electronic design automation. EDA firms such as Synopsys and Cadence build the software used to design advanced semiconductors, tools comparable to compilers like LLVM in design and complexity. In principle, EDA should be an excellent candidate for open-source development. In practice, it is dominated by a small number of firms: those two alone hold roughly 74% of the market, with high retention and recurring revenue that has persisted for decades.
The reason is not technical complexity alone. It is the cost of validation and deep integration with cutting-edge fabrication processes.
Google has been attempting to build open source EDA tooling for years. The effort has made progress, but remains validated primarily at the 28nm process node, while the leading commercial fabs are operating at 3nm and pushing toward 2nm. That is six generations behind: 28nm → 20nm → 16/14nm → 10nm → 7nm → 5nm → 3nm. Each generation represents years of development and billions in R&D. That gap illustrates the problem precisely: it is not that the open source tools cannot write code. It is that closing the last distance requires deep, expensive integration with proprietary manufacturing processes that open source cannot easily replicate.
A bug in an EDA tool can result in a failed chip, with losses measured in millions or billions of dollars. Testing and verification are expensive and slow to iterate. As a result, these systems require extremely high reliability and deep integration with external constraints.
Frontier AI shares that profile. Training runs are expensive, failures are costly, and validation requires full-scale experiments rather than incremental tests. This makes the system difficult to modularize and even harder to improve through distributed contribution.
The Pipeline, Not the Model
Perhaps the most important shift in thinking is this: frontier AI is not an isolated app—it is a pipeline.
From the outside, it is easy to focus on the model itself: a transformer trained on vast amounts of data. The architecture is widely known. Open-source implementations exist. It can appear that success is primarily a function of scale.
But in practice, performance differences between leading models suggest something more complex.
Users consistently report that some models are more reliable, more coherent, better at reasoning, or simply “feel smarter” in ways that are difficult to capture in benchmarks. These differences persist even when architectures appear similar and training approaches are broadly understood.
This gap points to the importance of the end-to-end process used to build and refine models.
That process includes data sourcing, filtering, and weighting; experiment design and evaluation; training stability and optimization; infrastructure and throughput; post-training alignment and fine-tuning; and feedback loops from real-world usage.
Each component matters. More importantly, they interact.
A change in data affects optimization. A change in optimization affects stability. A change in architecture affects scaling behavior. The system is tightly coupled, and improvements emerge from how these elements work together.
This is not easily reducible to a set of published techniques.
Much of the advantage lies in tacit knowledge: lessons learned through failed experiments, subtle tradeoffs, and accumulated experience. This knowledge is expensive to acquire and difficult to transfer.
The Benchmark Problem
Benchmarks are the primary tool used to measure AI progress. They are also increasingly unreliable as a guide to real-world performance.
The problem is structural. Benchmarks measure what they measure — a defined set of tasks, evaluated in a defined way. Labs know what the benchmarks are. Training and post-training processes can be tuned, intentionally or not, to perform well on them. Over time, benchmark scores converge even when practical capability gaps persist.
Consider what a typical coding benchmark actually tests. Asking a model to write Pong, or solve a self-contained algorithmic problem, tells you something: the model can produce working code for a bounded task whose solution is well represented in its training data. But real programming is long-term and cumulative. It requires sustaining coherent architectural decisions across a large codebase, handling compounding complexity as systems grow, and recovering gracefully when something breaks deep in a structure the model itself built. Benchmark tasks are essentially Pong. They measure whether the model knows what code looks like. They say very little about whether it can build, develop, and maintain something real.
This creates a systematic illusion of commoditization.
Anthropic’s annualized revenue run rate reached $30 billion as of early 2026 — a roughly 14x increase from a year earlier. The number of enterprise customers spending over $1 million annually doubled in just two months. That is not the behavior of a market that has concluded the models are interchangeable.
A Personal Benchmark
Abstract arguments about benchmark reliability are one thing. Here is a concrete test.
I have spent roughly fifty years in computing and software, including significant work on compilers. To put the models to a real test, I set out to build a working C++ compiler at the C++20 standard — entirely AI-written, with me providing high-level direction but writing no code and doing no debugging myself. The task is not trivial. A modern C++ compiler is one of the most complex software systems a developer can attempt. The language specification runs to thousands of pages. The edge cases are unforgiving.
My company had already accumulated significant experience with Claude Code by this point and had made it our agentic tool of choice. But I started this project with Gemini, since at the time the benchmarks rated it at or near the top and Google was offering a good price on its higher-tier subscription. After a week or so, the project stalled. Gemini got stuck on certain problems and circled without progress, burning through all the compute time my subscription allocated. I would have had to debug the problems myself. It was getting nowhere. I switched to Claude Code, and the project has proceeded for months since with no comparable issues. In fairness, these models are improving rapidly; the same experiment run today might yield a different outcome.
To validate the work, I had it build a language verification suite — now at approximately 30,000 tests, which Claude Code assessed as comparable in coverage to commercial suites like Plum Hall. The compiler currently passes more than 60% of those tests — and that figure includes complex template processing and other notoriously difficult parts of the C++20 specification, not just the simpler conformance checks. It is also able to handle the Standard Template Library, which rules out any characterization of this as a toy implementation. Getting to 100% is not a technical barrier; it is a question of how much time I can allocate to a side project while running other work.
This is one experiment, one person, one task. It does not settle the question of which model is best across all use cases. But it is not a casual test either.
One frontier model could not do the task. The other could. For the question at hand, that is the relevant data point.
That gap does not show up in the benchmarks. It showed up in the work.
You may have heard that Anthropic built a C compiler using Claude. That project targeted C, not C++20. The difference matters, as the complexity note below explains.
For readers unfamiliar with compiler engineering: despite the similar names, a C++20 compiler is conservatively at least 100 times more complex than a C compiler — and likely more. C is a small, stable language. C++20 is one of the most specification-heavy, edge-case-dense targets in all of software development.
The Role of Liability
In high-stakes domains—medicine, law, software development—errors are costly, delayed, and hard to detect. A misdiagnosis may not surface for months. A legal argument built on a hallucinated precedent may not fail until it reaches a judge. A coding error in a financial system may sit undetected until it becomes a breach. When mistakes are cheap and visible, good enough wins. When they are expensive and invisible until they aren’t, you buy quality.
The accuracy math is asymmetric. In high-liability applications, the difference between a model that is 98% accurate and one that is 99.9% accurate is not a 1.9% improvement — it is the difference between a functional product and a lawsuit. That gap does not show up in most benchmarks. It shows up when something goes wrong.
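A rough back-of-envelope makes the asymmetry concrete. The sketch below is illustrative only: it assumes errors are independent across tasks, which real workloads violate, but the compounding shape is the point.

```python
# Illustrative only: assumes independent, identically distributed errors.
# Real-world error correlations change the numbers, not the shape.

def p_at_least_one_error(per_task_accuracy: float, n_tasks: int) -> float:
    """Probability that at least one of n_tasks goes wrong."""
    return 1.0 - per_task_accuracy ** n_tasks

n = 50  # e.g., 50 contract clauses reviewed, or 50 code changes merged
for acc in (0.98, 0.999):
    print(f"accuracy {acc:.1%}: P(>=1 error in {n} tasks) = "
          f"{p_at_least_one_error(acc, n):.1%}")
```

At 98% per-task accuracy, the chance of at least one error across fifty tasks is roughly 64%; at 99.9%, it is under 5%. A 1.9-point accuracy gap becomes a thirteen-fold gap in the odds of a clean run.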
Regulation reinforces this. Compliance frameworks like the EU AI Act impose documentation, auditing, and accountability requirements that only well-resourced organizations can absorb — giving regulated industries another reason to stay with providers who can demonstrate compliance, not just capability.
The Self-Improving Wildcard
There is a more radical possibility: the models themselves may eventually handle their own optimization.
Deep reinforcement learning has already demonstrated that AI systems can discover solutions that humans missed entirely, even after centuries of effort. AlphaGo did not just learn to play Go — it discovered strategic principles that human masters, after more than a thousand years of study, had never found. It found them by playing against itself at scale, unconstrained by human assumptions about how the game should be played.
If that same dynamic were applied to AI training pipelines — tuning data selection, stabilizing training dynamics, discovering more efficient architectures — the accumulated human expertise that currently constitutes the moat becomes less decisive. Whoever gets there first could compress decades of process knowledge into months. That changes the valuation question entirely: it is not just whether the current cost structure persists, but whether human expertise remains the binding constraint — or eventually gets automated away by the same technology those labs are building.
The Algorithmic Wildcard
The cost structure that underpins the moat argument is not fixed. Transformers have a fundamental scaling problem: the self-attention mechanism is O(n²) in sequence length. Double the context, quadruple the compute.
Research into linear attention, sparse attention, and state space models is already attempting to break that wall — pushing toward O(n log n) or better. If any of these succeeds at frontier quality levels, the cost per experiment drops structurally. Computer vision is the precedent: algorithmic improvements didn’t just reduce costs incrementally, they changed the economics of the problem entirely. The same kind of leap is plausible in language models.
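The scaling argument can be sketched numerically. These are toy cost models, not measurements of any real architecture: they simply compare how an O(n²) attention term grows against a hypothetical O(n log n) alternative as context length doubles.

```python
import math

def quadratic_cost(n: int) -> float:
    """Toy cost model for full self-attention: O(n^2) in sequence length."""
    return float(n * n)

def nlogn_cost(n: int) -> float:
    """Toy cost model for a hypothetical sub-quadratic attention: O(n log n)."""
    return n * math.log2(n)

# Each doubling of context quadruples the quadratic term but only roughly
# doubles the n log n term, so the gap compounds across doublings.
for n in (8_192, 16_384, 32_768):
    ratio = quadratic_cost(n) / nlogn_cost(n)
    print(f"n={n:>6}: quadratic={quadratic_cost(n):.2e}  "
          f"n log n={nlogn_cost(n):.2e}  gap={ratio:,.0f}x")
```

Over a handful of context doublings, that compounding gap is what turns an algorithmic improvement into a structural change in the cost per experiment.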
Investors underwriting IPO valuations that assume the current cost structure persists for a decade are making a bet on architectural stagnation. That is worth being explicit about.
The IPO Question
This debate is about to leave the realm of theory. OpenAI’s most recent funding implied a valuation of roughly $852 billion; Anthropic’s Series G placed it at $380 billion, with reported annualized revenues of $24 billion and $30 billion respectively. Public market investors will be asked to bet on which version of this story is true. For some related articles, see: What Is Anthropic Worth? and What Is OpenAI Worth?.
The moat question is the valuation question. If commoditization wins, these companies are priced like utilities at best. If the oligopoly thesis holds, they may be the Boeings and TSMCs of the next decade — ugly unit economics today, durable position tomorrow. DeepSeek produced a model competitive with frontier systems for an estimated $5-6 million in compute — a striking data point, though one with constraints and caveats that are still being debated. It does not settle the question. It does establish that the question is open.
Conclusion
For anyone trying to value investments in frontier AI companies, these are the questions to track. None of them has a clear answer yet.
That said, the evidence available today leans toward the oligopoly thesis holding at the top of the stack — at least for now. Enterprise adoption is accelerating toward the frontier, not away from it. The performance gaps that matter in professional work are larger than benchmarks suggest. But the wildcards that could break the moat — algorithmic breakthroughs, self-improving systems — remain genuinely uncertain rather than imminent.
The floor is commoditizing. The ceiling is moving up faster. Investors who treat those as the same market may misprice both.

