The Economic Engine of AI

2026 will mark the year when AI platforms compete less on capability and more on attribution infrastructure: the ability to trace value back to its source and to compensate accordingly.

Every economic engine requires fuel. The steam engine ran on coal. The industrial assembly line ran on electricity. The internet ran on advertising revenue. Generative AI runs on data — specifically, data about genuine human knowledge and interactions.

Despite its short history, AI’s data economy has evolved rapidly. This article traces that evolution — from the dismantling of legacy ownership to the rise of service-distribution platforms — and explains why attribution infrastructure is becoming the core determinant of AI’s economic future.

The Economic Shift

Traditional valuation centered on legacy ownership: Disney’s copyright to Mickey Mouse, Universal’s distribution rights, AMC’s exhibition channels. Each layer commanded revenue through controlled licensing and distribution.

Generative AI severs that traceability. A parent can now prompt a model to restyle a home video as a Mickey-like cartoon, perhaps even with a Universal-style intro and an AMC-branded backdrop. The result is instant, personalized, and arguably more meaningful. But it triggers no copyright claim, no licensing fee, no ticket sale. It bypasses every traditional beneficiary.

AI models have been trained on vast troves of books, artwork, music, and code scraped from the web, often including copyrighted material. A leaked list of websites scraped to train Meta’s AI models contained copyrighted content, pirated books, and adult media. This shortcut lets AI deployments generate text, images, or music in the style of real creators while returning nothing to them.

Once such works are absorbed into training data, derivatives can be produced indefinitely. The chain of custody — and compensation — collapses. What matters now is not who owns yesterday’s catalog, but who retains the capacity to create relevance right here and now.

Legal Pushbacks

Authors, artists, and publishers have filed major lawsuits against AI firms for training on their works without consent. A group of book authors sued Anthropic after discovering it had downloaded approximately 7 million pirated books from shadow libraries to train its models. Facing potential statutory damages exceeding $1 trillion, Anthropic settled in 2025 for roughly $1.5 billion rather than risk trial.

Visual artists have brought similar suits against Stability AI, alleging billions of online images were scraped without compensation. Hollywood unions introduced AI guardrails in their 2023 agreements, protecting writers’ scripts and actors’ digital likenesses from unauthorized replication.

Collaborative Countermoves

More incumbents are choosing partnership over confrontation. Getty Images built an AI generator with Nvidia, trained exclusively on its licensed library — promising full rights protection and royalty payments to photographers whose images influenced the model. Shutterstock struck a deal with OpenAI to integrate DALL·E, launching a contributor fund to pay artists for content used in training. In music, Universal and Warner Music are finalizing licensing deals where every AI-generated song using an artist’s voice or composition triggers a micro-royalty, mirroring Spotify’s per-play model.

Meanwhile, Reddit — once a free source of conversational data — now charges for API access and earns roughly $60 million per year from Google’s licensing deal.

But these arrangements resemble a “final auction” for monetizing existing data before it becomes completely irrelevant — a last attempt to extract value from the pre-AI content era before synthetic and real-time data generation overtakes it.

Technological Battles

Technologically, infrastructure providers and startups are building digital barriers and forensic tools to regulate AI’s reach. Cloudflare introduced controls that allow websites to block or meter AI crawlers, piloting a pay-per-crawl model. Meanwhile, a new class of startups, including several from YC’s recent batches, is using AI itself to scan publicly available generated content for potential patent or copyright infringement.
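
To make the block-or-meter idea concrete, here is a minimal sketch in Python. It is our illustration, not Cloudflare’s implementation; the user-agent strings are real crawler identifiers, but the handler, price, and payment check are hypothetical:

```python
# Sketch of "block or meter" for AI crawlers (illustrative only, not
# Cloudflare's implementation). Known AI crawlers identify themselves
# via User-Agent; a site can deny them or demand payment per page.
AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot", "Google-Extended"}
PRICE_PER_PAGE = 0.01  # USD, a hypothetical rate

def handle_request(user_agent: str, has_paid: bool) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for an incoming request."""
    if any(bot in user_agent for bot in AI_CRAWLERS):
        if not has_paid:
            # 402 Payment Required: meter the crawl instead of blocking it.
            return 402, f"Pay ${PRICE_PER_PAGE} per page to crawl."
        return 200, "<html>licensed content</html>"
    return 200, "<html>content</html>"

print(handle_request("Mozilla/5.0 (compatible; GPTBot/1.2)", has_paid=False))
```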

Yet the effectiveness of these measures remains uncertain. Once a model trained on copyrighted or near-copyright material is deployed, it delivers content directly to consumers with minimal traceability. And the model can be further fine-tuned, distilled, or stacked, obscuring provenance even more.

The Invisible Subsidy

Imagine ordering restaurant meals through Deliveroo, but paying far less than the price on the menu. How is this possible? Because the food was originally made for someone else, who paid for it yesterday.

That’s how AI has operated: delivering content built on unpaid labor. Over the past decade, billions of users freely created content on blogs, forums, social media, Wikipedia, and fan sites, never imagining it would become AI’s raw material. OpenAI’s GPT models, for instance, trained on Common Crawl, a dataset that includes vast swaths of user-generated content. By 2019, the fanfiction site Archive of Our Own had published over 32 billion words, eight times the English text of Wikipedia, much of which ended up in training data. As one writer remarked, “If you’ve played with ChatGPT — congrats, you’ve used my work.”

In another case, an independent developer scraped 12.6 million fanfic stories and uploaded them to Hugging Face as an AI dataset, sparking outrage among authors who described it as “theft from a gift economy.” This pattern extends across domains: billions of images scraped for LAION-5B; open-source code used in Copilot; and unlicensed audio in early music AIs.

But that content was never meant to be free input. Studios shared trailers to sell cinema tickets; illustrators posted portfolios to attract commissions; programmers open-sourced code to win projects. The implicit bargain: free samples in exchange for paid work.

AI reverses that logic. The more content creators publish online, the easier it becomes for consumers to access their expertise through AI without paying them at all. What was once unpaid marketing has become free fuel.

The only question now is: who pays for tomorrow’s meal?

Nothing is Free Forever

AI is voraciously data-hungry, and increasingly ready to pay for whatever feeds it.

Data annotation companies such as Scale AI, Surge AI, and Mercor are now among the most valuable AI firms, providing the high-quality labeled datasets that power frontier models. By late 2024, AI-generated text had surpassed human-written articles in volume; today it accounts for over 70% of new online material, a figure expected to exceed 90% by the end of 2025. As AI content saturates the web, new models risk being trained on synthetic data, effectively learning from their own output.

The Ouroboros Effect

Researchers call this recursive degradation the ouroboros effect: the snake eating its tail. A 2024 Nature paper demonstrated how repeated self-training causes “model collapse”: language models trained on synthetic text progressively lose fidelity, forget the tails of real data distributions, and generate increasingly bland or nonsensical content.
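
The mechanism is easy to see in miniature. The following toy simulation is our own sketch, not from the paper: each generation fits a Gaussian to samples drawn from the previous fit, and with only a finite sample per generation, the fitted spread typically decays and the original distribution’s tails vanish.

```python
import numpy as np

# Toy simulation of recursive self-training ("model collapse").
# Each generation fits a Gaussian to the previous generation's samples,
# then replaces the data with its own synthetic output. With a finite
# sample per generation, the fitted spread drifts and typically decays,
# taking the original distribution's tails with it.
rng = np.random.default_rng(42)

n = 20                             # tiny "training set" per generation
data = rng.standard_normal(n)      # generation 0: genuine human data
for gen in range(1, 41):
    mu, sigma = data.mean(), data.std(ddof=1)
    data = rng.normal(mu, sigma, n)   # train the next model on synthetic data
    if gen % 10 == 0:
        print(f"generation {gen}: fitted sigma = {sigma:.3f}")
# Typical run: sigma drifts well below 1, i.e. successive models
# progressively forget the spread (and especially the tails) of the
# original human data.
```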

High-quality, human-generated material — particularly data from genuine interaction — is becoming the most valuable and scarce resource.

The Data Well Runs Dry

Major content platforms have begun charging for or restricting data access. Reddit’s API lockdown, Twitter’s API pricing overhaul, and Stack Overflow’s restrictions all point to the same trend: data scarcity. News organizations like the New York Times have blocked OpenAI’s crawler and sued to stop unlicensed use, arguing that AI models “compete with their sources” without sending traffic back.

OpenAI, Anthropic, and others are responding by signing paid licensing deals — with the Associated Press, Financial Times, and more — as the open-web pipeline closes. The era of free data is over. The next fuel is verified, compensated human input.

Vertical AIs

As data becomes specialized, so do the models. A major trend in 2025 has been vertical AIs — systems trained on proprietary data within specific industries.

Precision Over Scale

Vertical AIs typically achieve better performance in their specialized domains. BloombergGPT, a 50-billion-parameter model, was trained on financial news, filings, and market data; it significantly outperforms general models on financial NLP tasks while maintaining parity on general benchmarks. In healthcare, Med-PaLM 2 achieved 85%+ accuracy on U.S. medical exam questions, on par with certified physicians.

In law, Harvey AI, built on GPT-4, was deployed by Allen & Overy, assisting 3,500 lawyers in tasks from due diligence to drafting. The firm reported tens of thousands of uses and measurable productivity gains. A Harvard/BCG study similarly found that consultants using GPT-4 completed tasks 25% faster with 40% higher-quality outputs.

New Ownership Models

Vertical AIs are often co-owned by data providers. Thomson Reuters acquired Casetext for $650M, merging its legal database with Casetext’s CoCounsel AI. Getty’s AI is built on its own dataset, paying royalties to contributors. These arrangements mark a departure from centralized AI ownership toward distributed, domain-specific ecosystems.

This co-ownership reshapes competition. SaaS firms once sold tools to enhance productivity; vertical AIs replace entire workflows. A SaaS salesperson convinces a manager to adopt software. A vertical AI salesperson convinces the CEO to replace the department.

For professionals, this means large-scale job displacement. For business owners, traditional moats — proprietary data, engineering talent, even speed — are eroding as AI agents assume end-to-end control of workflows.

Owning a strong and efficient distribution network remains an advantage — but only until the arrival of AI operating systems.

AI Operating System

AI platforms are now acting as operating systems — intermediating between users and services.

In October 2025, OpenAI released its ChatGPT Apps SDK, letting third-party developers integrate directly into ChatGPT’s interface. Over 80 launch partners, including Expedia, Spotify, and Shopify, made ChatGPT not just a chatbot but a unified service layer.

This shift redefines service distribution: from app stores, search rankings, and franchises to AI chats. Nor is it just OpenAI; competitors are converging. Anthropic’s Claude integrates into Slack; Google’s Gemini (formerly Bard) now accesses Gmail, Docs, Maps, and YouTube; Microsoft’s Copilot is built into Windows and Office.

For over 800 million active users, the experience will feel seamless. They can explore, compare, and transact without leaving the chat. For businesses, however, losing visibility inside this AI interface means losing relevance. Users will find your service only when the AI deems it relevant, and leaving your service will take no effort at all.

It’s natural to ask: Can vertical AIs extend their reach inside these centralized AI ecosystems?

For global premium brands, yes: because AI relies on probability, and those brands already dominate the training data. When a user asks “book a hotel in London,” the model is more likely to open Booking.com than a startup alternative.

For everyone else, service delivery will be far less predictable. Yet the battlefields remain vast — especially across personalized requests that fall outside major brands. A psychology-focused AI might answer “How can I make new friends at university after a gap year?” A healthcare AI might respond to “I’m feeling dizzy and short of breath — what should I do?” And often, it will be a combination of both, addressing complex, overlapping needs like: “I can’t make new friends after my gap year in university, and now I feel dizzy and short of breath.”

In effect, specialized vertical data will be ported back into centralized AI platforms for monetization. To make such transactions possible, platforms need efficient and robust attribution infrastructure.

AI Attribution as Infrastructure

Attribution is like measuring the ingredients before baking: weighing sugar and flour before mixing, not tasting sweetness afterward. When each ingredient has a known cost, the whole product can be priced and shared fairly. Attribution is thus the foundation of the coming AI service economy.

It is also the bottleneck. When an AI platform onboards multiple third-party services, it needs the capacity to attribute usage so that each transaction can be honored. For software, this is simple: usage is metered and billed per call.
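
In the software case, the entire attribution problem fits in a few lines. A minimal sketch, where the service names, prices, and functions are hypothetical rather than any platform’s real API:

```python
from collections import defaultdict

# Per-call metering, the "simple" software case. The service names,
# prices, and functions here are hypothetical, not any platform's API.
PRICE_PER_CALL = {"weather_api": 0.002, "flight_search": 0.010}  # USD

usage = defaultdict(int)

def record_call(service: str) -> None:
    """Increment the meter each time the platform invokes a service."""
    usage[service] += 1

def settle() -> dict[str, float]:
    """Compute what the platform owes each service provider."""
    return {s: n * PRICE_PER_CALL[s] for s, n in usage.items()}

record_call("weather_api")
record_call("flight_search")
record_call("weather_api")
print(settle())  # {'weather_api': 0.004, 'flight_search': 0.01}
```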

But what about knowledge, ideas, or consultations? Should asking about treating cancer cost the same as asking about a common cold? If a thousand healthcare AIs could answer, which one should be paid, and how much?

Technical Foundations

There have been research efforts to quantify attribution within AI models. Microsoft has launched a “training-time provenance” project to estimate which data points (photos, books, or conversations) most influenced a model’s outputs. Researchers at Carnegie Mellon, Adobe, and UC Berkeley developed algorithms that trace image influence in text-to-image models, quantifying which training images shaped a new output and to what degree.
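
One common family of methods scores influence by gradient similarity: a training example whose loss gradient points the same way as a test output’s gradient gets credit for that output. Here is a deliberately tiny sketch of the idea, in the spirit of TracIn; the linear model and the data are invented for illustration:

```python
import numpy as np

# Gradient-similarity attribution in miniature (in the spirit of TracIn):
# influence(train_i -> test) ~= grad_loss(train_i) . grad_loss(test).
# The linear model and the data below are invented for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=3)                  # pretend these are trained weights

def grad(x: np.ndarray, y: float) -> np.ndarray:
    """Gradient of the squared error 0.5 * (w @ x - y)**2 w.r.t. w."""
    return (w @ x - y) * x

train = [(rng.normal(size=3), rng.normal()) for _ in range(5)]
test_x, test_y = rng.normal(size=3), rng.normal()

g_test = grad(test_x, test_y)
for i, (x, y) in enumerate(train):
    score = float(grad(x, y) @ g_test)
    print(f"train example {i}: influence score {score:+.3f}")
# A large positive score means this training example's gradient aligns
# with the test example's, so a gradient step on it would also lower
# the test loss: it gets credit for "influencing" this output.
```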

But it’s a challenging direction. Transformer architectures were never designed to track such provenance (see our earlier article: How a Transformer Model Loses Attribution: A Step-by-Step Example).

Commercial Momentum

Commercially, attribution is still in its early phase — mostly achieved through up-front compensation or royalty pooling for data contributors.

Startups like Bria AI claim to “programmatically track and compensate” data contributors, while Adobe Firefly and Shutterstock’s AI already operate royalty pools that pay artists based on the use of their training data. Getty’s model likewise shares a pro-rata royalty with every photographer whose file contributes to a generation. Mercor connects and compensates professionals across fields, supplying “expert data” used to train and evaluate models at top AI companies.

By contrast, usage-based attribution — where compensation reflects how often or how significantly a contributor’s data is used in real outputs — remains far less developed. OpenMercury is among the first to develop such systems. The company’s Mercurise platform enables real-time traceability of AI-generated value and ensures its return to the original sources.
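
To make “usage-based” concrete, here is a toy illustration of the settlement step. It is our own sketch, not the Mercurise implementation, and the contributors and scores are hypothetical: each generated output carries attribution scores over contributors, and revenue is split pro-rata.

```python
# Toy settlement step for usage-based attribution (our illustration,
# not the Mercurise implementation). Each generated output carries
# attribution scores over contributors; revenue is split pro-rata.
def split_revenue(revenue: float, scores: dict[str, float]) -> dict[str, float]:
    """Divide a transaction's revenue in proportion to attribution scores."""
    total = sum(scores.values())
    return {who: revenue * s / total for who, s in scores.items()}

# Hypothetical: one paid answer drew 60% on a clinic's dataset,
# 30% on a medical textbook, and 10% on a forum thread.
print(split_revenue(0.50, {"clinic": 0.6, "textbook": 0.3, "forum": 0.1}))
# -> {'clinic': 0.3, 'textbook': 0.15, 'forum': 0.05}
```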

The Road Ahead

AI has been both wall-breaker and equalizer — democratizing creation, yet eroding traditional value structures. But the free data era is closing. From here, authentic human interaction becomes the scarcest and most valuable resource.

Setting aside the AGI dream that benefits only a select few, AI will from this point forward increasingly serve as a transaction engine among services, creating new ways for individuals to earn and participate within the ecosystem itself. The shift is already underway: on Product Hunt, app launches that pay you to use AI outperform those that charge you to use it.

2026 will mark the year when AI platforms compete less on capability and more on attribution infrastructure: the ability to trace value back to its source and to compensate accordingly. It will be the turning point from AI-driven job displacement to AI-enabled job creation.


At OpenMercury, we build the attribution layer of AI — empowering enterprises and individuals to convert their knowledge and ideas into active components of the AI service ecosystem. Contact us to get involved.


Suggested Citation

For attribution in academic contexts, please cite this work as:

Liao, S. (2025). The economic engine of AI. OpenMercury Research.
