Returning from a week in San Francisco, the heart of the AI ecosystem, participants in a WAN-IFRA study tour came away with a clearer vision of an emerging market for news content. The market is still sorting out winners and losers, and publishers are uncertain about the “buy side”: the scale of demand for their content, who will want it, how much they will pay for it, and which types of content will command premium prices. But it was clear publishers have a few investments they need to make now if they are to realise returns from this developing market.
Until recently, the AI content marketplace was a “chaotic” “Wild West”, marked by widespread scraping without publishers’ permission or knowledge and by “opaque”, unsettled pricing, according to pioneers working to build a functioning marketplace. However, at the start of 2026, participants on our study tour were shown how a functioning AI content marketplace is beginning to take shape.
To take advantage of this emerging market, news publishers should prioritise:
- Managing automated bot traffic scraping.
- Cataloguing and structuring content for delivery in machine-readable formats.
- Adapting to shifts in market demand, moving from generalised training data to specific data for fine-tuning or domain-specific applications, and grounding data for inference.
- Tracking the use of their content to understand how this market values different types of content and how it is priced.
But known unknowns remain for publishers to navigate. Pricing precision is improving but still messy. Marketplaces are developing but remain experimental. Even so, collective action holds promise for the news media industry to gain leverage in discussions as the market develops.
1. Manage automated access
To take advantage of this developing market for news content, publishers must manage the access that bots and scrapers have to it.
“If everyone could go into a shop without paying for it, no one would pay for it. This is still happening. There is a huge practice of illegal scraping of content even if it is paywalled,” Ana Jakimovska, the Head of AI strategy for Mediahuis, told the News in the Digital Age conference in London.
For its largest sites, Mediahuis blocks 100,000 bots and scrapers, she said, adding that the blocking has had minimal impact, reducing traffic by only low single-digit percentages.
Content Delivery Networks (CDNs), including Cloudflare, Akamai and Fastly, offer services to help publishers manage traffic and how bots and scrapers access their content. In the traffic era, the value exchange was that search engines crawled sites and, in return, sent them traffic, which generated revenue via ad impressions.
“Last year, in July, we witnessed that the internet is changing. The quid pro quo of the search(-based) internet … has changed,” Sam Else, Cloudflare’s Senior Director of Strategic Partnership for Media, Creators and AI, told study tour participants.
Ten years ago, a search engine sent back one referral for every two crawls of a site. Now, Google sends back one referral for every five crawls, while Perplexity sends back one referral for almost every 155 crawls, according to Cloudflare’s Radar service. For Anthropic, the figure is one referral for more than 28,000 crawls (at the time of our visit to Cloudflare in January 2026).
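To make the scale of that shift easier to compare, the ratios above can be restated as referrals per 1,000 crawls. The snippet below is a minimal sketch: the figures are the rounded ones quoted in this article ("almost 155" and "more than 28,000" are treated as exact for illustration), and the function and variable names are our own.

```python
# Crawls needed to earn one referral, as quoted in the article.
# "Almost 155" and "more than 28,000" are rounded here for illustration.
CRAWLS_PER_REFERRAL = {
    "Search engine, ten years ago": 2,
    "Google": 5,
    "Perplexity": 155,
    "Anthropic": 28_000,
}


def referrals_per_thousand_crawls(crawls_per_referral: float) -> float:
    """Convert 'one referral per N crawls' into referrals per 1,000 crawls."""
    return 1000 / crawls_per_referral


for name, ratio in CRAWLS_PER_REFERRAL.items():
    rate = referrals_per_thousand_crawls(ratio)
    print(f"{name}: {rate:.2f} referrals per 1,000 crawls")
```

On these numbers, the return on being crawled falls from hundreds of referrals per 1,000 crawls to a fraction of one.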
News publishers will only have an opportunity to negotiate terms if they manage access to their content. People Inc. struck a licensing deal with Microsoft last November, and the company’s CEO, Neil Vogel, says that using Cloudflare to control bot access to its content brought the software giant to the table, with People Inc. becoming a launch partner for Microsoft’s Publisher Content Marketplace.
Another element of managing access to content is the emergence of protocols such as Really Simple Licensing (RSL) or IAB’s Content Monetisation Protocols (CoMP). They provide machine-readable licensing and payment instructions. They do not replace bot management services, and RSL can be integrated with CDN solutions.
Regardless of the solution, managing bot access is an imperative for news publishers. Without protections in place, they will never have the leverage to engage platforms in a discussion about monetisation.
Eckart Walther, Co-Founder of the RSL Collective, presenting the RSL content licensing protocol to WAN-IFRA’s San Francisco AI Study tour participants
2. Catalogue and structure your content for higher returns
The good news is that AI labs are beginning to see value in news media content.
“AI progress is bottlenecked by data, not models or compute,” Brooke Hartley Moy, the CEO and co-founder of Infactory, told study tour participants. Infactory works with content companies, including news publishers, to structure their archival content and make it machine readable to increase its value.
“AI builders’ scarcest input is the combination of licensed and annotated content. This curated, annotated information looks a lot like journalism.”
AI companies have “massive demand for (structured) content rather than raw content,” Hartley Moy said. Structured content is machine readable through API or JSON feeds. Companies like Infactory and Protege structure licensees’ content, including that from news publishers, for a cut of the revenue.
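As a rough illustration of what "structured" can mean in practice, the sketch below turns an article into a JSON record with explicit fields. It is not a format used by Infactory, Protege or any AI lab: the `ArticleRecord` class, its field names and the sample values are all hypothetical, chosen only to show the difference between a page of text and a machine-readable record.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ArticleRecord:
    """Hypothetical machine-readable record for one article."""
    article_id: str
    headline: str
    published: str                      # ISO 8601 date, e.g. "2026-01-15"
    topics: list[str] = field(default_factory=list)
    body: str = ""
    rights: str = "publisher-owned"     # placeholder label, not a real licence term


def to_json(record: ArticleRecord) -> str:
    """Serialise a record to a JSON string suitable for a feed or API response."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


# Placeholder content, not a real article.
example = ArticleRecord(
    article_id="example-0001",
    headline="Example headline",
    published="2026-01-15",
    topics=["finance", "markets"],
    body="Example body text.",
)
print(to_json(example))
```

The point of the explicit fields is that a buyer can filter by date, topic or rights status without parsing the page the way a scraper would.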
Publishers are increasingly thinking of a world where they have two audiences: humans and machines, which may be bots, scrapers or agents. Humans consume content in articles, videos, visualisations and audio, but the machines want streams of data.
“If people don’t realise that structured data is a tablestake, they are already out of the game,” Madhav Chinnappa said. Chinnappa has worked at the AP, the BBC and spent 13 years at Google. He is currently at the Reuters Institute studying the relationship between foundation models and news publishers.
Brooke Hartley Moy, CEO and Founder, Infactory, speaking to the WAN-IFRA San Francisco AI Study tour 2026
3. Adapt to shifting demand from AI labs
In addition to making data machine readable, other factors that drive premium pricing include:
- Rarity – Unusual and difficult-to-find content has a higher price.
- Clear IP control – Do you own the content outright?
- Quality – Broadcast-quality video commands higher returns than mobile phone video, and one key consideration is that the video can include B-roll, not just the video that was broadcast.
- Domain-specific content – As the AI market develops, companies like Perplexity are focusing on specific high-value markets such as finance, law and elite sports.
- Continuity – content across a story or theme that develops over time.
By structuring the content into machine-readable formats and paying attention to these factors, Hartley Moy said that content creators can see the price paid for content by AI platforms increase by 10 to 30x.
The demand for some types of content is so high that audio, video and images that weren’t broadcast or published suddenly have value. News publishers need to catalogue and structure their archives and their inventory of multimedia content that hasn’t been distributed to work out whether they have an untapped revenue stream.
“We’re increasingly looking beyond language models,” says Hartley Moy, who believes that “the next wave is more about world models (how the real world works) and domain-specific models (e.g. healthcare, legal, financial, climate) – and that’s an area where publishers are uniquely well positioned to contribute. You’ve already made the investment; it’s inherently your business. You don’t need to change what you do. The real question is: how do you make those assets usable for the people who need them?”
As of early 2026, the consensus in Silicon Valley is that the battle in the AI market is shifting from general models to lucrative verticals that serve enterprise customers. For instance, Perplexity is positioning itself as the model for decision-making professionals.
“Perplexity is built for investors, lawyers, doctors, elite athletes and journalists – it’s not built for everybody. It’s for people who can’t afford to be wrong. They seek accuracy, truth, and low-friction answers,” Jessica Chan, Head of Content and Publisher Partnerships at Perplexity, told us. This typifies the pivot to focus on high-value niche enterprise applications that AI labs are executing in 2026.
This is driving a shift in demand from general content for training to domain-specific content and material to fine-tune models, Dave Davis, Chief Content Officer for Protege told the study tour participants. For example, they recently had a request for 10,000 video clips of helicopters to fine-tune a video model.
They work with 150 media partners to structure video and audio content, and they have 400,000 hours of video stored on Amazon’s cloud, he said. Protege doesn’t charge licensees for structuring their content, but it does take a cut of the revenue, usually around 35%, for the licensing deals it strikes with AI labs. The company has moved away from licensing the content in perpetuity to 20-year licenses.
Demand is also increasing for grounding or inference data that is used in RAG (retrieval-augmented generation). Most journalists would see grounding data as the facts found in their reporting.
4. Pricing for training data is starting to resolve
Another sign of the maturing market is greater clarity in pricing, at the very least for the kind of audio and video content Protege manages. When Davis started working in the sector two and a half years ago, the pricing was all over the place. Back then, pricing “could be give or take 1000% (or a factor of 10),” he said, adding that the variation is now about 50%.
Not only has precision improved but so have returns as part of the shift from training to fine-tuning and specialist data. In the early days, AI companies were paying $1 to $2 per minute of video, and with the shift in data demand, values have increased 5 to 25-fold, Davis said.
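A back-of-the-envelope calculation shows what those figures imply. The sketch below is illustrative only: it assumes the 5 to 25-fold increase applies to both ends of the early $1 to $2 range, and that the roughly 35% cut Protege takes applies to deals at these prices. Neither assumption was stated by Davis, and the function names are our own.

```python
EARLY_PRICE_PER_MINUTE = (1.0, 2.0)  # USD per minute of video, early days
UPLIFT = (5, 25)                     # "5 to 25-fold" increase quoted by Davis
INTERMEDIARY_CUT = 0.35              # Protege's usual cut, "around 35%"


def price_range(base: tuple[float, float], uplift: tuple[int, int]) -> tuple[float, float]:
    """Lowest and highest implied price: low base x low uplift, high base x high uplift."""
    return (base[0] * uplift[0], base[1] * uplift[1])


def content_owner_share(gross: float, cut: float = INTERMEDIARY_CUT) -> float:
    """Amount left for the content owner after the intermediary's cut (assumed flat)."""
    return gross * (1 - cut)


low, high = price_range(EARLY_PRICE_PER_MINUTE, UPLIFT)
print(f"Implied price per minute: ${low:.2f} to ${high:.2f}")
print(f"Content owner share per minute: "
      f"${content_owner_share(low):.2f} to ${content_owner_share(high):.2f}")
```

Under these assumptions the implied range is wide, which is consistent with Davis’s point that pricing is clearer than it was but still far from settled.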
And the more companies like Infactory and Protege structure the data, the higher the value of the deals. A recent Protege deal delivered roughly $1.3m to content licensees, though the clips requested might have involved thousands of licensees.
As Hartley Moy of Infactory said, the more AI companies can offload the work of structuring data, the more they will pay for it. When making choices about which AI marketplaces to engage with, publishers will need to make sure they have the data about content demand by AI players to understand this emerging market.
An emerging multi-layered market
WAN-IFRA’s AI in Media lead Ezra Eeman says this emerging market is multi-layered. Its elements include:
- How the data is being used – Is the data being used for training and, if so, is it for fine-tuning existing models or for training high-value, domain-specific models? Or is it being used as grounding data?
- The compensation model – The type of data and its use are one factor in the compensation model, which includes pay-per-crawl, pay-per-use and long-term licensing.
- A hybrid market – This is a feature of this developing market. Rampant unauthorised scraping still occurs by data brokers and unidentified agents, which are listed as “user agent unknown” by CDNs and bot management companies. Bilateral deals between AI labs and the largest news media players in any given market dominated the initial rush of general training. This is giving way to emerging marketplaces and collective industry action.
As the market has developed, a couple of models have emerged based on how the data is being used, according to content licensing veteran Paul Melcher: the tollbooth (companies like Tollbit that charge at the point of access) and warehouses (companies like Protege and Troveo, which aggregate, package and structure large amounts of content from multiple licensees). The warehouses license that content to AI labs, sharing a cut of the revenue with the content owners.
The tollbooth layer is based on a pay-per-crawl model, charging AI companies at the point they access data. Another model is based on pay-per-use, used by services like ProRata. The latter company shares half of revenue from its answer engine and other services, paying out each time a licensee’s content is attributed in an answer. It has major backers in the publishing business such as the UK’s dmg media and The Atlantic, and the company has licensing deals with dmg media, The Guardian and Sky News.
Most of what we have talked about is archival content. The next frontier in the AI content market will be grounding data – real-time streams of facts that drive answer engines.
It is evident that the market for streams of structured data based on news content is not only maturing but becoming more attractive with the entry of major players such as Microsoft, which after piloting its Publisher Content Marketplace in 2025 with a select group of partners, announced an expansion in February 2026. Microsoft’s goal for the marketplace is to create “a low-friction, high-trust, scalable way to make content available for AI engines and a way for content publishers to get compensated for the premium content,” Nikhil Kolar, Vice President at Microsoft AI, told Digiday.
Microsoft’s PCM is designed to create a marketplace for data from content creators large and small. Microsoft says that “usage-based reporting” will help create pricing clarity. As part of the launch, Microsoft said that it had already signed up Yahoo as a buyer in its marketplace.
As Microsoft announced that PCM was launching out of its pilot phase, rumours were swirling that Amazon would be launching a similar marketplace soon.
The case for collective action by news publishers
Another shift that is gaining momentum has been from bilateral deals between AI companies and the largest publishers in any market towards collective action by groups of publishers or national associations.
The Danish Press Collective Management Organisation, which represents 99% of Danish news publishers, has been particularly active in striking collective deals. The organisation formed in 2021 and brought together digital news startups, traditional news organisations and the country’s public service broadcaster. The group has been striking licensing deals with Microsoft and ProRata for search and AI applications.
The call for collective action is sweeping across Europe. With the options for creating a functioning AI content marketplace becoming clearer, news media leaders see a chance to define the marketplace rather than simply being affected by the decisions of well-funded start-ups and tech giants. “The industry has to lay out how they want the market to function,” Douglas McCabe, the chief strategy officer of The Guardian, said.
The Guardian just announced the launch of Spur – the Standards for Publisher Usage Rights coalition – along with the BBC, Financial Times, the Telegraph and Sky News. “Artificial Intelligence is fundamentally reshaping how content is created, distributed, discovered and monetised. We believe we need to come together to protect original journalism and secure the long-term sustainability of our industry,” the leaders of the coalition partners wrote in a joint letter at the launch of the initiative.
The goal of Spur is to “establish shared technical standards and responsible licensing frameworks that ensure AI developers can access high-quality, reliable journalism in legitimate, responsible and convenient ways, while guaranteeing that publishers retain practical control of their content and receive fair value when it is used”.
Publishers see an opportunity to learn from the past and join together to respond to the incredible market power that trillion-dollar tech giants have. There is a new sense of optimism in opportunities provided by collective action.
Mediahuis’ Jakimovska said: “We need to unite. That is where I think we can win.”
