Rising Opt-Out Trend Among Major Publishers
Many prominent publishers and social platforms are opting to exclude their data from Apple’s AI training.
This development comes less than three months after Apple introduced Applebot-Extended, a tool designed to give website owners the ability to opt out of having their data used to train Apple's AI models.
High-profile entities such as Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, and WIRED’s parent company, Condé Nast, have taken advantage of this option.
The New York Times was among the first to block the new agent.
This reaction underscores a growing conflict over the use of web data to train AI systems and marks a shift in how web crawlers, long used to gather information for search engines and other internet services, are perceived.
The Evolution of Applebot and the Emergence of Applebot-Extended
Applebot, originally launched in 2015, was designed to enhance Apple's search functionalities, including Siri and Spotlight.
However, as Apple's AI initiatives expanded, so did the purpose of Applebot.
The data it collected began to be used to train Apple's foundation models.
To address concerns from publishers and content creators about how their data was being utilised, Apple introduced Applebot-Extended.
This new extension allows website owners to specifically request that their data not be used for AI training purposes.
Unlike the original Applebot, which continues to crawl websites for search functions, Applebot-Extended does not crawl pages at all; it simply signals whether data already gathered by Applebot may be used to train Apple's AI models.
Publisher Reactions and Data Insights
The reaction to Applebot-Extended has been significant, with many publishers opting to block it.
Data from Ontario-based AI-detection startup Originality AI shows that, as of last week, about 7 percent of high-traffic websites—primarily news and media outlets—were blocking Applebot-Extended.
This week, an analysis by Dark Visitors revealed that approximately 6 percent of websites had blocked the bot.
This relatively low percentage indicates that many website owners either do not yet perceive a conflict or remain unaware of the option to exclude Applebot-Extended.
Ben Welsh, a data journalist, found that just over a quarter of the news websites he surveyed were blocking Applebot-Extended.
This compares to 53 percent of news sites blocking OpenAI’s bot and nearly 43 percent blocking Google’s AI-specific bot, Google-Extended.
Welsh notes that the number of sites blocking Applebot-Extended has been "gradually moving" upward, suggesting increasing awareness and action.
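Surveys like these typically work by downloading each site's robots.txt file and checking whether it disallows the relevant user agent. The following Python sketch shows how such a check could be done with the standard library's robotparser module; it is illustrative only, not the methodology used by Originality AI, Dark Visitors, or Welsh, and the domains shown are placeholders.

from urllib.robotparser import RobotFileParser

def blocks_agent(domain: str, agent: str = "Applebot-Extended") -> bool:
    # Fetch and parse the site's robots.txt, then check whether the
    # given user agent is allowed to fetch the homepage.
    # Note: robotparser's user-agent matching is approximate, so treat
    # the result as a rough signal rather than a definitive answer.
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()
    return not parser.can_fetch(agent, f"https://{domain}/")

# Placeholder domains for illustration only.
for site in ["example.com", "example.org"]:
    print(site, "blocks Applebot-Extended:", blocks_agent(site))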
Strategic Decisions and Partnerships
The decisions made by major publishers to block or allow Applebot-Extended often reflect broader strategic considerations.
Condé Nast, for instance, previously blocked OpenAI’s web crawlers but unblocked them following a recent partnership announcement.
This move suggests a business strategy where data access is negotiated as part of commercial agreements.
Vox Media has similarly opted to block Applebot-Extended and other AI scraping tools unless a partnership is in place, emphasising their intent to protect the value of their published content.
In contrast, The New York Times, which is currently engaged in a lawsuit against OpenAI over copyright issues, has criticised the opt-out nature of Applebot-Extended.
Charlie Stadtlander, NYT’s director of external communications, pointed out:
"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission."
This stance highlights the ongoing debate over how content rights and AI training intersect.
How to Opt Out of Applebot-Extended
For website owners looking to opt out of Apple's AI training, the process is straightforward and uses the standard robots.txt mechanism.
First, locate or create the robots.txt file on your website.
To block the original Applebot entirely, which also stops Apple from crawling your site for its search features, add the following lines:
User-agent: Applebot
Disallow: /
To opt out of AI training only, while remaining available to Apple's search crawler, block Applebot-Extended instead:
User-agent: Applebot-Extended
Disallow: /
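To make the distinction explicit, the two rules can be combined in a single file: Applebot remains allowed so pages stay in Apple's search results, while Applebot-Extended is disallowed so the crawled data is not used for AI training. A minimal example, assuming your site needs no other crawler rules, looks like this:

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /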
Lastly, save the file and upload it to the root directory of your website.
With the Applebot-Extended rule in place, Apple will not use your site's data to train its AI models, though your content will still be accessible for search functions.
As Apple explains:
"Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent."
This adjustment in the digital landscape reflects a broader debate over data rights and the evolving role of AI in content creation and distribution.
The future will likely bring further developments as publishers, tech companies, and AI developers navigate these complex issues.