Many News Sites Are Blocking AI Web Crawlers, New Research Shows

Senior Advisor, Ethics & Journalism Initiative

March 21, 2024

Nearly 80% of top news organizations in the U.S. were blocking OpenAI’s web crawlers from scraping data from their websites at the end of 2023, while 36% were blocking Google’s artificial-intelligence crawler, according to new research by the Reuters Institute.

Globally, legacy print publications were most likely to block the AI crawlers, while “digital-born” outlets were the least likely, according to the research.

AI companies use web crawlers to gather, or “scrape,” data, prose, images, blog posts, tweets and other content from all types of websites. The result is a vast repository of examples from which their generative AI tools can fashion new content, such as responses to queries submitted through an interface like ChatGPT.

News publishers are prohibiting web crawlers from accessing their websites for several reasons. The New York Times, for instance, wants to be paid for the use of its content to train AI, amid concerns about lost compensation if a model does not attribute information back to its websites. Other outlets worry that incorrect outputs from AI models, if attributed to their publications, could damage trust in the news brand.

The Reuters Institute report suggests that news organizations may be blocking OpenAI more than Google at this time because OpenAI’s ChatGPT is “more prominent and widely used” than Google’s Bard/Gemini, or “it could be because the OpenAI Crawler was released first.” The report also noted that publishers may be “more cautious about blocking Google in case it affects their prominence in search results – even though there are separate crawlers for search and AI.”
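For readers curious about how a site can block an AI crawler while leaving its search crawler alone, here is a minimal sketch in Python. It assumes the block is expressed in a site's robots.txt file, which is the usual place for such rules but is not described in this article. GPTBot and Google-Extended are the user-agent tokens OpenAI and Google have published for AI-related crawling, and Googlebot is Google's search crawler. The sample file, the example.com URL and the blocked_agents helper are hypothetical and written only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only;
# it is not copied from any real publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "Google-Extended"]
SEARCH_CRAWLERS = ["Googlebot"]


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the user agents that the given robots.txt text bars from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print("AI crawlers blocked:", blocked_agents(SAMPLE_ROBOTS_TXT, AI_CRAWLERS))
    print("Search crawlers blocked:", blocked_agents(SAMPLE_ROBOTS_TXT, SEARCH_CRAWLERS))
```

In this sample, the two AI tokens are disallowed while the search crawler falls through to the catch-all rule and remains allowed. That mirrors the report's point that search and AI crawling can be controlled separately.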

You can read Richard Fletcher’s full report on how organizations are blocking web crawlers and what the implications could be for news brands and AI companies in the future.