News publishers are restricting the Internet Archive’s access to their sites over fears of AI scraping. They are blocking the nonprofit’s bots amid growing concern that AI companies are using archived web data without authorization, despite the Archive’s mission to preserve digital history.
News publishers including The Guardian and The New York Times are restricting access for the Internet Archive’s crawlers, citing growing concerns that AI companies are using the nonprofit’s vast digital library to train large language models without authorization. The Internet Archive, which preserves web content through its Wayback Machine and public crawlers, has long been seen as a champion of open access to information. However, its role as a repository of billions of archived webpages has made it a potential backdoor for AI data scraping.

The Guardian has taken proactive steps to limit the Internet Archive’s access to its published articles. According to Robert Hahn, the publisher’s head of business affairs and licensing, access logs showed frequent crawling by the Internet Archive, prompting the publisher to block the Archive’s APIs and filter article URLs out of the Wayback Machine’s interface. Regional and topic pages remain accessible, but the move aims to prevent AI firms from extracting content through the Archive’s structured data sources. Hahn noted that the Archive’s APIs are particularly concerning because they offer a ready-made, organized database that is ideal for AI training, whereas the Wayback Machine’s raw snapshots are less easily exploited.

The New York Times has also implemented a hard block on the Internet Archive’s crawlers, adding the bot identifier archive.org_bot to its robots.txt file in late 2025. The publication emphasized its commitment to protecting its intellectual property and ensuring that its journalism is used lawfully. “We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” a spokesperson said.

Other publishers are following suit. The Athletic, USA Today Co. (formerly Gannett), and several international outlets have blocked one or more Internet Archive bots through their robots.txt files.
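
As a rough illustration of how a robots.txt block of this kind works, the sketch below uses Python’s standard urllib.robotparser module to test whether a given crawler is allowed to fetch a URL. Only the bot identifier archive.org_bot comes from the reporting above. The sample rules, the example.com URL, the second bot name, and the is_blocked helper are all assumptions made for illustration, not any publisher’s actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only.
# It is not copied from any publisher's site.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on a placeholder domain.
    url = "https://example.com/2025/some-article"
    for agent in ("archive.org_bot", "SomeOtherBot"):
        print(agent, "blocked:", is_blocked(SAMPLE_ROBOTS_TXT, agent, url))
```

Note that robots.txt is a voluntary convention: a rule like this only keeps out crawlers that choose to honor it.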
In one notable example, Wayback Machine pages for the Des Moines Register now display a message stating that the URL has been excluded. USA Today Co. reported blocking 75 million AI bots in September 2025 alone, with about 70 million originating from OpenAI.

The Internet Archive has itself struggled with AI-driven traffic. In May 2023, the site went offline temporarily after a surge in automated requests from an AI company using Amazon Web Services overwhelmed its servers. The Archive later blocked the hosts and received a donation from the company after it apologized and stopped the activity.

Despite these challenges, the Internet Archive continues to defend its mission. Founder Brewster Kahle warned that publishers limiting the Archive’s access could erode public access to the historical record, undermining efforts to combat misinformation. The Archive has introduced rate-limiting, filtering, and security tools like Cloudflare to manage bulk access, but it does not currently block any specific bots through its robots.txt file.

An analysis of 1,167 news websites revealed that 241 explicitly disallow at least one of four Internet Archive bots identified by the AI watchdog Dark Visitors. Most of these sites are owned by USA Today Co. (formerly Gannett), which has implemented strict anti-scraping protocols. Of those 241 sites, 226 also block Common Crawl, another major web archive, and nearly all disallow bots from OpenAI and Google AI.

While the Internet Archive remains a critical resource for preserving digital history, its role in the AI era is increasingly contested. Publishers are caught between supporting open access and protecting their content from unauthorized use. As demand for AI training data grows, the tension between preservation and control is likely to intensify.
