Wikimedia is under stress from AI bots

Wikipedia’s servers are under strain from persistent AI scraping, the Wikimedia Foundation said on Tuesday. Since January 2024, automated bots harvesting training data for large language models have been consuming terabytes of data, driving a 50 percent increase in the foundation’s bandwidth used for multimedia downloads. As we have previously reported, this problem is well known across the free and open source software (FOSS) community.

Beyond Wikipedia, the Foundation hosts sites like Wikimedia Commons, which provides 144 million media files under open licenses. That content has powered everything from school projects to search results for decades. Since early 2024, however, AI companies have dramatically ramped up automated scraping through bulk downloads, APIs, and direct crawling to feed their data-hungry models. The exponential growth in non-human traffic has imposed significant technical and financial costs, often without the attribution that keeps Wikimedia’s volunteer ecosystem afloat.

The impact isn’t hypothetical. According to the foundation, former US President Jimmy Carter’s Wikipedia page predictably received millions of views after his death in December 2024. The real strain came when visitors simultaneously streamed a 1.5-hour video of his 1980 debate from Wikimedia Commons. The surge doubled Wikimedia’s normal network load and temporarily saturated several of its Internet connections. The incident exposed a deeper problem: bots scraping media at scale had already consumed much of the baseline capacity. Wikimedia engineers quickly rerouted traffic to relieve the congestion.

The FOSS community has grown all too familiar with this behavior. After similar scraping incidents reported by Ars Technica, Fedora’s Pagure repository blocked all connections from Brazil. GNOME’s GitLab instance deployed proof-of-work challenges to filter out excessive bot access. Read the Docs cut its bandwidth costs dramatically after blocking AI crawlers.

Wikimedia’s internal analytics show why this traffic is so costly for open projects. Humans tend to view popular, regularly cached pages; bots crawl obscure, rarely visited ones, forcing Wikimedia’s core datacenters to serve them directly. Caching systems designed for predictable human browsing patterns break down when bots traverse the entire archive at random.
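The effect is easy to demonstrate. The sketch below (a toy model, not Wikimedia’s actual caching stack) runs skewed "human-like" traffic and uniform "bot-like" traffic through the same small LRU cache; the archive size, request counts, and popularity distribution are all illustrative assumptions:

```python
import random
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache that counts hits and misses for a fixed capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def request(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # refresh recency on a hit
            self.hits += 1
        else:
            self.misses += 1             # miss: the origin datacenter serves it
            self.store[key] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least recently used

def hit_rate(pages, requests, pick_page, capacity=1_000):
    cache = LRUCache(capacity)
    for _ in range(requests):
        cache.request(pick_page(pages))
    return cache.hits / requests

random.seed(42)
PAGES = 100_000  # hypothetical archive size

# Humans concentrate on popular pages (modelled as a heavy-tailed choice).
human = hit_rate(PAGES, 50_000, lambda n: int(random.paretovariate(1.2)) % n)
# Bots crawl the archive uniformly, touching obscure pages.
bot = hit_rate(PAGES, 50_000, lambda n: random.randrange(n))

print(f"human-like hit rate: {human:.0%}")
print(f"bot-like hit rate:   {bot:.0%}")
```

With these (made-up) parameters, the skewed workload hits the cache the vast majority of the time, while the uniform crawl misses almost always, so nearly every bot request falls through to the origin.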

As a result, Wikimedia found that while bots account for only about 35 percent of total pageviews, they generate 65 percent of the most expensive requests to its core infrastructure. That disparity captures a key technical insight: bot requests cost far more to serve than human ones, and the difference adds up fast.
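Taking the article’s two figures at face value, a quick back-of-the-envelope calculation shows just how lopsided the per-request cost is:

```python
bot_share_of_traffic = 0.35   # bots: ~35% of pageviews (figure from the article)
bot_share_of_costly = 0.65    # bots: 65% of the most expensive requests

# Expensive requests generated per unit of traffic, bots vs. humans.
bot_rate = bot_share_of_costly / bot_share_of_traffic              # ~1.86
human_rate = (1 - bot_share_of_costly) / (1 - bot_share_of_traffic)  # ~0.54

ratio = bot_rate / human_rate
print(f"a bot request is ~{ratio:.1f}x more likely to be costly than a human one")
```

In other words, under these numbers a typical bot request is roughly 3.4 times more likely than a human request to land in the expensive tier.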

Crawlers avoiding detection

Making matters worse, many AI-focused crawlers don’t play by the rules. Some ignore robots.txt directives. Some spoof browser user agents to masquerade as human visitors. Some even rotate through residential IP addresses to evade blocking. These tactics have become so widespread that individual developers, such as Xe Iaso, have been forced to take extreme measures to protect their code repositories.
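For contrast, honoring robots.txt takes only a few lines with Python’s standard library. The sketch below shows what a compliant crawler is supposed to do before each fetch; the bot name, domain, and rules are illustrative, not any real site’s policy:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a bot named "ExampleBot".
rules = """\
User-agent: ExampleBot
Disallow: /w/
Crawl-delay: 10
"""

robots = RobotFileParser()
robots.parse(rules.splitlines())
robots.modified()  # mark the rules as loaded; can_fetch() refuses everything otherwise

# A well-behaved crawler asks before every request:
print(robots.can_fetch("ExampleBot", "https://example.org/wiki/Main_Page"))  # True
print(robots.can_fetch("ExampleBot", "https://example.org/w/index.php"))     # False
print(robots.crawl_delay("ExampleBot"))  # 10 seconds between requests
```

The crawlers described above simply skip this check, or fetch the rules and ignore them.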

As a result, Wikimedia’s Site Reliability team is perpetually on guard. Every hour spent throttling bots or absorbing traffic spikes is time not spent supporting Wikimedia’s users, contributors, or technical improvements. And the strain isn’t limited to content platforms: scrapers routinely target developer infrastructure such as Wikimedia’s code review tools and bug trackers, draining still more resources and attention.

These problems echo a broader pattern across the AI scraping landscape. Daniel Stenberg, the creator of curl, has described how fake, AI-generated bug reports waste maintainers’ time. Drew DeVault of SourceHut notes on his blog how bots hammer expensive endpoints, such as git logs, far harder than human developers ever would.

Across the Internet, open platforms are experimenting with technical countermeasures: crowdsourced crawler blocklists (like “ai.robots.txt”), slow-response tarpits (like Nepenthes), proof-of-work challenges, and commercial products like Cloudflare’s AI Labyrinth. All of them tackle the same mismatch: industrial-scale AI training demands running against infrastructure built for human readers.
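The proof-of-work idea is worth unpacking, since it underlies tools like the challenges GNOME deployed. The server hands out a random seed; the client must burn CPU searching for a nonce whose hash meets a difficulty target, which the server verifies with a single hash. The cost is negligible for one visitor but adds up for a crawler making millions of requests. This is a toy hashcash-style sketch, not any specific project’s actual protocol:

```python
import hashlib
import itertools

def solve_challenge(seed: str, difficulty: int) -> int:
    """Find a nonce whose SHA-256 over "seed:nonce" starts with
    `difficulty` zero hex digits. Expected work: ~16**difficulty hashes."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(seed: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check the client really did the work."""
    digest = hashlib.sha256(f"{seed}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

# Client solves (≈65k hashes on average at difficulty 4), server verifies.
nonce = solve_challenge("session-abc123", difficulty=4)
print(verify("session-abc123", nonce, 4))  # True
```

Raising the difficulty tunes the trade-off: high enough to deter bulk scraping, low enough that a legitimate browser solves it in well under a second.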

Open commons at risk

Wikimedia provides its content under free licenses because it believes in offering “knowledge as a service.” But the Foundation is blunt about the distinction: “Our content is free, our infrastructure is not.”

Under a new initiative called WE5: Responsible Use of Infrastructure, the organization is now focusing on systemic solutions. It poses key questions: how to steer developers toward less resource-intensive access patterns, and how to set sustainable limits without sacrificing openness.
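What "less resource-intensive access" means in practice is mundane: identify yourself and pace your requests. The sketch below is a generic illustration of those two habits, not an official Wikimedia client; the class name, user agent, and interval are all made up:

```python
import time

class PoliteClient:
    """Sketch of a considerate bulk fetcher: an identifying User-Agent with
    contact info, and at most one request per `min_interval` seconds."""

    def __init__(self, user_agent: str, min_interval: float = 1.0):
        self.user_agent = user_agent          # sent with every request
        self.min_interval = min_interval
        # Pretend the last request was one interval ago so the first
        # call proceeds immediately.
        self._last_request = time.monotonic() - min_interval

    def throttle(self):
        """Sleep just long enough to keep requests `min_interval` apart."""
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

# Hypothetical usage: three fetches spaced at least 0.1 s apart.
client = PoliteClient("ExampleResearchBot/1.0 (contact@example.org)",
                      min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    client.throttle()  # a real HTTP fetch would go here
elapsed = time.monotonic() - start
print(f"3 throttled requests took {elapsed:.2f}s")
```

For genuinely bulk needs, pacing individual page fetches is still the wrong tool; platform-provided bulk exports (such as database dumps) move the same data at a fraction of the serving cost.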

The challenge is bridging the gap between commercial AI development and open knowledge repositories. Many companies train commercial models on publicly available knowledge without supporting the infrastructure that makes that knowledge available, and that mismatch threatens the viability of community-run platforms.

Closer collaboration between AI developers and resource providers could address these problems through more efficient access patterns, shared infrastructure funding, or dedicated APIs. Without that kind of practical cooperation, the very platforms that made AI’s progress possible may struggle to keep providing reliable service. Wikimedia’s warning is explicit: freedom of access does not mean freedom from consequences.
