Web-crawling bots have become an unbearable burden on the Wikimedia community, thanks to their insatiable appetite for online content with which to train AI models.
Representatives of the Wikimedia Foundation, which oversees Wikipedia and similar community projects, say that since January 2024, the bandwidth consumed by requests for multimedia files has grown by 50 percent.
“This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons catalog of openly licensed images to feed images to AI models,” said Birgit Mueller, Chris Danis, and Giuseppe Lavagetto of the Wikimedia Foundation in a public post.
“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.”
According to the Wikimedians, at least 65 percent of the traffic for the most expensive content served by the Wikimedia Foundation's data centers is generated by bots, even though these software agents account for only 35 percent of page views.
This is a consequence of the Wikimedia Foundation's caching scheme, which distributes popular content to regional data centers around the world for better performance. Bots, however, visit pages without regard for their popularity, and their requests for less popular content mean that material must be fetched from the core data center, which consumes more computing resources.
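The economics of that caching scheme can be illustrated with a minimal sketch (this is not Wikimedia's actual code, and the class and page names are invented for illustration): a regional cache serves repeat requests cheaply, while every miss falls through to the costly core data center. Humans clustering on popular pages hit the cache; a crawler walking the long tail never does.

```python
class RegionalCache:
    """Toy two-tier cache: hits are cheap, misses go to the core data center."""

    def __init__(self, origin_fetch):
        self.store = {}               # cached page -> content
        self.origin_fetch = origin_fetch
        self.origin_hits = 0          # expensive fetches from the core DC

    def get(self, page):
        if page not in self.store:    # cache miss: fall through to origin
            self.origin_hits += 1
            self.store[page] = self.origin_fetch(page)
        return self.store[page]


origin = lambda page: f"<html>{page}</html>"   # stand-in for the core DC
cache = RegionalCache(origin)

# Human-like traffic: 1,000 requests concentrated on one popular page.
for _ in range(1000):
    cache.get("Main_Page")

# Crawler-like traffic: 1,000 requests, each for a different obscure page.
for i in range(1000):
    cache.get(f"Obscure_Article_{i}")

print(cache.origin_hits)  # 1001: one miss for Main_Page, then 1,000 for the crawl
```

The same request volume produces a thousand times more origin traffic when spread across unpopular pages, which is why bots can be 35 percent of page views yet dominate the most expensive traffic.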
The burden of badly behaved bots has been a common complaint over the past year among those operating computing infrastructure for open source projects, as the Wikimedians themselves noted, pointing to our recent report on the issue.
Last month, SourceHut, a Git hosting service, called out over-aggressive web crawlers that snarf data for AI companies. Diaspora developer Dennis Schubert, repair site iFixit, and ReadTheDocs have also objected to aggressive AI crawlers, among others.
Most websites accept the need to provide bandwidth for bot requests as a cost of doing business, because these scripted visits help make content easier to discover by indexing it for search engines.
But since ChatGPT came online and generative AI took off, bots have become more inclined to slurp entire websites for content used to train AI models. And those models can end up as commercial competitors, offering the aggregated knowledge they have gathered for a subscription fee or for free. Either scenario has the potential to reduce the need for the source website, or for the search queries that generate online advertising revenue.
In its 2025/2026 annual planning document, under a section on responsible use of infrastructure, the Wikimedia Foundation cites the goal of “reducing the amount of traffic generated by scrapers by 20% when measured in terms of request rate and by 30% in terms of bandwidth.”
Noting that Wikipedia and its Wikimedia Commons multimedia repository are invaluable for training machine learning models, the planning document says, “We must prioritize whom we serve with these resources, and we want to promote human consumption and prioritize support for the Wikimedia projects and contributors with our scarce resources.”
How that is to be accomplished, beyond the targeted interventions already undertaken by site reliability engineers to block the most egregious bots, is left to the imagination.
While concern about abusive AI content harvesting has been an issue for some time, a number of tools have emerged to thwart aggressive crawlers. These include data poisoning projects such as Glaze, Nightshade, and ArtShield, and network-based tools such as Kudurru, Nepenthes, AI Labyrinth, and Anubis.
Last year, as dissatisfaction on the web with AI crawlers from the major AI players (Google, OpenAI, and Anthropic, among others) came to a head, there were efforts to provide ways of preventing AI crawlers from visiting websites through the application of robots.txt directives.
But these directives, stored at the root of websites so they can be read by arriving web bots, are neither universally deployed nor universally respected. This opt-in defensive protocol, unless written with a wildcard to cover all possibilities, fails when a name change is all that's needed to evade a blocklist. A common claim among website operators is that badly behaved bots misidentify themselves as Googlebot or another widely tolerated crawler so that they won't be blocked.
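The evasion-by-renaming problem can be demonstrated with Python's standard library robots.txt parser. The rules and bot names below are invented for illustration, not taken from any real site's robots.txt, and of course the rules only matter at all if the crawler chooses to run such a check:

```python
from urllib.robotparser import RobotFileParser

# A blocklist that names one crawler: only bots that honestly
# identify as "GPTBot" are affected by it.
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

print(rp.can_fetch("GPTBot", "https://example.org/wiki/Foo"))     # False: the named bot is blocked
print(rp.can_fetch("SneakyBot", "https://example.org/wiki/Foo"))  # True: a renamed crawler walks right in

# Only a wildcard entry covers every name, honest or not.
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
print(rp.can_fetch("SneakyBot", "https://example.org/wiki/Foo"))  # False
```

Even the wildcard is only a request: compliance is voluntary, which is why operators have turned to the network-level countermeasures mentioned above.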
Wikipedia.org, for example, doesn't bother to block the AI crawlers of Google, OpenAI, or Anthropic in its robots.txt file. It blocks a number of bots deemed annoying for their penchant for slurping entire sites, but has not included entries for the big commercial AI outfits.
The Register asked the Wikimedia Foundation why it doesn't ban crawlers more extensively. ®