> “I don't understand why the onus is on me to add a new header to my sites opting out of this tool,” Eden said. “Please can you change the default behaviour so that it will only work on sites which set the X-Robots-Tag: YesAI?”
Not respecting robots.txt is idiotic and website maintainers are right to be pissed, but that's not a good-faith proposal. People don't have to opt in to Google crawling their site with a special header; Google just crawls everything unless you add noindex[1].
Because when the megacorporations were just corporations, the web was young.
Now the web is old, and what was interesting and novel at the small scale is tedious and taxing at the large scale.
It's much the same as how it's now a death march to write a new standards-compliant browser, or how you can't really found a new country unless you want to start a bloody war or carve a chunk out of Antarctica.
Sometimes, the rules have changed by the time new things come along.
I guess how I feel is, if 50 new services spring up overnight, each with a different opt-out, that would be annoying for people to deal with. If the website operator isn’t getting value out of the service, they might just block you or put content behind a paywall, and we all lose. =/
Also, the article claims the scraping generated so much traffic that it got reported to the owner as an attack. I’m not claiming anything other than what’s being reported. They should have at least throttled their application to reduce the load (a simple per-domain throttle is sketched below).
If the owner didn’t notice the scraping we might not even be having this conversation.
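To make "throttled their application" concrete, here is a minimal sketch of a per-domain rate limit on the scraper side. It assumes nothing about img2dataset's internals; the interval value and helper name are invented for illustration.

```python
import time
import threading
from urllib.parse import urlparse

# Minimum seconds between requests to the same host. The value is arbitrary;
# a polite scraper would make it configurable (or take it from Crawl-Delay).
MIN_INTERVAL = 1.0

_next_slot = {}           # host -> earliest time the next request may start
_lock = threading.Lock()

def throttle(url):
    """Block until this worker is allowed to hit the URL's host again."""
    host = urlparse(url).netloc
    with _lock:
        now = time.monotonic()
        slot = max(now, _next_slot.get(host, now))
        _next_slot[host] = slot + MIN_INTERVAL
    # Sleep outside the lock so requests to other hosts aren't blocked.
    time.sleep(max(0.0, slot - now))

# Usage: call throttle(url) immediately before every HTTP request.
```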
The established norm is that scrapers download robots.txt and support the standard robots.txt features, notably including `Crawl-Delay`, which sets a rate limit. That file is how websites tell scrapers what the rules are for scraping them (a compliant fetch loop is sketched at the end of this comment).
This tool is scraping sites, it has webmasters reporting actual disruption, and it doesn't support robots.txt. When people complained (e.g. in https://github.com/rom1504/img2dataset/issues/48), the author's stance was basically "PRs welcome". A third party recently contributed a PR to make it respect robots.txt (https://github.com/rom1504/img2dataset/pull/302), though it lacks `Crawl-Delay` support and hasn't been merged yet.
I have seen the same thing with other recent AI tools (e.g. https://github.com/m1guelpf/browser-agent/issues/2), and I think it's important to defend the robots.txt convention and nip this in the bud. If a bot doesn't make a reasonable effort to respect robots.txt and it causes disruption, it's a denial-of-service attack and should be treated as such. No excuses.
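For reference, a minimal sketch of such a compliant fetch loop using Python's standard `urllib.robotparser`, which already understands `Crawl-Delay`. The user-agent string and fallback delay are placeholder choices, not anything img2dataset actually uses.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse, urlunparse

import requests  # any HTTP client works; requests is just an example

USER_AGENT = "example-image-scraper/0.1"  # placeholder UA
DEFAULT_DELAY = 1.0                       # fallback when no Crawl-Delay is set

_parsers = {}  # cache one parser per robots.txt so it is fetched only once

def robots_for(url):
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    if robots_url not in _parsers:
        rp = urllib.robotparser.RobotFileParser(robots_url)
        rp.read()
        _parsers[robots_url] = rp
    return _parsers[robots_url]

def polite_get(url):
    rp = robots_for(url)
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site said no; skip it
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    time.sleep(delay)  # crude: sleep before every request to honor the rate limit
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```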
Not only is this developer unhinged, but they seem to have a pretty weak grasp of how HTTP works, which one would hope would stop them from writing HTTP-intensive applications.
I guess it’s a commentary on the AI gold rush and another product for Cloudflare to sell...
People restricting scraping was one of my predictions. Right now it seems limited to images and their creators, but I would expect other parties to join the blocking fest too, perhaps even newspapers and the like (unless they are compensated somehow). This will have all kinds of interesting implications for AI companies, Internet search, and advertising.
Time for a 'bot motel. The simplest way is to salt your site with links that 404 and rewrite the 404 handler to serve garbage pages containing more links that also 404 (see the sketch below). It can be the simplest way to deal with the compute-resource problem. For bandwidth you need some kind of tarpit, and you may also need to limit simultaneous connections from e.g. specific addresses.
I haven't done it specifically for images. I imagine a tool like this is crawling pages in order to find image links, although it could be using e.g. search engine results for this. Corrupting the data is also an option.
It's a blunt tool but if people don't honor robots.txt I don't need to concern myself with their pearl-clutching notion of morality. (I don't feel honor bound to list every possible evasion in robots.txt anyway.)
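A minimal sketch of that 404 'bot motel, using Flask purely as an example framework; the paths and page shape are invented for illustration, and a real deployment would exempt legitimate users and rate-limit by client address.

```python
import random
import string

from flask import Flask

app = Flask(__name__)

def fake_link():
    """A path that maps to no real page, so following it lands back in the 404 handler."""
    slug = "".join(random.choices(string.ascii_lowercase, k=12))
    return f"/archive/{slug}"

@app.errorhandler(404)
def bot_motel(_error):
    # Serve a cheap garbage page whose links all lead to more 404s,
    # keeping a misbehaving crawler busy without touching real content.
    links = []
    for _ in range(20):
        path = fake_link()
        links.append(f'<a href="{path}">{path}</a><br>')
    body = "<html><body><p>Nothing to see here.</p>" + "\n".join(links) + "</body></html>"
    return body, 404

# Salt your real pages with a few hidden links to fake_link() paths; only
# crawlers that ignore robots.txt will ever wander into the motel.
```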
The project's README does list headers by which sites can opt out, but right after that it documents a flag that lets scrapers ignore those headers. Seems like the author is basically telling everyone to deal with it.
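For site operators who want to set those opt-out headers anyway, here is a sketch of what that looks like server-side (Flask as an example). The exact directive names (`noai`, `noimageai`) are the ones reportedly listed in the README; treat them as an assumption and check the current docs.

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_opt_out_headers(response):
    # Advertise that content here should not be used for AI/image datasets.
    # Only scrapers that check this header (and don't pass the ignore flag)
    # will honor it.
    response.headers["X-Robots-Tag"] = "noai, noimageai"
    return response

@app.route("/")
def index():
    return '<html><body><img src="/static/photo.jpg"></body></html>'
```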
Reading the related GitHub issues, the dev just doesn't seem to understand HTTP or web-crawling etiquette, even before you get into the “actually AI is good for creators” pitches. The damage is probably done: even if this gets fixed, unethical people building datasets will just use the old versions.
Seems pretty clear that it's meant to be malicious compliance with consent, with consent automatically assumed unless you say no to this specific scraper, as though there were even a reasonable chance millions of sites could possibly know about the exact tag.
Please make this tool “opt-in” by default - https://news.ycombinator.com/item?id=35681085 - April 2023 (41 comments)