> “I don't understand why the onus is on me to add a new header to my sites opting out of this tool,” Eden said. “Please can you change the default behaviour so that it will only work on sites which set the X-Robots-Tag: YesAI?”
Not respecting robots.txt is idiotic and website maintainers are right to be pissed, but that's not a good-faith proposal. People don't have to opt in to Google crawling their site with a special header; Google just crawls everything unless you add noindex[1].
Because when the megacorporations were just corporations, the web was young.
Now the web is old, and what was interesting and novel at the small scale is tedious and taxing at the large scale.
It's much the same as how it's now a death march to write a new standards-compliant browser, or how you can't really found a new country unless you want to start a bloody war or carve a chunk out of Antarctica.
Sometimes, the rules have changed by the time new things come along.
I guess how I feel is, if 50 new services spring up overnight, each with a different opt-out, that would be annoying for people to deal with. If the website operator isn’t getting value out of the service, they might just block you or put content behind a paywall, and we all lose. =/
Also, the article claims the scraping generated so much traffic that it got reported to the owner as an attack. I’m not claiming anything other than what’s being reported. They should have at least throttled their application to reduce the load (a simple per-domain throttle is sketched below).
If the owner didn’t notice the scraping we might not even be having this conversation.
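To make "throttled their application" concrete, here is a minimal sketch of a per-domain rate limit on the scraper side. It assumes nothing about img2dataset's internals; the interval value and helper name are invented for illustration.

```python
import time
import threading
from urllib.parse import urlparse

# Minimum seconds between requests to the same host. The value is arbitrary;
# a polite scraper would make it configurable (or take it from Crawl-Delay).
MIN_INTERVAL = 1.0

_next_slot = {}           # host -> earliest time the next request may start
_lock = threading.Lock()

def throttle(url):
    """Block until this worker is allowed to hit the URL's host again."""
    host = urlparse(url).netloc
    with _lock:
        now = time.monotonic()
        slot = max(now, _next_slot.get(host, now))
        _next_slot[host] = slot + MIN_INTERVAL
    # Sleep outside the lock so requests to other hosts aren't blocked.
    time.sleep(max(0.0, slot - now))

# Usage: call throttle(url) immediately before every HTTP request.
```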
The established norm is that scrapers download robots.txt and support the standard robots.txt features, notably including `Crawl-Delay`, which sets a rate limit. That file is how websites tell scrapers what the rules are for scraping them (a compliant fetch loop is sketched at the end of this comment).
This tool is scraping sites, it has webmasters reporting actual disruption, and it doesn't support robots.txt. When people complained (e.g. in https://github.com/rom1504/img2dataset/issues/48), the author's stance was basically "PRs welcome". A third party recently contributed a PR to make it respect robots.txt (https://github.com/rom1504/img2dataset/pull/302), though it lacks `Crawl-Delay` support and hasn't been merged yet.
I have seen the same thing with other recent AI tools (e.g. https://github.com/m1guelpf/browser-agent/issues/2), and I think it's important to defend the robots.txt convention and nip this in the bud. If a bot doesn't make a reasonable effort to respect robots.txt and it causes disruption, it's a denial-of-service attack and should be treated as such. No excuses.
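For reference, a minimal sketch of such a compliant fetch loop using Python's standard `urllib.robotparser`, which already understands `Crawl-Delay`. The user-agent string and fallback delay are placeholder choices, not anything img2dataset actually uses.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse, urlunparse

import requests  # any HTTP client works; requests is just an example

USER_AGENT = "example-image-scraper/0.1"  # placeholder UA
DEFAULT_DELAY = 1.0                       # fallback when no Crawl-Delay is set

_parsers = {}  # cache one parser per robots.txt so it is fetched only once

def robots_for(url):
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    if robots_url not in _parsers:
        rp = urllib.robotparser.RobotFileParser(robots_url)
        rp.read()
        _parsers[robots_url] = rp
    return _parsers[robots_url]

def polite_get(url):
    rp = robots_for(url)
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site said no; skip it
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    time.sleep(delay)  # crude: sleep before every request to honor the rate limit
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```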
Not only is this developer unhinged, but they seem to have a pretty weak grasp of how HTTP works, which one would hope would stop them from writing HTTP-intensive applications.
I guess it’s a commentary on the AI gold rush and another product for Cloudflare to sell...
People restricting scraping was one of my predictions. Right now it seems limited to images and their creators, but I would expect other parties to join the blocking fest too, perhaps even newspapers and the like (unless they are compensated somehow). This will have all kinds of interesting implications for AI companies, Internet search, and advertising.
Time for a 'bot motel. The simplest way is to salt your site with links that 404 and rewrite the 404 handler to serve garbage pages containing more links that also 404 (see the sketch below). It can be the simplest way to deal with the compute-resource problem. For bandwidth you need some kind of tarpit, and you may also need to limit simultaneous connections from e.g. specific addresses.
I haven't done it specifically for images. I imagine a tool like this is crawling pages in order to find image links, although it could be using e.g. search engine results for this. Corrupting the data is also an option.
It's a blunt tool but if people don't honor robots.txt I don't need to concern myself with their pearl-clutching notion of morality. (I don't feel honor bound to list every possible evasion in robots.txt anyway.)
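A minimal sketch of that 404 'bot motel, using Flask purely as an example framework; the paths and page shape are invented for illustration, and a real deployment would exempt legitimate users and rate-limit by client address.

```python
import random
import string

from flask import Flask

app = Flask(__name__)

def fake_link():
    """A path that maps to no real page, so following it lands back in the 404 handler."""
    slug = "".join(random.choices(string.ascii_lowercase, k=12))
    return f"/archive/{slug}"

@app.errorhandler(404)
def bot_motel(_error):
    # Serve a cheap garbage page whose links all lead to more 404s,
    # keeping a misbehaving crawler busy without touching real content.
    links = []
    for _ in range(20):
        path = fake_link()
        links.append(f'<a href="{path}">{path}</a><br>')
    body = "<html><body><p>Nothing to see here.</p>" + "\n".join(links) + "</body></html>"
    return body, 404

# Salt your real pages with a few hidden links to fake_link() paths; only
# crawlers that ignore robots.txt will ever wander into the motel.
```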
The project's README does list headers by which sites can opt out, but right after that it documents a flag that lets scrapers ignore those headers. Seems like the author is basically telling everyone to deal with it.
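For site operators who want to set those opt-out headers anyway, here is a sketch of what that looks like server-side (Flask as an example). The exact directive names (`noai`, `noimageai`) are the ones reportedly listed in the README; treat them as an assumption and check the current docs.

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_opt_out_headers(response):
    # Advertise that content here should not be used for AI/image datasets.
    # Only scrapers that check this header (and don't pass the ignore flag)
    # will honor it.
    response.headers["X-Robots-Tag"] = "noai, noimageai"
    return response

@app.route("/")
def index():
    return '<html><body><img src="/static/photo.jpg"></body></html>'
```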
Reading the related GitHub issues, the dev just doesn't seem to understand HTTP or web-crawling etiquette, even before you get into the “actually AI is good for creators” pitches. The damage is probably done: even if this gets fixed, unethical people building datasets will just use the old versions.
Seems pretty clear that it's meant to be malicious compliance with consent, with consent automatically assumed unless you say no to this specific scraper, as though there were even a reasonable chance millions of sites could possibly know about the exact tag.
Please make this tool “opt-in” by default - https://news.ycombinator.com/item?id=35681085 - April 2023 (41 comments)