AI companies, and AI scrapers in particular, are a cancer destroying what's left of the WWW.
I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.
- About 400,000 different IP addresses over about 3 hours
- Mostly residential IP addresses
- Valid and unique user agents and referrers
- Each IP address would make only a few requests with a long delay in between requests
It would hit the server hard until the server became slow to respond, then back off for about 30 seconds, then hit hard again. I was able to block most of the requests with a combination of user agent and referrer patterns, though some legitimate users may have been blocked as well.
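Pattern-based blocking like that can be sketched as a small request filter. This is a minimal illustration, not the actual rules used; the patterns below are hypothetical placeholders that would have to be tuned to the real attack traffic:

```python
import re

# Hypothetical patterns -- the real ones depend on what the attack logs show.
UA_PATTERNS = [
    re.compile(r"python-requests|curl|scrapy", re.I),
    re.compile(r"HeadlessChrome", re.I),
]
REFERRER_PATTERNS = [
    # Example of a referrer domain you might see repeated across the botnet.
    re.compile(r"^https?://(?:www\.)?referrer-farm\.example/", re.I),
]

def should_block(user_agent: str, referrer: str) -> bool:
    """Return True if the request matches any known-bad UA or referrer pattern."""
    if any(p.search(user_agent) for p in UA_PATTERNS):
        return True
    if any(p.search(referrer) for p in REFERRER_PATTERNS):
        return True
    return False
```

In practice this logic would live in the web server or a reverse proxy rather than application code, so blocked requests are dropped before they touch the backend.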
The attack was annoying, but the even bigger problem is that the data on this website is under license: we have to pay for it, and it's not cheap. We are able to cover that cost (barely) with advertising revenue and some subscriptions.
If everyone gets this data through their "agent" or a scraper, that means no advertising revenue, and soon enough no more website to scrape: jobs lost, nowhere for scrapers to get the data, nowhere for legit users to get it for free, and so on.
Thanks for sharing the perspective here. I think a lot of folks on HN have rightly said that many of the problems with the modern internet are due to the ad-supported business model. I don't think we were ever going to move away from it voluntarily -- too many people support it, even if they grumble about it.
But maybe (and likely for worse) LLMs will finally kill this model.
I would love for the ad-supported model to die. I hate ads, and I hate having to serve ads. We get some subscription users but nowhere near enough to cover costs.
Unfortunately, what I think will happen - and indeed already is happening - is that the AI companies themselves will replace much of the WWW. Sites like the one I am talking about will cease to exist. AI companies, once they can no longer scrape (steal) the data, will end up licensing it themselves and replacing us as the distributor to end users, perhaps as a subscription add-on or with an ad-based model of their own.
Which to some may be fine. Personally, I don't want a few centralized AI companies replacing the hundreds of thousands of independent websites online. Way too much centralized power there.
I much prefer having my thoughts distilled down into easily digestible and agreeable idioms that I can push around with absolute faith that they weren't just lies written by some PERSON on the internet.
Do you not run Anubis or have strict fail2ban rules? I just straight up ban IPs forever if they look up files that will never exist on my servers. That plus Anubis with the strictest settings.
Fail2ban doesn't scale well to these volumes of traffic and request patterns.
It's the same reason fail2ban is not very useful against a DDoS attack where each unique IP makes only a few requests with a large (hour-plus) delay between them: there is no clear "fail" to match on, and the fail2ban database becomes huge and far too slow.
- 400,000 Unique IP addresses
- 1 to 3 requests per hour per IP address, with delays of over 60 minutes between each request
- Legit request URLs, legit UA & referrer
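To make the scaling problem concrete, here is a toy simulation of a fail2ban-style per-IP threshold (using assumed defaults; real fail2ban parses logs and shells out to the firewall). On this traffic pattern, no IP ever trips the threshold, yet the tracker still has to hold state for all 400,000 of them:

```python
from collections import defaultdict

MAXRETRY = 5    # a typical fail2ban default: ban after 5 hits...
FINDTIME = 600  # ...within a 10-minute window

def simulate(events):
    """events: iterable of (timestamp, ip) with timestamps increasing per IP.
    Returns (banned IPs, number of IPs being tracked)."""
    hits = defaultdict(list)
    banned = set()
    for ts, ip in events:
        # Keep only hits inside the window, then record this one.
        hits[ip] = [t for t in hits[ip] if ts - t <= FINDTIME] + [ts]
        if len(hits[ip]) >= MAXRETRY:
            banned.add(ip)
    return banned, len(hits)

# 400,000 IPs, 3 requests each, spaced 65 minutes apart (well outside FINDTIME).
events = ((ip + k * 65 * 60, ip) for ip in range(400_000) for k in range(3))
banned, tracked = simulate(events)
# No IP is ever banned, but 400,000 entries of state accumulate anyway.
```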
Maybe Anubis would help, but it's also a risk for various reasons.
The more sophisticated bots run real headless browsers that Anubis can't touch, and they only follow links that are actually visible on the page, so they wouldn't trip fail2ban either.
They even sell access to proxy servers that automatically evade Cloudflare captchas.
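For context on why headless browsers defeat this class of defense: Anubis-style challenges are essentially client-side proof-of-work, so any client that executes JavaScript simply pays the CPU cost and passes. A simplified sketch of the idea (the real protocol differs in its details):

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce whose SHA-256 digest (with the challenge)
    starts with `difficulty` zero hex digits. Pure brute force."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check the submitted nonce."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry (expensive to solve, cheap to verify) taxes high-volume scraping, but a bot farm that already burns CPU on real browsers just absorbs the cost.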
What I don't understand is why a bot/scraper needs to load every page and image multiple times within the same hour, or whatever session it's running against my site. If I have, say, 10 pages and 100 images, surely 110 requests should be all it needs to load everything.
At some point there needs to be a check for whether it's a real human... But it's a cat-and-mouse game: any barrier we create to keep bots out gets worked around by clever engineers.
Hard disagree: it's very easy for a bot to use a credit card. Card numbers are often stolen, they're handed to teenagers these days, and they can be owned by businesses or exist entirely virtually... so you can't assume the use of a credit card ties back to legitimate use by a single person.
Companies would offer all-you-can-DDoS plans at $20/bot per month if they could. Bots are only a problem to them because they prevent legitimate customers from handing over their credit card.