"It is a DDOS attack involving tens of thousands of addresses"
It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
There are plenty of providers selling "residential proxies", distributing your crawler traffic through thousands of residential IPs. Bright Data is probably the biggest, but it's a big and growing market.
And if you don't care about the "residential" part you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked.
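To illustrate why data center IPs are easy to block: their ranges are published and fairly stable, so a simple CIDR membership check covers them. A minimal sketch, assuming a hand-maintained blocklist; `DATACENTER_RANGES` and `is_datacenter_ip` are hypothetical names, and the ranges shown are RFC 5737 documentation prefixes standing in for real provider ranges.

```python
import ipaddress

# Hypothetical blocklist. These are RFC 5737 documentation prefixes used as
# placeholders; a real list would come from the hosting providers' published ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any range on the blocklist."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in DATACENTER_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5"):
        print(candidate, is_datacenter_ip(candidate))
```

The same check does nothing against residential proxies, since those addresses sit in ordinary ISP ranges that you cannot block wholesale without locking out real visitors.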
And how do you get those residential IP addresses?
Well, you just need people to install your browser extension. Or your proprietary web browser. Or your mobile app. Or your nice MCP. Maybe get them to add your PPA repository so they automatically install your sneakily overridden package the next time they upgrade their system.
Anything goes as long as your software has access to outgoing TCP port 443, which almost nobody blocks, so running it inside a Docker container or a VM probably won't stop it.
Bright Data specifically offers an SDK that app developers can use to monetize free games. A lot of free games and VPN apps are using it. Check out how they market it, it's wild: https://bright-sdk.com/
In the most charitable case it's some "AI" companies with an XY problem. They want training data so they vibe code some naive scraper (requests is all you need!) and don't ever think to ask if maybe there's some sort of common repository of web crawls, a CommonCrawl if you will.
They don't really need to scrape anything themselves, since CommonCrawl or other content archives would be fine as training data. They just don't think or know to ask for what they really want: training data.
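For what it's worth, pulling pages out of Common Crawl instead of hitting the origin site looks roughly like this. A minimal sketch, assuming the public index server at index.commoncrawl.org and the data host at data.commoncrawl.org; the crawl ID `CC-MAIN-2024-10` is only a placeholder, and the helper names are hypothetical. It builds the requests without sending them.

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build an index query that lists captures matching url_pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def warc_range_request(record: dict) -> tuple[str, dict]:
    """Turn one index record into a (url, headers) pair for a ranged GET.

    Index records carry 'filename', 'offset' and 'length'; the Range header
    selects just that one compressed WARC record from the archive file.
    """
    offset = int(record["offset"])
    length = int(record["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return f"{DATA_HOST}/{record['filename']}", headers


if __name__ == "__main__":
    # Placeholder crawl ID and an illustrative record, not real index output.
    print(index_query_url("CC-MAIN-2024-10", "example.com/*"))
    sample = {"filename": "crawl-data/sample.warc.gz", "offset": "100", "length": "50"}
    print(warc_range_request(sample))
```

Each ranged response is a gzip-compressed WARC record, so the page comes from the archive and the original site never sees the traffic.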
In the least charitable interpretation it's anti-social assholes who have no concept of, or care about, negative externalities and who write awful naive scrapers.