And the AI scraper pandemic has since continued, spurred by more and more players entering the ring of AI text regurgitation, fueled by VC money. It also feels like the scrapers have become a lot more aggressive since then.
Since one of the things I host is a large archive of old levels for a sandbox game, totalling tens of thousands of pages, a lot of aggressive crawlers will go absolutely insane on it. And rate limits be damned when the requests come from many different IP addresses.
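For context, per-IP rate limiting in nginx looks roughly like the sketch below (the zone name, path and numbers are made up for illustration), and it's exactly the kind of thing that stops helping once every request arrives from a fresh address:

# In the http {} block: track request rate per client IP (names and numbers here are illustrative)
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

# In the location {} serving the archive: throttle each IP, allowing small bursts
limit_req zone=perip burst=10 nodelay;

Each crawler IP gets its own bucket, so a crawl spread across thousands of addresses sails right under it.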
I usually run my access logs through GoAccess every now and then to see what crawlers it picks up, then add the new ones to my ever-growing shitlist of crawlers in my nginx config that scrape with no real purpose other than to be a nuisance (search indexers are generally fine, GoogleBot is a necessary evil but pretty well-behaved all things considered, Mastodon federation pings are also usually fine):
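# Goes anywhere inside a server {} or location {} block; ~* makes the match case-insensitive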
if ($http_user_agent ~* (AmazonBot|AhrefsBot|DataForSeoBot|SemrushBot|Barkrowler|Bytespider|DotBot|fidget-spinner-bot|my-tiny-bot|Go-http-client/1.1|ClaudeBot|GPTBot|Scrapy|heritrix|Awario)) {
    return 403;
}
Poisoning the AI training sounds like a noble goal, but I'm afraid it's probably futile; telling them to go away entirely is probably more effective than feeding them poison IMO. I've gone back and forth on which error code to give them, between 402 Payment Required, nginx's special 444 (which severs the connection immediately without a response), and 406 Not Acceptable. I ended up with 403, since some of the others like 444 just seem to make a crawler even more intrigued and keep it retrying over and over, while a big 403 seems to make it lose interest and go away immediately.
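For reference, the connection-severing variant is just the same block with a different return code (blocklist shortened here for illustration):

if ($http_user_agent ~* (ClaudeBot|GPTBot)) {
    # 444 is nginx-specific: close the connection without sending any response at all
    return 444;
}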
Now there are also self-hostable things you can put in front of your website to challenge visitors, such as Anubis, which acts as an L7 DDoS mitigation by giving the browser a PoW challenge to complete à la hashcash. But obviously that will also break things such as RSS feeds and APIs, and it's generally just a last resort for heavier webapps that straight up cannot stay up in the current climate.
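The nginx side of such a setup is just a reverse proxy to the challenge service, which is also exactly why feeds and API clients get caught in it unless you carve out exceptions for those paths. A minimal sketch, assuming the challenge service listens on a local port of your choosing and forwards verified traffic to the real backend:

location / {
    # Hand everything to the challenge proxy; it serves the PoW page to new visitors
    # and passes requests that carry a valid solution on to the actual site.
    proxy_pass http://127.0.0.1:8923;  # placeholder address, use whatever your challenge service binds to
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}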
Most of the website stuff I host is written from the ground up by me, and I still have that Acmlmboard lean-and-mean energy in me, with fast page rendering times shown in the footer. But sometimes it still just becomes too much and clogs up the available PHP-FPM workers or grinds the MariaDB database to a halt, so staying on top of the new scrapers that show up generally makes sense. Seeing the likes of Kafuka and Kuribo64, which show all the bots on the online users page, really drives home how many IPs they are willing to burn through to collect their precious training data.