Evil OpenAI web crawler


fruityloops
Posted on 08-07-23 04:34 PM Link | #101282
OpenAI is using crawlers to yank training data from your sites, without regard for the licensing of the content they're stealing.

So if you want to prevent this, you can put the following in your robots.txt:

User-agent: GPTBot
Disallow: /

Alternatively, you can feed it garbage to fuck up their training data (example with an nginx config):

if ($http_user_agent ~* "GPTBot") {
    # realistically, you would put something more convincing than just a keyboard smash
    return 200 'asdlkfjsdjklfjsdlkfjsdkjfhgdfskjhgfd';
}

HEYimHeroic
Posted on 08-11-23 03:05 AM Link | #101294
huh, i always wondered if there was a way to stop openai from vacuuming my website. thanks! this will be helpful.

____________________
who would come back from inactivity out of nowhere just to edit their signature? yeah, likely story.

Digital Cheese
Posted on 08-11-23 03:16 AM Link | #101296
Probably a stupid question, but since I'm not hosting my site on my own servers and just using Neocities, does it support robots.txt files, or is that something I'd have to host myself for it to work? If it works regardless, I'm gonna add it to my website, because I don't want OpenAI vacuuming this shit even if it's just a static website.

____________________
My Website

fruityloops
Posted on 08-11-23 10:53 AM Link | #101297
yep

Digital Cheese
Posted on 08-11-23 08:52 PM Link | #101298
Posted by fruityloops
yep
Let's goooooooo, gonna add it to any future websites I create along with any existing ones that I have.

____________________
My Website

fruityloops
Posted on 08-20-23 06:16 PM Link | #101304
Oh, and ByteDance does the same thing (they're like 90% of the bots on this website). They'll cost you gigabytes of bandwidth per hour, so if you want to block them too, block 'Bytespider'.
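A minimal sketch of what that could look like, reusing the patterns from above. Note that Bytespider is widely reported to ignore robots.txt, so the server-side rule is the one that actually bites:

User-agent: Bytespider
Disallow: /

if ($http_user_agent ~* "Bytespider") {
    return 403;
}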

Digital Cheese
Posted on 08-21-23 09:40 PM Link | #101305
Posted by fruityloops
Oh, and ByteDance does the same thing (they're like 90% of the bots on this website). They'll cost you gigabytes of bandwidth per hour, so if you want to block them too, block 'Bytespider'.
I'm pretty sure ByteDance is the company behind TikTok, so if that's true, that's even better to be able to block :D

____________________
My Website

Generic aka RSDuck
Posted on 08-21-23 09:50 PM Link | #101306
the most malicious (in a good way) thing to do would be to feed back text already generated by the model.

That is hard to filter out and will subtly degrade the model's quality when it's trained on it.
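A rough sketch of how that could be wired up with the nginx config from earlier: serve a pre-generated file of model output instead of the inline string (the path and filename here are made up for illustration):

if ($http_user_agent ~* "GPTBot") {
    rewrite ^ /decoy.html last;
}

location = /decoy.html {
    # hypothetical file of pre-generated model text; regenerate it periodically
    internal;
    root /var/www/decoy;
}

The internal directive keeps regular visitors from reaching the decoy page directly; only the internal rewrite can serve it.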

Ralsei
Posted on 05-14-25 04:31 PM Link | #102207
And the AI scraper pandemic has continued since then, spurred on by more and more companies entering the ring of AI text regurgitation, fueled by VC money. It also feels like the scrapers have become a lot more aggressive.

Since one of the things I host is a large archive of old levels for a sandbox game, totalling tens of thousands of pages, a lot of aggressive crawlers go absolutely insane on it. And rate limits be damned when the requests come from many different IP addresses.

I usually run my access logs through GoAccess every now and then to see what crawlers it picks up, then add new ones to the ever-growing shitlist in my nginx config of crawlers that scrape with no real purpose other than to be a nuisance (search indexers are generally fine, GoogleBot is a necessary evil but pretty well-behaved all things considered, and Mastodon federation pings are also usually fine):

if ($http_user_agent ~* "(AmazonBot|AhrefsBot|DataForSeoBot|SemrushBot|Barkrowler|Bytespider|DotBot|fidget-spinner-bot|my-tiny-bot|Go-http-client/1.1|ClaudeBot|GPTBot|Scrapy|heritrix|Awario)") {
    return 403;
}
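Side note on the config itself: once the list keeps growing, the same check can be expressed as an nginx map, which is generally more idiomatic than a regex inside if(). A sketch with a trimmed list (extend the pattern with the full shitlist):

map $http_user_agent $blocked_ua {
    default 0;
    "~*(AmazonBot|Bytespider|ClaudeBot|GPTBot|Scrapy)" 1;
}

server {
    if ($blocked_ua) {
        return 403;
    }
}

The map goes in the http block, is only evaluated when the variable is actually used, and keeps all the patterns in one place.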

Poisoning the AI training sounds like a noble goal, but I'm afraid it's probably futile; IMO telling them to go away entirely is more effective than feeding them poison. I've varied the error code I give them between 402 Payment Required, nginx's special 444 (which severs the connection immediately without a response), and 406 Not Acceptable. I ended up with 403, since some of the others like 444 just seem to make a crawler more intrigued and keep it retrying over and over. A big 403 seems to make it lose interest and go away immediately.

There are now also self-hostable things you can put in front of your website to challenge visitors, such as Anubis, which acts as an L7 DDoS mitigation by giving the browser a proof-of-work challenge to complete, à la Hashcash. But that will obviously also break things such as RSS feeds and APIs, and it's generally a last resort for heavier webapps that just straight up cannot stay up in the current climate.
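A rough sketch of how a challenge proxy like that is usually wired in, with carve-outs so feed readers and API clients keep working (the upstream addresses and ports are assumptions for illustration, not actual Anubis defaults):

server {
    # feeds and the API bypass the challenge and hit the app directly
    location /feed.xml {
        proxy_pass http://127.0.0.1:3000;
    }
    location /api/ {
        proxy_pass http://127.0.0.1:3000;
    }

    # everything else has to pass the PoW challenge first
    location / {
        proxy_pass http://127.0.0.1:8923;
    }
}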

Most of the website stuff I host is written from the ground up by me, and I still have that lean-and-mean Acmlmboard energy, with fast page rendering times shown in the footer. But sometimes it still just becomes too much: it clogs up the available PHP-FPM workers or grinds the MariaDB database to a halt, so staying on top of the new scrapers that show up generally makes sense. Seeing the likes of Kafuka and Kuribo64, which show all the bots on the online users page, really drives home how many IPs they are willing to burn through to collect their precious training data.
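For the worker-exhaustion part specifically, a per-IP rate limit can at least stop one greedy crawler from eating every PHP-FPM worker, even if it does nothing against the distributed ones mentioned above. A sketch (the zone size, rate and socket path are illustrative assumptions, not tuned values):

# in the http {} block: track request rates per client IP
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    location ~ \.php$ {
        # allow short bursts, then reject the rest of the flood (503 by default)
        limit_req zone=perip burst=20 nodelay;
        fastcgi_pass unix:/run/php/php-fpm.sock;
        include fastcgi_params;
    }
}

limit_req_status can point it at 403 instead, if that's what makes the scrapers give up.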

Staryu Trek
Posted on 05-14-25 06:11 PM Link | #102212

Posted by Ralsei
if ($http_user_agent ~* "(AmazonBot|AhrefsBot|DataForSeoBot|SemrushBot|Barkrowler|Bytespider|DotBot|fidget-spinner-bot|my-tiny-bot|Go-http-client/1.1|ClaudeBot|GPTBot|Scrapy|heritrix|Awario)") {
    return 403;
}
Yeah, I've seen Ahrefsbot and Semrush here too, but their sites said they only crawl sites for search engine accessibility. Nothing sus, I thought, but apparently I was wrong.



 "To boldly glitch where no one has glitched before" - Staryu Trek

 

Ralsei
Posted on 05-14-25 07:20 PM Link | #102214
Posted by Staryu Trek
Yeah, I've seen Ahrefsbot and Semrush here too, but their sites said they only crawl sites for search engine accessibility. Nothing sus, I thought, but apparently I was wrong.

Ahrefs and Semrush are search engine optimisation tools for increasing traffic through keyword and marketing analysis. Semrush is paid, and I'm obviously not going to pay for an SEO tool, so I'd never be able to access the data they collect anyway. Ahrefs I used to play with on their free webmaster tier, but I lost interest because there's honestly not much you can do with it anymore.

Most of these fancy SEO tools for gaming the rankings are more or less obsolete with the modern Google search algorithm anyway, and most advice nowadays boils down to "write honestly and genuinely interesting content and they will come". And they can't really compete with Google Search Console for getting hard numbers on how many thousands of people ended up seeing my blog post about washing the funny IKEA plush shark in the last three months (13 000, somehow), which generally ends up being all I want to know at the end of the day. (I don't blog entirely for the clicks, but seeing the posts you least expect hit top rankings is fun nonetheless.)

Their crawlers are a bit more benign, I suppose, but their traffic is just noise, so I lump them in with the far more nefarious bots: they still don't serve any useful purpose like indexing for an actual search engine or populating link embeds.


