Almost every website and service is getting scraped at an alarming rate. Are Lemmy servers facing this issue?
Please share mitigations you’ve seen applied to this.
We made a post about our actions here
They don’t really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
slrpnk.net has an AI intercept called Anubis, fwiw
One link on your website leads to a never-ending labyrinth of nonsense to slowly poison an LLM.
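For anyone wondering what such a maze could look like, here is a rough sketch, not how any real tool does it. The word list, the `maze_page` name and the URL layout are all made up for illustration. The idea is that each page is generated from a hash of its own path, so the site never has to store anything, and every page links to more pages of filler.

```python
import hashlib
import random

# Made-up filler vocabulary; a real maze would use something far larger.
WORDS = ["amber", "lattice", "quorum", "drift", "saffron", "velvet", "marrow", "cinder"]


def maze_page(path: str, links: int = 3, words: int = 20) -> str:
    """Return a filler HTML page whose links lead to more filler pages."""
    # Seed the generator from the path so the same URL always renders the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)

    text = " ".join(rng.choice(WORDS) for _ in range(words))

    # Every link points one level deeper, so a crawler never reaches a dead end.
    base = path.rstrip("/")
    hrefs = [f"{base}/{rng.choice(WORDS)}-{rng.randrange(10_000)}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)

    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"
```

You would serve `maze_page(request_path)` for anything under one hidden prefix and keep that prefix out of sight of human visitors.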
It’s very easy for any ActivityPub content to be scraped; all servers practically serve the content on a silver platter to any federated server.
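To illustrate the "silver platter" part: ActivityPub objects are plain JSON (ActivityStreams 2.0), usually requested with an `Accept: application/activity+json` header. A minimal sketch follows. The sample object, its URLs and the `summarize` helper are hypothetical, and real Lemmy objects carry more fields than this.

```python
import json
import urllib.request

ACTIVITY_JSON = "application/activity+json"


def activity_request(url: str) -> urllib.request.Request:
    """Build a GET request asking for the ActivityStreams JSON form of an object."""
    return urllib.request.Request(url, headers={"Accept": ACTIVITY_JSON})


def summarize(obj: dict) -> dict:
    """Pull out the fields a scraper would most likely care about."""
    return {
        "type": obj.get("type"),
        "author": obj.get("attributedTo"),
        "content": obj.get("content"),
        "published": obj.get("published"),
    }


# Hypothetical comment object; the URLs and text are invented for illustration.
SAMPLE = json.loads("""
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://example.org/comment/1",
  "type": "Note",
  "attributedTo": "https://example.org/u/alice",
  "content": "<p>Hello fediverse</p>",
  "published": "2025-01-01T00:00:00Z"
}
""")

print(summarize(SAMPLE))
```

No HTML parsing is needed at all, which is why nobody has to hammer the web UI to get the same data.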
I think Lemmy content is scraped too, just like the whole web is being scraped. I do not have any proof of it, though.
I have seen a user add an anti-commercial-AI license as a footer to every comment he writes lol
Those are truly useless against bad actors and are instead only annoying for the humans who read them. And good actors with proper licenses won’t be scraping Lemmy, Reddit or Twitter.
You just cannot prevent it on Lemmy, because if one instance deploys filters like Anubis, another will not. And it is not feasible to mandate that every instance do so. Also, this is an open platform by nature and there is no group or company that can mandate rules of access. When you limit non-humans, you might also be limiting real users with peculiar configurations or heavy privacy middleware.
The point (as I see it) is not so much to stop scraping as it is to prevent bots from effectively DDoS-ing web services. As others have said, ActivityPub content is public and there are ways to get it without slamming instances with scraper bots.
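One common way to keep any single client from slamming a server is per-client rate limiting. Below is a minimal token-bucket sketch of that idea. It is not something Lemmy or Anubis is known to ship; the class name and parameters are made up, and in practice this usually lives in the reverse proxy rather than in application code.

```python
import time


class TokenBucketLimiter:
    """Allow each client `burst` requests at once, refilled at `rate` per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._state = {}    # client id -> (tokens left, time of last request)

    def allow(self, client: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(client, (float(self.burst), now))

        # Refill for the time elapsed since the last request, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)

        if tokens >= 1:
            self._state[client] = (tokens - 1, now)
            return True

        self._state[client] = (tokens, now)
        return False
```

Keyed by IP address, something like `TokenBucketLimiter(rate=1.0, burst=10)` lets a normal reader browse freely while a crawler firing hundreds of requests per second mostly gets refused.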
It is. I saw ClaudeBot and GPTBot scraping my instance and made a post about it on fuckai, but I have blocked all these bots now and my instance is a lot faster.
Out of curiosity, since I am not at all familiar with the stack that runs behind the scenes for Lemmy: are you blocking IP ranges or something else?
I use this nginx extension. It has a lot of rules that seem to mix IP, user agent, etc. to decide when to block a response. Like adblocking rules, but for bots.
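To give a feel for what rules mixing IP and user agent boil down to, here is a small sketch in Python. It is not the extension's actual rule format. The `is_blocked` helper is hypothetical, and the IP ranges are reserved documentation networks standing in for real crawler ranges. Only the GPTBot and ClaudeBot names come from this thread.

```python
import ipaddress

# User-agent substrings to refuse, compared case-insensitively.
BLOCKED_AGENT_TOKENS = ("gptbot", "claudebot")

# Placeholder ranges (reserved for documentation), NOT real crawler networks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if either the user agent or the source address matches a rule."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return True

    # An IPv4 address is never "in" an IPv6 network (and vice versa), so mixing is safe.
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

The user-agent check only catches bots that identify themselves honestly, which is why rule sets like this tend to include IP ranges as well.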
If AI respected that, people would not need AI mazes.
I use this nginx extension.