I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless
That the bots are learning to get around it is disheartening; Anubis was a pain to set up and get running.
We had a trust-based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is why we can’t have nice things.
Big players are the ones behind most AIs though.
reminder to donate to codeberg and forgejo :)
Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape ?
Isn’t that an obvious solution ? I mean, it’s public data, it’s out there, do you want it public or not ?
Do you want it only on openai and google but nowhere else ? If so then good luck with the piranhas.
they won’t have to scrape ?
They don’t have to scrape, especially if robots.txt tells them not to.
it’s public data, it’s out there, do you want it public or not ?
Hey, she was wearing a miniskirt, she wanted it, right?
No no no, you don’t get to invoke grape imagery to defend copyright.
I know, it hurts when the human shields like wikipedia and the openwrt forums are getting hit, especially when they hand over the goods in dumps. But behind those human shields stand facebook, xitter, amazon, reddit and the rest of big tech garbage and I want tanks to run through them.
So go back to your drawing board and find a solution where the tech platform monopolists are made to relinquish our data back to us and the human shields also survive.
My own mother is prisoner in the Zuckerberg data hive and the only way she can get out is brute zucking force into facebook’s poop chute.
find a solution where the tech platform monopolists are made to relinquish our data
Luigi them.
Can’t use laws against them anyway…
I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.
The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.
Dumps or no dumps, these AI companies don’t care. They feel entitled to take or steal whatever they want.
That’s crazy, it makes no sense, it takes as much bandwidth and processing power on the scraper side to process and use the data as it takes to serve it.
They also have an open API that makes scraping entirely unnecessary.
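For reference, that open API is the MediaWiki Action API. A rough sketch of pulling one article as JSON, no scraping and no auth needed (the page title is just an example):

```python
# Rough sketch: fetch one article as JSON from the MediaWiki Action API.
# No scraping, no auth. The page title is just an example.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "parse", "page": "Web crawler", "format": "json"},
    timeout=10,
)
data = resp.json()
print(data["parse"]["title"])
print(len(data["parse"]["text"]["*"]), "bytes of rendered HTML")
```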
Here are the relevant quotes from the article you posted
“Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”
“At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”
“Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”
And it’s wikipedia ! The entire data set is trained INTO the models already, it’s not like encyclopedic facts change that often to begin with !
The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are so rare, and so untrustworthy, that the scrapers just scrape everything rather than taking the time to save bandwidth by relying on dumps.
Maybe it’s a consequence of the 2023 API wars, where it was made clear that data repositories would be leveraging their place as pools of knowledge to extract rent from search and AI, and places like wikipedia and other wikis and forums are getting hammered as a result of this war.
If the internet wasn’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site. Even if the site was hostile, like facebook, it would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.
The problem isn’t that the data is already public.
The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.
AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling or not, or about when it’s a good time to crawl again.
Yeah, but there wouldn’t be a need for scrapers if the robots file just pointed to a dump file.
Then the scraper could just spot check a few dozen random pages to verify the dump is actually up to date and complete, and then they’d know they don’t need to waste any time there and can move on.
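The spot check would be about this much code. Everything here is hypothetical: the dump layout, the page-to-URL mapping, the sample size.

```python
# Hypothetical sketch of the spot check: sample a few pages from the advertised
# dump, compare them to the live site, and only fall back to crawling if they
# don't match. The dump layout and URL mapping are made up for illustration.
import random
import zipfile

import requests

def dump_is_fresh(dump_path: str, live_urls: dict[str, str], samples: int = 30) -> bool:
    """live_urls maps a file name inside the dump to the corresponding live URL."""
    with zipfile.ZipFile(dump_path) as dump:
        names = random.sample(list(live_urls), min(samples, len(live_urls)))
        for name in names:
            dumped = dump.read(name).decode("utf-8", "replace")
            live = requests.get(live_urls[name], timeout=10).text
            if dumped.strip() != live.strip():
                return False  # stale or incomplete -> crawl after all
    return True
```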
Given that they already ignore robots.txt I don’t think we can assume any sort of good manners on their part. These AI crawlers are like locusts, scouring and eating everything in their path.
Crawlers are expensive and annoying to run, not to mention unreliable, and they produce low-quality data. If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot check that the dump is actually complete. This used to be standard, and it came with open API access for all, before the silicon valley royals put the screws on everyone.
Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler works on all the sites they want?
Is there even standardization for providing site dumps? If not, every site could require a custom software solution to use the dump. And I can guarantee you no one will bother implementing any dump-checking logic.
If you have contrary examples I’d love to see some references or sources.
The internet came together to define the robots.txt standard; it could just as easily come up with a standard API for database dumps. But it decided on war with the 2023 API wars, and now we’re going to see all the small websites die while facebook gets even more powerful.
Well there you have it. Although I still feel weird that it’s somehow “the internet” that’s supposed to solve a problem that’s fully caused by AI companies and their web crawlers.
If a crawler keeps spamming and breaking a site I see it as nothing short of a DoS attack.
Not to mention that robots.txt is completely voluntary and, as far as I know, mostly ignored by these companies. So then what makes you think that any of them are acting in good faith?
To me that is the core issue and why your position feels so outlandish. It’s like having a bully at school who constantly takes your lunch, and your solution being: “Just bring them a lunch as well, maybe they’ll stop.”
My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.
Spoiler for you bros: It will never be enough.
Okay what about…what about uhhh… Static site builders that render the whole page out as an image map, making it visible for humans but useless for crawlers 🤔🤔🤔
AI these days reads text from images better than humans can
Computer vision models can read/parse pixel geometry.
AI is pretty good at OCR now. I think that would just make it worse for humans while making very little difference to the AI.
The crawlers are likely not AI though, but yes OCR could be done effectively without AI anyways. This idea ultimately boils down to the same hope Anubis had of making the processing costs large enough to not be worth it.
OCR could be done effectively without AI
OCR has been done with neural nets since even before convolutional networks emerged in the 2010s
Yeah you’re right, I was using AI in the colloquial modern sense. My mistake. It actually drives me nuts when people do that. I should have said “without compute-heavy AI”.
My mistake
hold on I am still somewhat new to Fedi & not fully used to people being polite
Do you know how trivial it is to screenshot a website and push it through an OCR ?
This battle is completely unwinnable, just put a full dump.zip of the public data on the front door and nobody will waste their time with a scraper.
Is the data public or is it not ? At this point all you’re doing anyway is entrenching the power of openai, google and facebook while starving any possible alternative.
Anubis will never work, no version of anubis will ever be anything more than a temporary speed bump.
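It really is about this trivial, assuming a headless browser (Playwright here) and Tesseract for the OCR; the URL is just an example:

```python
# Rough sketch: render a page headlessly, screenshot it, OCR the pixels.
# Assumes Playwright and pytesseract/Tesseract are installed.
from playwright.sync_api import sync_playwright
from PIL import Image
import pytesseract

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.screenshot(path="page.png", full_page=True)
    browser.close()

print(pytesseract.image_to_string(Image.open("page.png")))
```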
Humans that don’t see:
Accessibility gets thrown out the window?
I wasn’t being totally serious, but also, I do think that while accessibility concerns come from a good place, there are some practical limitations that must be accepted when building fringe and counter-cultural things. Like, my hidden rebel base can’t have a wheelchair accessible ramp at the entrance, because then my base isn’t hidden anymore.
It sucks that some solutions can’t work for everyone, but if we just throw them out because they won’t work for 5% of people, we end up with nothing. I’d rather have a solution that works for 95% of people than no solution at all. I’m not saying that people who use screen readers are second-class citizens. If crawlers were vision-based then I might suggest matching text to background colors so that only screen readers work to understand the site. Because something that works for 5% of people is also better than no solution at all.
We need to tolerate having imperfect first attempts and understand that more sophisticated infrastructure comes later.
But yes my image map idea is pretty much a joke nonetheless
Tech bros just actively making the internet worse for everyone.
I mean, tech bros of the past invented the internet
Those were tech nerds. “Tech bros” are jabronis who see the tech sector as a way to increase the value of the money their daddies gave them.
Nah, that was DARPA
Those are not the tech bros. The tech bros are the ones who move fast and break things. The internet was built by engineers and developers
Tech bros just actively making ~~the internet~~ society worse for everyone.
FTFY.
Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.
I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.
Like Gemini?
From the official website:
Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.
It’s not the most well thought-out, from a technical perspective, but it’s pretty damn cool. Gemini pods are a freakin’ rabbit hole.
I personally played with Gemini a few months ago, and now I want a new Internet as opposed to a new Web.
Replace IP protocols with something better. With some kind of relative addressing, and delay-tolerant synchronization being preferred to real-time connections between two computers. So that there were no permanent global addresses at all, and no centralized DNS.
With the main “Web” over that being just replicated posts with tags hyperlinked by IDs, with IDs determined by content. Structured, like semantic web, so that a program could easily use such a post as directory of other posts or a source of text or retrieve binary content.
With user identities being a kind of post content, and post authorship being too a kind of post content or maybe tag content, cryptographically signed.
Except that would require resolving post dependencies and retrieving them too, with some depth limit, not just the post one currently opens, because, if it’d be like with bittorrent, half the hyperlinks in found posts would soon become dead, and also user identities would possibly soon become dead, making authorship checks impossible.
And posts (suppose even sites of that flatweb) being found by tags, maybe by author tag, maybe by some “channel” tag, maybe by “name” tag, one can imagine plenty of things.
The main thing is to replace “clients connecting to a service” with “persons operating on messages replicated on the network”, with networked computers sharing data like echo or ripples on the water. In what would be the general application layer for such a system.
OK, this is very complex to do and probably stupid.
It’s also not exactly at the same level as the IP protocols, so this could work over the Internet, just like the Internet worked just fine, for some people, over packet radio, UUCP or FTN email gates, and copper landlines. It’s having the Internet be the main layer in terms of which we find services - the IP protocols, TCP, UDP, ICMP, all that, with DNS on the application layer - that I consider wrong; it’s too hierarchical. So it’s not a “replacement”.
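If it helps make the idea concrete, here’s a toy sketch of just the “post” part (every field name in it is made up): the ID is the hash of the content, and authorship is a signature carried as content, so any node can replicate and verify it without servers or global addresses.

```python
# Toy sketch only: a content-addressed, signed "post". The ID comes from the
# content hash, authorship is a signature over that content. Field names are made up.
import hashlib
import json

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def make_post(author_key: Ed25519PrivateKey, body: str, tags: list[str]) -> dict:
    content = json.dumps({"body": body, "tags": sorted(tags)}, sort_keys=True).encode()
    pubkey = author_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw
    )
    return {
        "id": hashlib.sha256(content).hexdigest(),  # ID determined by content
        "content": content.decode(),
        "author": pubkey.hex(),                     # identity is itself just data
        "sig": author_key.sign(content).hex(),      # authorship as signed content
    }

key = Ed25519PrivateKey.generate()
post = make_post(key, "hello flatweb", ["channel:misc", "name:first-post"])
print(post["id"])  # other posts would hyperlink to this hash, not to a host
```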
IP is the most robust and best protocol humanity ever invented. No other protocol survived the test of time this well. How would you even go about replacing it with decentralization? Something needs to route the PC to the server
Something needs to route the PC to the server
I don’t want client-server model. I want sharing model. Like with Briar.
The only kind of “servers” might be relays, like in NOSTR, or machines running 24/7 like Briar mailbox.
IP. How would I go about replacing it? I don’t know. I think the Yggdrasil authors have written something about their routing model, but 1) it’s represented as ipv6, so still IP, 2) it’s far over my head, 3) see the previous point, I don’t really want to replace it so much as not make it the main common layer.
client-server model. I want sharing model. Like with Briar
Guess what
Briar itself, and every pure P2P decentralized network where all nodes are identical… are built on Internet Sockets which inherently require one party (“server”) to start listening on a port, and another party (“client”) to start the conversation.
Briar uses TCP/IP, but it uses Tor routing, which is IMO a smart thing to do
I’m talking about Briar used over BT.
Even AF_BLUETOOTH sockets are… sockets, where one machine (“server”) opens to listen, and the other (“client”) initiates the stream.
Won’t the bots just adapt and move there too?
Yep! That was exactly the protocol on my mind. One thing, though, is that the Fediverse would need to be ported to Gemini, or at least for a new protocol to be created for Gemini.
It shouldn’t be too hard, and considering private key authentication, you could even use a single sign-in for multiple platforms/accounts, and use the public key as an identifier to link them across platforms. I know there’s already a couple of proof-of-concept Gemini forums/BBSs out there. Maybe they just need a popularity boost?
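For what it’s worth, the single sign-in part is basically free in Gemini, since identity is just a self-signed TLS client certificate. A sketch of the idea using Python’s cryptography package; deriving a “user id” from the public key hash is my own assumption here, not part of any Gemini spec:

```python
# Sketch: a Gemini-style identity is just a self-signed client certificate.
# Hashing the public key into a cross-capsule "user id" is an assumption, not a spec.
import datetime
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

key = ec.generate_private_key(ec.SECP256R1())
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "alice")])
now = datetime.datetime.now(datetime.timezone.utc)
cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                    # self-signed: issuer == subject
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=3650))
    .sign(key, hashes.SHA256())
)

# The public-key fingerprint is what a Gemini forum/BBS could store as the account;
# it stays the same on every capsule the user presents the certificate to.
spki = key.public_key().public_bytes(
    serialization.Encoding.DER, serialization.PublicFormat.SubjectPublicKeyInfo
)
print("user id:", hashlib.sha256(spki).hexdigest())
```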
If it becomes popular enough that it’s used by a lot of people then the bots will move over there too.
They are after data, so they will go where it is.
One of the reasons that all of the bots are suddenly interested in this site is that everyone’s moving away from GitHub, suddenly there’s lots of appealing tasty data for them to gobble up.
This is how you get bots, Lana
Is there nightshade but for text and code? Maybe my source headers should include a bunch of special characters that then give a prompt injection. And sprinkle some nonsensical code comments before the real code comment.
There are glitch tokens, but I think those only affect the model when you’re using it.
Maybe like a bunch of white text at 2pt?
Not visible to the user, but fully readable by crawlers.
If a bot can’t read it, nor can a visually impaired user.
Well if it’s a prompt injection to fuck with llms you don’t want any users having to read it anyway, vision impaired or no.
You missed my point. A prompt injection to fuck with LLMs would be read by a visually impaired user’s screen reader.
I think the issue is that text carries comparatively very little information, so you can’t just inject invisible changes by flipping the least significant bits - you’d need to change the actual phrasing/spelling of your text/code, and that’d be noticeable.
Is there a migration tool? If not, it would be awesome to have one that migrates everything, including issues and stuff. Bet even more people would move.
Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.
There are migration tools, but not a good bulk one that I could find. It worked for my repos except for my unreal engine fork.
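For the bulk case you can loop over the same built-in migrator through Codeberg’s Gitea/Forgejo-style API. Rough sketch only; the field names here are from memory and should be treated as assumptions, check the API docs before trusting them:

```python
# Rough sketch of bulk migration via Codeberg's Gitea/Forgejo-compatible API.
# Field names are from memory and may differ -- check the swagger docs first.
import requests

CODEBERG_TOKEN = "..."   # Codeberg API token (placeholder)
GITHUB_TOKEN = "..."     # token for the source repos (placeholder)

def migrate(owner: str, repo: str) -> None:
    resp = requests.post(
        "https://codeberg.org/api/v1/repos/migrate",
        headers={"Authorization": f"token {CODEBERG_TOKEN}"},
        json={
            "clone_addr": f"https://github.com/{owner}/{repo}.git",
            "repo_name": repo,
            "service": "github",
            "auth_token": GITHUB_TOKEN,
            # pull in everything, not just the git data
            "issues": True, "labels": True, "milestones": True,
            "releases": True, "wiki": True, "pull_requests": True,
        },
        timeout=300,
    )
    resp.raise_for_status()

for name in ["my-first-repo", "my-second-repo"]:  # hypothetical repo list
    migrate("my-github-user", name)
```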
Question: do those artificial stupidity bots want to steal the issues or the code? Because why are they wasting a lot of resources scraping millions of pages when they can steal everything via SSH (once a month, not 120 times a second)?
they just want all text
That would require having someone with real intelligence running the scraper.
Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.
The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. JavaScript optimization was one direction that got pursued pretty far.
But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.
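The asymmetry being described is the whole trick: solving is expensive, checking is one hash. A toy hashcash-style version (roughly the shape of what Anubis-style challenges make the browser do, just not “useful”):

```python
# Toy hashcash-style proof of work: solving means grinding nonces (expensive),
# verifying means one hash (cheap). "Useful" work rarely verifies this cheaply.
import hashlib
from itertools import count

def _hash_value(challenge: str, nonce: int) -> int:
    return int.from_bytes(hashlib.sha256(f"{challenge}:{nonce}".encode()).digest(), "big")

def solve(challenge: str, difficulty: int = 20) -> int:
    target = 1 << (256 - difficulty)
    for nonce in count():                      # many, many hash attempts
        if _hash_value(challenge, nonce) < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    return _hash_value(challenge, nonce) < (1 << (256 - difficulty))  # one hash

nonce = solve("example-challenge")
print(verify("example-challenge", nonce))      # True
```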
Not without making real users also mine bitcoin/avoiding the site because their performance tanked.
Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O
The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.
At least LLMs produce something, even if it’s slop, all crypto does is… What does crypto even do again?
Monero allows you to make untraceable transactions. That can be useful.
The encryption schemes involved (or what I understand of them, at least) are pretty rad imo. That’s why it interests me.
Still, it’s proof of work, which is not great.
Sure, Monero is good for privacy-focused applications, but it’s a fraction of the market and the larger coins aren’t particularly any less traceable than virtual temporary payment cards, so Monero (and other privacy-centric coins) get overshadowed by the garbage coins.
Same with AI, where non-LLM models are having a huge impact in medicine, chemistry, space exploration and more, but because tech bros are shouting about the objectively less useful ones, it brings down the reputation of the entire industry.
It gives people with already too much money a way to invest by gambling without actually helping society.
Crypto does drug sales and fraud!
It also makes its fans poorer, which at least is funny, especially since they never learn
Blockchain m8 gg
ouch. I never made that comparison, but that is on point.
I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.
To be fair: it’s a great tool for scamming people (think ransomware) :/
Great for money laundering.
Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.
Hey dipshits:
The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.
- Not every artificial intelligence is a deep neural network algorithm.
- Not every deep neural network algorithm is a generative adversarial network.
- Not every generative adversarial network is a language model.
- Not every language model is a large language model.
Fucking fart-sniffing twats.
$ ./end-rant.sh
LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.
Developing, training and fine-tuning that model was a research effort led by two guys who got a Nobel for it. AlphaFold can’t do conversation or give you hummus recipes, it knows shit about the structure of human language but can identify patterns in the domain where it has been specifically and painstakingly trained.
It wasn’t “hey chatGPT, show me how to fold a protein” is all I’m saying and the “superhuman reasoning capabilities” of current LLMs are still falling ridiculously short of much simpler problems.
The crawlers for LLM are not themselves LLMs.
They can’t bitcoin mine either, so technical feasibility wasn’t the goal of my reply
You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.
For mbin I managed to kill the scraper attack using only Cloudflare’s managed challenge, for everything except the fediverse POST endpoints and requests from fediverse user agents on certain GET endpoints. Managed challenge on everything else.
So far, they’ve not gotten past it. But, a matter of time.
man, you’d think they’d just use the actual activitypub protocol to inhale all that data at once and not bother with costly scraping.
This A aint very I
Well the posts to inbox are generally for incoming info. Yes, there are endpoints for fetching objects. But they don’t work for indexing, at least not on mbin/kbin. If you have a link, you can use activitypub to traverse upwards from that object to the root post. But you cannot iterate down to child comments from any point.
The purpose is that say I receive an “event” from your instance. You click like on a post I don’t have on my instance. Then the like event has a link to the object for that on activitypub. If I fetch that object it will have a link to the comment, if I fetch the comment it will have the comment it was in reply to, or the post. It’s not intended to be used to backfill.
So they do it the old fashioned way, traversing the human side links. Which is essentially what I lock down with the managed challenge. And this is all on the free tier too.
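For anyone curious, the “upwards only” part looks roughly like this in practice: from any one object you can follow inReplyTo toward the root, but there’s no standard way to walk back down. Rough sketch only; it skips the HTTP signature auth that plenty of instances require for fetches.

```python
# Rough sketch of walking "up" an ActivityPub thread via inReplyTo.
# Skips HTTP signature auth, which many instances require for fetches.
import requests

def walk_to_root(object_url: str, max_hops: int = 50) -> list[str]:
    headers = {"Accept": "application/activity+json"}
    chain, url = [], object_url
    for _ in range(max_hops):
        obj = requests.get(url, headers=headers, timeout=10).json()
        chain.append(obj.get("id", url))
        parent = obj.get("inReplyTo")
        if not parent:
            break                              # reached the root post
        url = parent if isinstance(parent, str) else parent.get("id")
    return chain
```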
Same for all the WordPress blogs: by default there’s an unauthenticated API in all of them that lets you download ALL the posts as easy JSON.
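That’s the WordPress REST API, and paging through it takes about this much (the site URL is just an example):

```python
# Sketch: page through every published post on a default WordPress install
# via the unauthenticated REST API. The site URL is just an example.
import requests

def fetch_all_posts(base_url: str) -> list[dict]:
    posts, page = [], 1
    while True:
        r = requests.get(
            f"{base_url}/wp-json/wp/v2/posts",
            params={"per_page": 100, "page": page},
            timeout=10,
        )
        if r.status_code != 200:   # WordPress errors out past the last page
            break
        batch = r.json()
        if not batch:
            break
        posts.extend(batch)
        page += 1
    return posts

print(len(fetch_all_posts("https://example.com")), "posts")
```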
Dear artificial stupidity bot… WHY THE FUCK ARE YOU FUCKING SCRAPING THE WHOLE PAGE 50 TIMES A SECOND???
AI was never intelligent. It’s a marketing term, that’s all. It has absolutely no meaning.
what this felt like while reading
Crazy. DDoS attacks are illegal here in the UK.
The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.
So, sue the attackers?