It’s always a cat-n-mouse game.
Except previously bombarding another person’s server for personal gain was illegal.
I don’t know if this is news to you, but most of the internet never cared about what’s legal or not.
I knew that was the worse option. Use the one that traps them in an infinite maze.
You need to properly detect that they’re bots first and then they’ll just figure out how to spoof that. Then you’re back to square one.
Abstractly, PoW doesn’t need to determine if you’re a bot or not. To make a request, as a human or bot, you need to pay in CPU time. The hope is that the cost is low enough that a human barely notices, but for a bot trying to hoover up data as fast as possible, the aggregate cost is high.
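To make the "pay in CPU time" idea concrete, here is a minimal hashcash-style sketch in Python. It is not Anubis's actual code: the challenge string, the `solve` and `verify` names, and the difficulty value are all assumptions made up for illustration. The client has to brute-force a nonce, while the server checks the answer with a single hash.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    # Hash the server-issued challenge together with the client's nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    # Hypothetical challenge value; each extra hex digit of difficulty
    # multiplies the expected client work by 16.
    nonce = solve("example-challenge", 4)
    print(nonce, verify("example-challenge", nonce, 4))
```

The asymmetry is the whole point: solving takes many hashes on average, verifying takes one, and neither side needs to know whether the client is a bot.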
I think the more horrifying aspect is that they’ll just build ever bigger datacenters to crunch POW tests faster and the carbon cost will skyrocket even more.
Oh I haven’t even considered the carbon aspect. Anubis is an even worse idea than I previously thought…
Exactly. Imagine needing to pay a penny for every request. Not a huge deal for someone who only makes one or two requests per year. But if you’re running a bot farm and making tens of millions of requests per day, you’ll quickly find that your operating costs have skyrocketed. That’s basically the idea behind Anubis: make someone pay in CPU time, so the legit users don’t really notice but bots quickly eat up all of their servers’ CPU.
Crazy. DDoS attacks are illegal here in the UK.
So, sue the attackers?
The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.
how this felt like while reading
Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.
Not without making real users also mine bitcoin/avoiding the site because their performance tanked.
Did you just use the words “useful” and “bitcoin” in the same sentence? o_O
Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.
You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.
Hey dipshits:
The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.
- Not every artificial intelligence is a deep neural network algorithm.
- Not every deep neural network algorithm is a generative adversarial network.
- Not every generative adversarial network is a language model.
- Not every language model is a large language model.
Fucking fart-sniffing twats.
$ ./end-rant.sh
LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.
Developing, training and fine-tuning that model was a research effort led by two guys who got a Nobel for it. AlphaFold can’t do conversation or give you hummus recipes. It knows shit about the structure of human language, but it can identify patterns in the domain where it has been specifically and painstakingly trained.
It wasn’t “hey chatGPT, show me how to fold a protein” is all I’m saying and the “superhuman reasoning capabilities” of current LLMs are still falling ridiculously short of much simpler problems.
The crawlers for LLMs are not themselves LLMs.
They can’t bitcoin mine either, so technical feasibility wasn’t the goal of my reply.
I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.
To be fair: it’s a great tool for scamming people (think ransomware) :/
Great for money laundering.
The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.
At least LLMs produce something, even if it’s slop, all crypto does is… What does crypto even do again?
It gives people with already too much money a way to invest by gambling without actually helping society.
Monero allows you to make untraceable transactions. That can be useful.
The encryption schemes involved (or what I understand of them, at least) are pretty rad imo. That’s why it interests me.
Still, it’s proof of work, which is not great.
Sure, Monero is good for privacy-focused applications, but it’s a fraction of the market and the larger coins aren’t any less traceable than virtual temporary payment cards, so Monero (and other privacy-centric coins) get overshadowed by the garbage coins.
Same with AI, where non-LLM models are having a huge impact in medicine, chemistry, space exploration and more, but because tech bros are shouting about the objectively less useful ones, it brings down the reputation of the entire industry.
Crypto does drug sales and fraud!
It also makes its fans poorer, which at least is funny, especially since they never learn.
Blockchain m8 gg
ouch. I never made that comparison, but that is on point.
The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. JavaScript optimization was one direction that got pursued pretty far.
But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.
I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless
That the bots are learning to get around it is disheartening. Anubis was a pain to set up and get running.
Is there a migration tool? If not, it would be awesome to migrate everything, including issues and stuff. Bet even more people would move.
Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.
There are migration tools, but not a good bulk one that I could find. It worked for my repos except for my Unreal Engine fork.
We had a trust-based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is why we can’t have nice things.
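For anyone who hasn't looked at how the trust-based part works: honoring robots.txt is a check the crawler chooses to run on its own side. A minimal sketch using Python's standard `urllib.robotparser` follows; the robots.txt content, the `ExampleBot` user agent and the example.org URLs are hypothetical, made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is asked to stay out entirely,
# everyone else is only asked to avoid /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("ExampleBot", "https://example.org/some/repo"))
    print(allowed("SomeBrowser", "https://example.org/some/repo"))
```

Nothing on the server enforces the answer. A crawler that skips this check, or lies about its user agent, gets the page anyway, which is exactly the trust that got broken.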
Big players are the ones behind most AIs though.
Eventually we’ll have “defensive” and “offensive” LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.
Places like Cloudflare and Akamai are already using machine learning algorithms to detect bot traffic at a network level. You need to use similar machine learning to evade them. And since most of these scrapers are for AI companies I’d expect a lot of the scrapers to be LLM generated.
That’s actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net that are constantly bombarding a giant firewall protecting the main net and everything connected to it from being taken over by the AI.
The game is an excellent documentary.
Not to mention the firewall is itself AI.
Unrelated, but I saw this headline, and could hear both you and squidward swearing from here.
It doesn’t bode well. Honestly I fear at some point in the future, if these countermeasures can’t keep up, small sites may need to close themselves off with invite-only access. Hopefully that’s quite a distant future.
Obligatory AI ≠ LLM. How would scrapers benefit from the LLMs they help train? The defense is obvious, LLM-generated slop traps against scrapers already exist.
I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response or maybe just a drone strike and cut out the middleman. Once you’re actively evading Anubis, fuckin’ game on.
Wasn’t this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?
These crawlers come from random people’s devices via shady apps. Each request comes from a different IP
Yep
Is that really true? I guess I have no reason to doubt it, I just hadn’t heard it before.
Here’s one example of a proxy provider offering to pay developers to inject their proxies into their apps. (“100% ethical proxies” because they signed a ToS). Another is BrightData, which proxies traffic through users of their free HolaVPN.
IoT devices and smart TVs are also obvious suspects.
Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why they do IP range blocks. That’s why in Codeberg’s response, they mention that after they fixed the configuration issue that only blocked those IP ranges on non-Anubis routes, the crawling stopped.
For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.
Perplexity also publishes IP ranges, but Cloudflare later found them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from “shady apps.” Instead, they would simply rotate ASNs, and request a new IP.
The reason they do this is because it is still legal for them to do so. Rotating ASNs and IPs within that ASN is not a crime. However, maliciously utilizing apps installed on people’s devices to route network traffic they’re unaware of is. It also carries much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don’t want.
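The IP range blocks mentioned above boil down to a membership test against a list of networks. Here is a minimal sketch with Python's standard `ipaddress` module. The ranges are reserved documentation prefixes used as placeholders, not any company's real published ranges, and `BLOCKED_RANGES` and `is_blocked` are made-up names.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges). A real deployment
# would load the ranges a crawler operator publishes instead.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(addr: str) -> bool:
    """Return True if the client address falls inside any blocked range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5", "2001:db8::1"):
        print(candidate, is_blocked(candidate))
```

This only works while the crawler stays inside ranges you know about, which is why rotating ASNs and IPs defeats it.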
Honestly, man, I get what you’re saying, but also at some point all that stuff just becomes someone else’s problem.
This is what people forget about the social contract: It goes both ways, it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat / with some friends. That wasn’t really the way, and so we arrived at this deal where no one had to do that, but then people always start to fuck over other people involved in the system thinking that that “no one will show up at my place with a bat, whatever I do” arrangement is a law of nature. It’s not.
Or your TV or IoT devices. Residential proxies are extremely shady businesses.
Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.
It’s also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDoS attack is to… rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather the orgs I donate to not give that money to blackhat orgs. But that is just me.
I think the best thing to do is to not block them when they’re detected but poison them instead. Feed them tons of text generated by tiny old language models. It’s harder to detect, and it also messes up their training and makes the models less reliable. Of course you would want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power since the scrapers probably don’t really care about the speed.
Yeah that was my thought. Don’t reject them, that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually build up to levels where it compromises them.
I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and cost more work to separate good from bad, and they’ll eventually either sod off or die.
A low-power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AIs. It doesn’t need to be done in real time either, as datasets can be generated in advance.
A low-power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AIs.
Even “high power” AIs would produce bad data. It’s currently well known that feeding AI data to an AI model decreases model quality and if repeated, it just becomes worse and worse. So yea, this is definitely viable.
Yup. It was more my thought that a low-power one could produce sufficient results while requiring fewer resources. Something that can run on a desktop computer could still produce a database with reams of believable garbage that would take a lot of resources from the attacking AI to sort through, or otherwise corrupt its own harvested cache.
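As a rough idea of how cheap "believable garbage" can be, here is a sketch that swaps in a word-level Markov chain for the tiny language model discussed above. It is a much simpler technique, named plainly as a substitute. The seed text, function names and seed value are all assumptions for illustration, and the output can be generated ahead of time and served as static files.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the seed text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain: dict, start: str, length: int, seed: int = 0) -> str:
    """Walk the chain for `length` words, restarting at `start` on dead ends."""
    rng = random.Random(seed)  # fixed seed so pre-generated batches are reproducible
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = rng.choice(followers) if followers else start
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    seed_text = "the crawler reads the page and the page feeds the crawler more pages"
    print(generate(build_chain(seed_text), "the", 20))
```

Whether text this crude would survive a scraper's quality filters is an open question; a small real language model would be more convincing at a higher compute cost.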
I love catching bots in tarpits, it’s actually quite fun
Some guy also used zip bombs against AI crawlers, don’t know if it still works. Link to the Lemmy post
The problem is primarily the resource drain on the server and tarpitting tactics usually increase that resource burden by maintaining the open connections.
The idea is that eventually they would stop scraping you cause the data is bad or huge. But it’s a long term thing, it doesn’t help in the moment.
The promise of money — even diminishing returns — is too great. There’s a new scraper spending big on resources every day while websites are under assault.
In the paraphrased words of the finance industry: AI can stay stupid longer than most websites can stay solvent.
Right
I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.
I think every company in the world has been telling everyone for a few months now that what matters is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get out of the internet. Our ONLY purpose is to feed the machine, anything else is irrelevant. Play along or you shall be removed.
get out of the internet.
At some point, this would be the best option, sadly
We need to poison better.
I was fine before the AI.
The biggest customers of AI are the billionaires who can’t hire enough people for their technofeudalist/surveillance capitalism agenda. The billionaires (wannabe aristocrats) know that machines have no morals, no bottom lines, no scruples, don’t leak info to the press, don’t complain, don’t demand to take time off or to work from home, etc.
AI makes the perfect fascist.
They sell AI like it’s a benefit to us all, but it ain’t that. It’s a benefit to the billionaires who think they own our world.
AI is used for censorship, surveillance pricing, activism/protest analysis, making firing decisions, making kill decisions in battle, etc. It’s nightmare fuel under our system of absurd wealth concentration.
Fuck AI.
If this isn’t fertile grounds for a massive class-action lawsuit, I don’t know what would be.
Who’s the defendant, specifically?
No, that’s a good point. We all bloody well know there isn’t a single provider of LLMs that isn’t sucking the entire Internet dry while gleefully ignoring robots.txt and expecting everybody else to pay the bill on their behalf, but the AI providers are getting really good at using other people’s IPs both to mask their identity and to evade blacklists, which is yet another abusive behavior.
But that’s beside your point. So forget the class-action lawsuit in favor of the relevant Ombudsman.
Either way, this cannot go on. Donation-driven open source projects are being driven into the ground by exploding bandwidth and hosting costs, and people are being forced to deploy tools like Anubis that eat additional resources - including the resources of every legitimate user. The cumulative damage this is doing is no joke.
I’m ashamed to say that I switched my DNS nameservers to CF just for their anti crawler service.
Knowing Cloudflare, god knows how much longer it’ll be free for. Did you enable the AI black hole/tarpit? It’s the main reason I’ve used their stuff.
TIL! Just enabled it, thanks
There once was a dream of the semantic web, also known as web2. The semantic web could have made the information on webpages easy to ingest, removing so much of the computation required to extract it, and thus preventing much of the AI crawling CPU overhead.
What we got as web2 instead was social media, destroying facts and making people depressed at a never-before-seen rate.
Web3 was about enabling us to securely transfer value between people digitally and without middlemen.
What crypto gave us was fraud, expensive jpgs and scams. The term web is now so eroded that it has lost much of its meaning. The information age gave way to the misinformation age, where everything is fake.
Web3 was about enabling us to securely transfer value between people digitally and without middlemen.
It’s ironic that the middlemen showed up anyway and busted all the security of those transfers
You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins
oh btw every government on the planet showed up and dug through our insecure records. hope you weren’t actually buying shroom drugs on the slip rod
also we got hacked, you lost all your bipcoins sorry
At least, that’s my recollection of events. I was getting my illegal narcotics the old fashioned way.
You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins
Maybe I’m slow today, but what is this referencing? Most dark web sites use Monero. Is there some centralized token that people used instead?
Edit: Oh, I guess you’re referring to Mt.Gox? I mean yeah, people were pretty stupid for keeping their bitcoin in exchange wallets (and sending it right to the drug dealers directly from there? Real dumb). That’s always a bad idea. I don’t think they transferred it there instead of something else, they just never took custody of the coins after buying them on the exchange.
Monero
Satoshi was right and Crypto absolutely has valid use cases. What if your government doesn’t want you accessing meds you need at prices you can afford? What if your government doesn’t like your sexual orientation, but you want a subscription to a dating site? What if your government throws up unjust export controls or tariffs that suddenly make you and your business impossible?
Crypto’s best killer use case is uncensorable, untraceable money
Bitcoin is neither of those things. There is a reason people buy heroin with Monero. It actually does what crypto is supposed to do, which means it could safeguard your Grindr XTRA subscription.
Yeah Monero is just cool, tech-wise
the old fashioned way.
A whole swath of trained toads using a special made tube network?
Nah, they clearly meant in liquid form.
getting into a car with a stranger who said he was 15 minutes away two hours ago
You were there too? 😜
Holy shit do I not miss that lifestyle
also we got hacked, you lost all your bipcoins sorry
aaaaaaaaand - it’s gone!
Much drama.
I agree about semantic web, but the issue is with all of the Internet. Both its monopoly as the medium of communication, and its architecture.
And if we go semantic for webpages, allowing the clients to construct representation, then we can go further, to separate data from medium, making messages and identities exist in a global space, as they (sort of, need a better solution) do in Usenet.
About the Internet itself being the problem - that’s because it’s hierarchical, despite appearances, and nobody understands it well. Especially since new systems of this kind are not being built often, to say the least, so the majority of people using the Internet don’t even think about it as a system. They take it as a given that this is the only paradigm for the global network. And that it’s application-neutral, which may not be true.
20 years ago, when I was a kid, people would think and imagine all kinds of things about the Internet and about the future and about ways all this can break, and these were normal people, not tech types, and one would think with time we wouldn’t become more certain, as it becomes bigger and bigger.
OK, I’m just having an overvalued idea that the Internet is poisoned. Bad sleep, nasty weather, too many sweets eaten. Maybe that movement of packets on the IP protocol can somehow give someone free computation, with enough machines under their control, by using counters in the network stack as registers, or maybe something else.
Sounds like it went the same way everything else went. The less money is involved the more trustworthy it is.
Mr. Internet, tear down these walls! (for all these walled gardens)
Return the internet to the wild. Let it run feral like dinosaurs on an island.
Let the grannies and idiots stick themselves in the reservations and asylums run by billionaires.
Let’s all make Neocities pages about our hobbies and dirtiest, innermost thoughts. With gifs all over.
I’m down with that. Web 1.5? Let’s do it. I’ll get my Geocities page up and then we can rev up that hit counter.
https://homestarrunner.com/toons/backtoawebsite
“Lemme get that hit counter!”
Capitalism is grand, innit. Wait, not grand, I meant to say cancer
I feel like half of the blame capitalism gets is valid, but the other half is just society. I don’t care what kind of system you’re under, you’re going to have to deal with other people.
Oh, and if you try the system where you don’t have to deal with people, that just means other people end up handling you.
I would give this reddit gold
Instant easy complaints help-i’m-oppressed-by-Capitalism today sound an awful lot like the instant easy complaints help-i’m-oppressed-by-Communism I used to hear from rednecks
Ask someone who starved & died under either system how obviously superior it is, you will find millions on either side
Also consider that Socialism is totally legal under Capitalism. Want to start a co-op? Go for it. Want to legislate and implement socialized healthcare? Many Capitalist countries have.
Under Communism, Capitalism must be illegal and stamped out by force. Want to start a business making shoes and hire someone to work for an agreed upon wage? Illegal.
When the goal involves guaranteeing positive rights, I’m not sure how it can be achieved without coercion. Which is how any socialist policies get implemented under capitalism anyways.
Could you clarify what you mean by “dealing with people”? I’m not really sure what point you’re trying to make with that.
The complaint that got blamed on capitalism was:
The information age gave way to the misinformation age, where everything is fake.
and if there’s one entity/person most responsible for that, it’s Putin or the GOP. Most of it is political, and very little to do with capitalism itself. Except that capitalism surrounds and is intertwined with everything.
Still, if you get rid of capitalism, it doesn’t get rid of politics. I’d argue that the root of the issue is the GOP trying to hoard power (money and otherwise), and power is going to exist with or without capitalism. Is North Korea capitalist? Do they have issues with disinfo?
This Christian Sharia Law movement doesn’t exist for money.
Capitalism itself is political. But the point I was making was that capitalism was the driving force of the enshittification of all our technology, which could be used for helping us all but instead is only about profit. Which is capitalism…
The neat part is that anything bad that happens under capitalism is capitalism’s fault, but anything good that happens is actually socialism happening in spite of capitalism, somehow.
Could you give some examples?
Socialized healthcare
And in what way does capitalism socialize healthcare in the United States?
Socialized healthcare exists there – at least until the current administration finishes ripping it away
It matters a lot though what kind of goal the system incentivises. Imagine if it was people’s happiness and freedom instead of quarterly profits.
Imagine if it was people’s happiness and freedom instead of quarterly profits
- Whose happiness and freedom?
- How is it to be measured?
- Capitalists honestly believe that free trade is the best albeit flawed way to do both of the above
It’s definitely valid to disagree about point #3, but then you need to give a better model for #1 and #2
That’s the part people never really seem to understand. It makes sense though, because we’re subjected to the system from birth and it’s all a lot of people know, so they can’t grasp the idea of a world outside of it. That can make it difficult to get through to people.
In this case it is purely the fault of the money incentive though. No one would spend so much effort and computation power on AI if they didn’t think it could make them money.
The funniest part is though that it’s only theoretical anyway, everyone is only losing on it and they’re most likely never gonna make it back.
Web3 was about enabling us to securely transfer value between people digitally and without middlemen
I don’t think it ever was that. I think Folding Ideas has the best explanation of what it was meant to be: a way to grab power away from those who already have it.
Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.
This is what I’ve kept saying about PoW being a shit bot management tactic. It’s a flat tax across all users, real or fake. The fake users are making money by accessing your site and will just eat the added expense. You can raise the tax to cost more than what your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.
If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.
Yeah, but AI companies are losing money, so in the long run Anubis seems like it should eventually return to working.
It’s the usual enshittification tactic. Make AI cheap so companies fire tech workers. Keep it cheap long enough that we all have established careers as McDonald’s branch managers, then whack up the prices once they’re locked in.
The cost of solving PoW for Anubis is absolutely not a factor in any AI company’s budget. Just the cost of answering one question is millions of times more expensive than running sha256sum for Anubis.
Just in case you’re being glib and mean the businesses will go under regardless of Anubis: most of these are coming from China. China absolutely will keep running these companies at a loss for the sake of strategic development.
Thanks for the info 👍 would not have thought Anubis would be so irrelevant
What’s the alternative?
Not much for open source solutions. A simple captcha however would cost scrapers more to crack than Anubis.
But when it comes to “real” bot management solutions: The least invasive solutions will try to match User-Agent and other headers against the TLS fingerprint and block if they don’t match. More invasive solutions will fingerprint your browser and even your GPU, then either block you or issue you a tracking cookie which is often pinned to your IP and user-agent. Both of those solutions require a large base of data to know what real and fake traffic actually looks like. Only large hosting providers like Cloudflare and Akamai have that data and can provide those sorts of solutions.
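The "least invasive" check described above can be sketched as a lookup: does the TLS fingerprint match what the claimed browser usually presents? Everything in this Python sketch is an assumption for illustration. The fingerprint strings are made-up placeholders rather than real fingerprint values, and `KNOWN_TLS_FINGERPRINTS`, `claimed_family` and `looks_spoofed` are hypothetical names. Real products rely on large observed datasets, which is the part small sites don't have.

```python
from typing import Optional

# Hypothetical table: browser family -> TLS fingerprints observed from genuine
# installs of that browser. The values are placeholders, not real fingerprints.
KNOWN_TLS_FINGERPRINTS = {
    "firefox": {"fp-firefox-example"},
    "chrome": {"fp-chrome-example"},
}


def claimed_family(user_agent: str) -> Optional[str]:
    """Crude guess at which browser the User-Agent header claims to be."""
    lowered = user_agent.lower()
    for family in KNOWN_TLS_FINGERPRINTS:
        if family in lowered:
            return family
    return None


def looks_spoofed(user_agent: str, tls_fingerprint: str) -> bool:
    """Flag requests whose TLS fingerprint contradicts the claimed browser."""
    family = claimed_family(user_agent)
    if family is None:
        return False  # unknown client: this sketch makes no judgement
    return tls_fingerprint not in KNOWN_TLS_FINGERPRINTS[family]


if __name__ == "__main__":
    print(looks_spoofed("Mozilla/5.0 (X11; Linux) Firefox/128.0", "fp-script-example"))
```

A scraper that sends a Firefox User-Agent from a scripting library's TLS stack would be flagged here, while one that drives a real browser would not.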
No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.
Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.
I’ve been saying this since day 1 of Anubis but nobody wants to hear it.
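The two sides of this argument can be put next to each other with the thread's own numbers. This Python sketch takes the "penny per request" analogy and the "thousandth of a cent per solve" guess, and applies both to ten million requests per day (the low end of "tens of millions"). The per-solve figure is a commenter's guess, not a measurement, and the function name is made up.

```python
def daily_cost_dollars(cost_per_request_millicents: int, requests_per_day: int) -> float:
    """Convert a per-request cost in thousandths of a cent into dollars per day."""
    # 1 dollar = 100 cents = 100_000 thousandths of a cent
    return cost_per_request_millicents * requests_per_day / 100_000


REQUESTS_PER_DAY = 10_000_000  # assumed: low end of "tens of millions"

# The analogy: a full penny (1000 thousandths of a cent) per request.
penny_analogy = daily_cost_dollars(1000, REQUESTS_PER_DAY)

# The guess: roughly a thousandth of a cent per PoW solve.
pow_guess = daily_cost_dollars(1, REQUESTS_PER_DAY)

if __name__ == "__main__":
    print(f"penny per request:     ${penny_analogy:,.0f} per day")
    print(f"guessed PoW solve cost: ${pow_guess:,.0f} per day")
```

Under these assumptions the penny analogy comes to $100,000 per day while the guessed solve cost comes to $100 per day, which is the gap the "they don’t care" argument rests on.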
The website would also have to display to users at the end of the day. It’s a similar problem to trying to solve media piracy. If worst comes to worst, the crawlers could read the page like a person would.