Guarding My Git Forge Against AI Scrapers
Posted by todsacerdoti 3 days ago
Comments
Comment by mappu 3 days ago
Comment by greenavocado 3 days ago
Comment by jauntywundrkind 2 days ago
I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it, even if you don't think you'll necessarily benefit: leave the door open to possibility!
Comment by fragmede 2 days ago
Comment by 01HNNWZ0MV43FF 3 days ago
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
Forgejo doesn't seem to have copied that feature yet
Comment by wiether 3 days ago
I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind it?
Comment by mappu 2 days ago
REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for regular human visitors to your site.
'expensive' is a middle ground that lets normal visitors browse and explore repos, view the README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or git history.
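For reference, this lives in the [service] section of Gitea's app.ini; roughly like this (a sketch only, so double-check your version's docs, since the "expensive" mode is marked experimental):

    [service]
    ; "true"      = signin required for every page
    ; "expensive" = signin required only for costly pages (file content at
    ;               specific commits, git history, ...); normal browsing stays open
    REQUIRE_SIGNIN_VIEW = expensive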
Comment by wiether 2 days ago
From Gitea's docs I was under the impression that it went further than "true", so I didn't understand why, since "true" was enough for me not to be bothered by bots.
But in your case you want a middle-ground, which is provided by "expensive"!
Comment by nextaccountic 1 day ago
Comment by FabCH 3 days ago
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
Comment by krupan 3 days ago
Comment by halJordan 3 days ago
Comment by tkfoss 3 days ago
Comment by anon7000 1 day ago
Comment by 01HNNWZ0MV43FF 3 days ago
Comment by FabCH 3 days ago
But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went from about 1,500 bot scans per day down to 0.
The tradeoff is just too big to ignore.
Comment by BobaFloutist 3 days ago
Every country has (at the very least) a few bad actors; it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
Comment by victorbjorklund 3 days ago
Comment by BobaFloutist 2 days ago
Comment by komali2 3 days ago
Comment by ralferoo 3 days ago
It's funny observing their tactics though. On the whole, spammers have moved from the bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@, and about a month after I blocked that, chat1@. I mostly block *@domain though, so I'm less aware of these trends.
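A rough sketch of the kind of wildcard blocking I mean, with made-up domains (illustrative Python, not my actual mail server config):

    import fnmatch

    # Hypothetical blocklist: whole domains, plus the rotating prefixed
    # subdomains described above.
    BLOCKED = ["*@spammer.example", "*@outreach.*", "*@msg.*", "*@chat.*", "*@email.*"]

    def is_blocked(sender: str) -> bool:
        sender = sender.lower()
        return any(fnmatch.fnmatch(sender, pattern) for pattern in BLOCKED)

    print(is_blocked("marketing@outreach.somecorp.example"))  # True
    print(is_blocked("alice@somecorp.example"))               # False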
Comment by ThatPlayer 3 days ago
Or I might try and put up Anubis only for them.
Comment by FabCH 3 days ago
I accidentally got locked out of my server when I connected over Starlink, which IP-maps to the US even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.
Comment by dotancohen 2 days ago
In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.
I don't know how much of a corner case this is.
Comment by ThatPlayer 2 days ago
Comment by lsaferite 3 days ago
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
Comment by fragmede 2 days ago
Comment by DANmode 2 days ago
Comment by redirectyou 3 days ago
Comment by kstrauser 3 days ago
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
Comment by dspillett 3 days ago
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
Comment by simonw 3 days ago
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
Comment by st3fan 2 days ago
Comment by ArcHound 3 days ago
I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise required; it's a networking abuse that gives you good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
Comment by Nextgrid 3 days ago
Comment by dirkc 3 days ago
I'm also left wondering what other things you could do. For example - I have several friends that built their own programming languages; I wonder what the impact would be if you translated lots of repositories to your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
Comment by zwnow 3 days ago
Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.
Comment by hurturue 3 days ago
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
Comment by brabel 3 days ago
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain that it has no interest in pursuing its own development and good relations with any other country; all it cares about is annoying the democratic countries with propaganda about their own messed-up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim, and worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at those countries' policies and their diplomatic interactions with other countries?!
Comment by nutjob2 3 days ago
That's actually pretty much spot on.
Comment by brabel 3 days ago
Comment by nutjob2 3 days ago
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
Comment by brabel 12 hours ago
Comment by mopsi 10 hours ago
This is at odds with the propaganda for foreign audiences that presents the war as a modern conflict with NATO, but thankfully, people like you who talk about listening to Russia don't actually know what's going on there and flat out refuse to listen to what Russians are saying.
For instance, the commander of the 2014 invasion of Donbas is a prominent public figure, a mentor and ideologue, who used to host lengthy livestreams in which he discussed how and why the war happened. Have you watched any of his long talks about the restoration of the Russian imperial province of Novorossiya through war against Ukraine, or do you prefer to pretend that none of this exists?
Not to mention the entire pre-Putin generation of Russian politicians and diplomats, who are very active on Twitter and readily explain how NATO is beneficial to Russia by imposing extensive standards on its members along Russia's western border.
Putin's own former senior advisor recently got so pissed about dumbasses placing blame on NATO that he published a video on his personal Youtube channel explaining why the entire narrative is a malicious misrepresentation of the facts and was bullshit from the start. According to him, Putin held secret staff meetings (which the advisor attended) about the invasion of Ukraine as early as 2005, which predates the common excuses for the war by many years.
But no. Instead of listening to Russians, you just repeat hollow Russian war propaganda that echoes across the internet without any real people behind it, believing that you have some insight that others lack.
Comment by hurturue 3 days ago
yes, i know, it's not a linear axis, it's a multi-dimensional perspective thing. so do a PCA/projection and spit out one number, according to your values/beliefs
Comment by tkfoss 3 days ago
Comment by nightpool 3 days ago
Comment by brabel 12 hours ago
I do agree with you that there are many news sites spreading misinformation, but I think that most of it is not coming from governments... and while governments are also doing this, most, I would think, do it with good intentions (they believe the information is true and barely verify it when it favours their preconceived points of view). When propaganda spreads information you like, you tend to call it just news.
The way Western media currently dismisses anything at all that comes from Russian sources as lies and propaganda, however, is way overblown in my opinion. That's causing a huge blind spot in the public discourse, which just makes the fake news sources seem even more attractive, since they appear to be whistleblowers fighting against a campaign of silence from the mainstream media, which is not completely incorrect.
[1] https://www.newsguardtech.com/special-reports/john-mark-doug...
[2] https://medium.com/@amithnmbr/why-its-important-to-know-how-...
Comment by frogperson 3 days ago
Comment by ekropotin 3 days ago
Comment by Bender 3 days ago
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require bots to tell me they accept English, optionally in addition to other languages, but not a couple that are blocked; e.g. en,de-DE,de is good, de-DE,de will fail. Just because. Not suggesting anyone do this.
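A toy sketch of that language check, for anyone who wants to play with the idea (illustrative Python, not my actual config; real requests carry the list in the Accept-Language header):

    def language_gate(accept_language: str) -> bool:
        """Allow the request only if the client says it accepts English,
        optionally alongside other languages (simplified version of the
        rule described above)."""
        langs = [part.split(";")[0].strip().lower()
                 for part in accept_language.split(",") if part.strip()]
        return any(lang == "en" or lang.startswith("en-") for lang in langs)

    print(language_gate("en,de-DE,de"))  # True  - passes
    print(language_gate("de-DE,de"))     # False - blocked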
Comment by cortesoft 3 days ago
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
Comment by Bender 3 days ago
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs and give it the old college try to correlate bots vs legit users with traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas in additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH there are additional protections that can be utilized at that layer.
[Edit] Other options lie outside of one's network, such as doing pull requests for, or making feature requests to, the maintainers of the git libraries so that HTTP requests look a lot more like a real browser and stand out from 99% of the bots. The vast majority of bots use really old libraries.
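For the access-log piece, even a crude first pass can surface the heaviest hitters before digging into TCP-level characteristics; a toy sketch, assuming common/combined log format and an arbitrary threshold:

    import re
    from collections import Counter

    # The first field of common/combined log format is the client IP.
    CLIENT_IP = re.compile(r"^(\S+)\s")

    def heavy_hitters(path="access.log", threshold=1000):
        """Count requests per client IP and return those above the threshold.
        Purely illustrative; real correlation should also weigh user agents,
        request timing, and the SYN-level traffic mentioned above."""
        hits = Counter()
        with open(path) as fh:
            for line in fh:
                match = CLIENT_IP.match(line)
                if match:
                    hits[match.group(1)] += 1
        return [(ip, count) for ip, count in hits.most_common() if count >= threshold]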
Comment by hashar 3 days ago
Comment by dspillett 3 days ago
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If by scrapers in terms of people writing them, then the fact that just web scraping is sufficient as mentioned above is likely the significant factor.
> why the scrappers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view "why should I care if getting what I want inconveniences anyone else?".
Comment by ACCount37 3 days ago
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
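A toy version of that "recognize a forge, clone instead of crawling" flow (the detection markers and URL guessing here are made up for illustration, not any real crawler's logic):

    import subprocess
    from urllib.parse import urlparse

    # Hypothetical markers suggesting a forge web UI in front of a plain git repo.
    FORGE_MARKERS = ("Powered by Gitea", "Powered by Forgejo", "/-/blob/", "/src/branch/")

    def maybe_clone_instead(url: str, html: str, dest: str) -> bool:
        """If the page looks like a repo's web UI, clone the repo once instead
        of crawling every blob/commit/diff page. Returns True if it cloned."""
        if not any(marker in html for marker in FORGE_MARKERS):
            return False
        parsed = urlparse(url)
        # Guess that the first two path segments are owner/repo (holds for the
        # common forges, but it is only a guess).
        parts = [p for p in parsed.path.split("/") if p][:2]
        if len(parts) != 2:
            return False
        clone_url = f"{parsed.scheme}://{parsed.netloc}/{parts[0]}/{parts[1]}.git"
        subprocess.run(["git", "clone", "--depth=1", clone_url, dest], check=True)
        return True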
Comment by immibis 3 days ago
They don't even use the Wikipedia dumps. They're extremely stupid.
Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.
Comment by FieryMechanic 3 days ago
Comment by conartist6 3 days ago
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
Comment by FieryMechanic 3 days ago
Failing that, I would use Chrome / PhantomJS or similar to browse the page in a real headless browser.
Comment by conartist6 3 days ago
Comment by conartist6 3 days ago
Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
Comment by conartist6 3 days ago
Comment by tigranbs 3 days ago
Comment by FieryMechanic 3 days ago
Comment by craftkiller 3 days ago
Comment by wrxd 3 days ago
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
Comment by sodimel 3 days ago
Comment by cookiengineer 1 day ago
I have the same problem, but I decided to maintain ASN lists of known spammers [1] and combine that with my eBPF-based firewall that just drops their connections before they reach the kernel [2].
So my websites, wikis and other things are protected by the same firewall architecture, for which I can deploy a unified "blockmap", so to speak. Probably gonna open source the dashboard for maintaining that over the holidays, too, as I'm trying to make everything combinable in a plug-and-play sense for Go backends, similar to my markdown editor UI [3].
I also open sourced my LPM hashset map library, which allows processing large quantities of prefixes, because it's way faster than LPM tries (read as: it takes less than 100ms to process all RIR and WHOIS data compared to around an hour with LPM tries) [4].
[1] https://github.com/cookiengineer/antispam
[2] https://github.com/tholian-network/firewall
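For the curious, the hashset approach to longest-prefix matching is roughly this (a much-simplified, IPv4-only Python rendition for illustration, not the actual library):

    import ipaddress

    class PrefixSet:
        """Longest-prefix match via one set of network addresses per prefix
        length, instead of a trie. IPv4 only, simplified illustration."""

        def __init__(self, cidrs):
            self.by_len = {}  # prefix length -> set of network addresses as ints
            for cidr in cidrs:
                net = ipaddress.ip_network(cidr)
                self.by_len.setdefault(net.prefixlen, set()).add(int(net.network_address))

        def contains(self, ip: str) -> bool:
            addr = int(ipaddress.ip_address(ip))
            for plen in sorted(self.by_len, reverse=True):  # most specific first
                mask = ((1 << plen) - 1) << (32 - plen)
                if (addr & mask) in self.by_len[plen]:
                    return True
            return False

    blocked = PrefixSet(["203.0.113.0/24", "198.51.100.0/22"])
    print(blocked.contains("203.0.113.42"))  # True
    print(blocked.contains("192.0.2.1"))     # False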
Comment by overfeed 3 days ago
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Comment by qudat 3 days ago
Comment by drzaiusx11 3 days ago
Comment by danudey 3 days ago
Comment by drzaiusx11 2 days ago
Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Just stick a memcache in front and call it a day, no?
In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web-hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure, but that's not what we're talking about (I think?)
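Roughly what I mean, as a toy sketch (not any particular forge's real code; a real deployment would use memcached/Redis warmed from a post-receive hook rather than an in-process cache):

    import subprocess
    from functools import lru_cache

    @lru_cache(maxsize=4096)
    def rendered_file_page(repo_path: str, commit: str, path: str) -> str:
        """Render a file-at-commit page once, then serve it from the cache.
        Commits are immutable, so the page never needs re-rendering."""
        blob = subprocess.run(
            ["git", "-C", repo_path, "show", f"{commit}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return "<pre>" + blob + "</pre>"  # stand-in for the real HTML renderer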
Comment by PeterStuer 3 days ago
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in terms of information per query. Wastefulness of the described type not only loads your servers, but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near 0, as they have 0 interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.
Comment by lgeek 2 days ago
VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc. Most companies would use separate ASes for residential vs hosting use, but I guess they don't, which would make them very attractive to someone deploying crawlers.
And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.
Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].
[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia... [2] https://bunnycommunications.com/ [3] https://bgp.tools/as/5065#upstreams
Comment by GoblinSlayer 3 days ago
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
Comment by nitwit005 3 days ago
Comment by oconnore 3 days ago
Comment by Artoooooor 2 days ago
Comment by grayhatter 2 days ago
Comment by zoobab 3 days ago
Comment by toastal 3 days ago
Comment by benlivengood 3 days ago
Comment by ccgreg 2 days ago
Our public web dataset goes back to 2008, and is widely used by academia and startups.
Comment by pdimitar 2 days ago
- How often is that updated?
- How current is it at any point in time?
- Does it have historical / temporal access i.e. be able to check the history of a page a la The Internet Archive?
Comment by ccgreg 1 day ago
- it's a historical archive, the concept of "current" is hard to turn into a metric
- not only is our archive historical, it is included in the Internet Archive's wayback machine.
Comment by hurturue 3 days ago
do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?
Comment by johneth 3 days ago
When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of their content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
Comment by komali2 3 days ago
But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/
Comment by wrxd 3 days ago
Comment by microtherion 3 days ago
b) They have a complete lack of respect for robots.txt
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
Comment by BenjiWiebe 3 days ago
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
Comment by voidUpdate 3 days ago
Comment by charcircuit 3 days ago
Comment by WhyOhWhyQ 3 days ago
Comment by dns_snek 3 days ago
I can't speak for everyone, but the web should be free and scraping should be allowed insofar as it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you try to pervert the original intent behind those beliefs.
I see this as a clear example of the paradox of tolerance.
Comment by pelotron 3 days ago
Comment by stevetron 3 days ago
Comment by captn3m0 3 days ago
Comment by xyzal 3 days ago
Comment by moooo99 3 days ago
Comment by tpxl 3 days ago
Comment by jepj57 3 days ago
Comment by adastra22 3 days ago
Comment by evgpbfhnr 3 days ago
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Comment by ArcHound 3 days ago
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
Comment by krupan 3 days ago
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Comment by pabs3 3 days ago
Comment by klaussilveira 3 days ago
Comment by wrxd 3 days ago
Comment by eddyg 3 days ago
Comment by ronsor 3 days ago
Comment by immibis 2 days ago
Comment by yunnpp 3 days ago
Comment by reactordev 3 days ago
Comment by frogperson 3 days ago
Is this viable?
Comment by hamdingers 3 days ago
Comment by kstrauser 3 days ago
Comment by grayhatter 3 days ago
no
for many reasons
Comment by frozenseven 2 days ago
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
Sell it to someone else.
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
>Maybe you should grow the f*ck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription. But that's not happening. Eat sh*t.