Guarding My Git Forge Against AI Scrapers
Posted by todsacerdoti 3 days ago
Comments
Comment by mappu 3 days ago
Comment by greenavocado 3 days ago
Comment by jauntywundrkind 2 days ago
I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it, even if you don't think you'll necessarily benefit: leave the door open to possibility!
Comment by fragmede 2 days ago
Comment by 01HNNWZ0MV43FF 3 days ago
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
Forgejo doesn't seem to have copied that feature yet
Comment by wiether 3 days ago
I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind it?
Comment by mappu 2 days ago
REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for regular human visitors to your site.
'expensive' is a middle ground that lets normal visitors browse and explore repos, view the README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or git history.
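For reference, this lives in the [service] section of Gitea's app.ini; roughly like this (a sketch only, so double-check your version's docs, since the "expensive" mode is marked experimental):

    [service]
    ; "true"      = signin required for every page
    ; "expensive" = signin required only for costly pages (file content at
    ;               specific commits, git history, ...); normal browsing stays open
    REQUIRE_SIGNIN_VIEW = expensive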
Comment by wiether 2 days ago
From Gitea's docs I was under the impression that it went further than "true", so I didn't understand why, since "true" was enough for me not to be bothered by bots.
But in your case you want a middle-ground, which is provided by "expensive"!
Comment by nextaccountic 1 day ago
Comment by FabCH 3 days ago
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
Comment by krupan 3 days ago
Comment by halJordan 3 days ago
Comment by tkfoss 3 days ago
Comment by anon7000 1 day ago
Comment by 01HNNWZ0MV43FF 3 days ago
Comment by FabCH 3 days ago
But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went from about 1,500 bot scans per day down to 0.
The tradeoff is just too big to ignore.
Comment by BobaFloutist 3 days ago
Every country has (at the very least) a few bad actors; it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
Comment by victorbjorklund 3 days ago
Comment by BobaFloutist 2 days ago
Comment by komali2 3 days ago
Comment by ralferoo 3 days ago
It's funny observing their tactics though. On the whole, spammers have moved from the bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@, and about a month after I blocked that, chat1@. I mostly block *@domain though, so I'm less aware of these trends.
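A rough sketch of the kind of wildcard blocking I mean, with made-up domains (illustrative Python, not my actual mail server config):

    import fnmatch

    # Hypothetical blocklist: whole domains, plus the rotating prefixed
    # subdomains described above.
    BLOCKED = ["*@spammer.example", "*@outreach.*", "*@msg.*", "*@chat.*", "*@email.*"]

    def is_blocked(sender: str) -> bool:
        sender = sender.lower()
        return any(fnmatch.fnmatch(sender, pattern) for pattern in BLOCKED)

    print(is_blocked("marketing@outreach.somecorp.example"))  # True
    print(is_blocked("alice@somecorp.example"))               # False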
Comment by ThatPlayer 3 days ago
Or I might try and put up Anubis only for them.
Comment by FabCH 3 days ago
I accidentally got locked out of my server when I connected over Starlink, which IP-maps to the US even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.
Comment by dotancohen 2 days ago
In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.
I don't know how much of a corner case this is.
Comment by ThatPlayer 2 days ago
Comment by lsaferite 3 days ago
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
Comment by fragmede 2 days ago
Comment by DANmode 2 days ago
Comment by redirectyou 3 days ago
Comment by kstrauser 3 days ago
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
Comment by dspillett 3 days ago
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
Comment by simonw 3 days ago
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
Comment by st3fan 2 days ago
Comment by ArcHound 3 days ago
I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise required; it's a networking abuse that gives you good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
Comment by Nextgrid 3 days ago
Comment by dirkc 3 days ago
I'm also left wondering what other things you could do. For example - I have several friends that built their own programming languages; I wonder what the impact would be if you translated lots of repositories to your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
Comment by zwnow 3 days ago
Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.
Comment by hurturue 3 days ago
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
Comment by brabel 3 days ago
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain that it has no interest in pursuing its own development and good relations with any other country; all it cares about is annoying the democratic countries with propaganda about their own messed-up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim, and worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at those countries' policies and their diplomatic interactions with other countries?!
Comment by nutjob2 3 days ago
That's actually pretty much spot on.
Comment by brabel 3 days ago
Comment by nutjob2 3 days ago
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
Comment by brabel 12 hours ago
Comment by mopsi 10 hours ago
This is at odds with the propaganda for foreign audiences that presents the war as a modern conflict with NATO, but thankfully, people like you who talk about listening to Russia don't actually know what's going on there and flat out refuse to listen to what Russians are saying.
For instance, the commander of the 2014 invasion of Donbas is a prominent public figure, a mentor and ideologue, who used to host lengthy livestreams in which he discussed how and why the war happened. Have you watched any of his long talks about the restoration of the Russian imperial province of Novorossiya through war against Ukraine, or do you prefer to pretend that none of this exists?
Not to mention the entire pre-Putin generation of Russian politicians and diplomats, who are very active on Twitter and readily explain how NATO is beneficial to Russia by imposing extensive standards on its members along Russia's western border.
Putin's own former senior advisor recently got so pissed about dumbasses placing blame on NATO that he published a video on his personal Youtube channel explaining why the entire narrative is a malicious misrepresentation of the facts and was bullshit from the start. According to him, Putin held secret staff meetings (which the advisor attended) about the invasion of Ukraine as early as 2005, which predates the common excuses for the war by many years.
But no. Instead of listening to Russians, you just repeat hollow Russian war propaganda that echoes across the internet without any real people behind it, believing that you have some insight that others lack.
Comment by hurturue 3 days ago
yes, i know, it's not a linear axis, it's a multi-dimensional perspective thing. so do a PCA/projection and spit out one number, according to your values/beliefs
Comment by tkfoss 3 days ago
Comment by nightpool 3 days ago
Comment by brabel 12 hours ago
I do agree with you that there are many news sites spreading misinformation, but I think that most of it is not coming from governments... and while governments are also doing this, most, I would think, do it with good intentions (they believe the information is true and barely verify it when it favours their preconceived points of view). When propaganda spreads information you like, you tend to call it just news.
The way Western media currently dismisses anything at all that comes from Russian sources as lies and propaganda, however, is way overblown in my opinion. That's causing a huge blind spot in the public discourse, which just makes the fake news sources seem even more attractive, since they appear to be whistleblowers fighting against a campaign of silence from the mainstream media, which is not completely incorrect.
[1] https://www.newsguardtech.com/special-reports/john-mark-doug...
[2] https://medium.com/@amithnmbr/why-its-important-to-know-how-...
Comment by frogperson 3 days ago
Comment by ekropotin 3 days ago
Comment by Bender 3 days ago
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require bots to tell me they accept English, optionally in addition to other languages, but not a couple that are blocked; e.g. en,de-DE,de is good, de-DE,de will fail. Just because. Not suggesting anyone do this.
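A toy sketch of that language check, for anyone who wants to play with the idea (illustrative Python, not my actual config; real requests carry the list in the Accept-Language header):

    def language_gate(accept_language: str) -> bool:
        """Allow the request only if the client says it accepts English,
        optionally alongside other languages (simplified version of the
        rule described above)."""
        langs = [part.split(";")[0].strip().lower()
                 for part in accept_language.split(",") if part.strip()]
        return any(lang == "en" or lang.startswith("en-") for lang in langs)

    print(language_gate("en,de-DE,de"))  # True  - passes
    print(language_gate("de-DE,de"))     # False - blocked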
Comment by cortesoft 3 days ago
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
Comment by Bender 3 days ago
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs and give it the old college try to correlate bots vs legit users with traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas in additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH there are additional protections that can be utilized at that layer.
[Edit] Other options lie outside of one's network, such as doing pull requests for, or making feature requests to, the maintainers of the git libraries so that HTTP requests look a lot more like a real browser and stand out from 99% of the bots. The vast majority of bots use really old libraries.
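For the access-log piece, even a crude first pass can surface the heaviest hitters before digging into TCP-level characteristics; a toy sketch, assuming common/combined log format and an arbitrary threshold:

    import re
    from collections import Counter

    # The first field of common/combined log format is the client IP.
    CLIENT_IP = re.compile(r"^(\S+)\s")

    def heavy_hitters(path="access.log", threshold=1000):
        """Count requests per client IP and return those above the threshold.
        Purely illustrative; real correlation should also weigh user agents,
        request timing, and the SYN-level traffic mentioned above."""
        hits = Counter()
        with open(path) as fh:
            for line in fh:
                match = CLIENT_IP.match(line)
                if match:
                    hits[match.group(1)] += 1
        return [(ip, count) for ip, count in hits.most_common() if count >= threshold]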
Comment by hashar 3 days ago
Comment by dspillett 3 days ago
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If by scrapers in terms of people writing them, then the fact that just web scraping is sufficient as mentioned above is likely the significant factor.
> why the scrappers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view "why should I care if getting what I want inconveniences anyone else?".
Comment by ACCount37 3 days ago
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
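A toy version of that "recognize a forge, clone instead of crawling" flow (the detection markers and URL guessing here are made up for illustration, not any real crawler's logic):

    import subprocess
    from urllib.parse import urlparse

    # Hypothetical markers suggesting a forge web UI in front of a plain git repo.
    FORGE_MARKERS = ("Powered by Gitea", "Powered by Forgejo", "/-/blob/", "/src/branch/")

    def maybe_clone_instead(url: str, html: str, dest: str) -> bool:
        """If the page looks like a repo's web UI, clone the repo once instead
        of crawling every blob/commit/diff page. Returns True if it cloned."""
        if not any(marker in html for marker in FORGE_MARKERS):
            return False
        parsed = urlparse(url)
        # Guess that the first two path segments are owner/repo (holds for the
        # common forges, but it is only a guess).
        parts = [p for p in parsed.path.split("/") if p][:2]
        if len(parts) != 2:
            return False
        clone_url = f"{parsed.scheme}://{parsed.netloc}/{parts[0]}/{parts[1]}.git"
        subprocess.run(["git", "clone", "--depth=1", clone_url, dest], check=True)
        return True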
Comment by immibis 3 days ago
They don't even use the Wikipedia dumps. They're extremely stupid.
Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.
Comment by FieryMechanic 3 days ago
Comment by conartist6 3 days ago
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
Comment by FieryMechanic 3 days ago
Failing that, I would use Chrome / PhantomJS or similar to browse the page in a real headless browser.
Comment by conartist6 3 days ago
Comment by conartist6 3 days ago
Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
Comment by conartist6 3 days ago
Comment by tigranbs 3 days ago
Comment by FieryMechanic 3 days ago
Comment by craftkiller 3 days ago
Comment by wrxd 3 days ago
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
Comment by sodimel 3 days ago
Comment by cookiengineer 1 day ago
I have the same problem, but I decided to maintain ASN lists of known spammers [1] and combine that with my eBPF-based firewall that just drops their connections before they reach the kernel [2].
So my websites, wikis and other things are protected by the same firewall architecture, for which I can deploy a unified "blockmap", so to speak. Probably gonna open source the dashboard for maintaining that over the holidays, too, as I'm trying to make everything combinable in a plug-and-play sense for Go backends, similar to my markdown editor UI [3].
I also open sourced my LPM hashset map library, which allows processing large quantities of prefixes, because it's way faster than LPM tries (read as: it takes less than 100ms to process all RIR and WHOIS data compared to around an hour with LPM tries) [4].
[1] https://github.com/cookiengineer/antispam
[2] https://github.com/tholian-network/firewall
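For the curious, the hashset approach to longest-prefix matching is roughly this (a much-simplified, IPv4-only Python rendition for illustration, not the actual library):

    import ipaddress

    class PrefixSet:
        """Longest-prefix match via one set of network addresses per prefix
        length, instead of a trie. IPv4 only, simplified illustration."""

        def __init__(self, cidrs):
            self.by_len = {}  # prefix length -> set of network addresses as ints
            for cidr in cidrs:
                net = ipaddress.ip_network(cidr)
                self.by_len.setdefault(net.prefixlen, set()).add(int(net.network_address))

        def contains(self, ip: str) -> bool:
            addr = int(ipaddress.ip_address(ip))
            for plen in sorted(self.by_len, reverse=True):  # most specific first
                mask = ((1 << plen) - 1) << (32 - plen)
                if (addr & mask) in self.by_len[plen]:
                    return True
            return False

    blocked = PrefixSet(["203.0.113.0/24", "198.51.100.0/22"])
    print(blocked.contains("203.0.113.42"))  # True
    print(blocked.contains("192.0.2.1"))     # False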
Comment by overfeed 3 days ago
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Comment by qudat 3 days ago
Comment by drzaiusx11 3 days ago
Comment by danudey 3 days ago
Comment by drzaiusx11 2 days ago
Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Just stick a memcache in front and call it a day, no?
In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web-hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure, but that's not what we're talking about (I think?)
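Roughly what I mean, as a toy sketch (not any particular forge's real code; a real deployment would use memcached/Redis warmed from a post-receive hook rather than an in-process cache):

    import subprocess
    from functools import lru_cache

    @lru_cache(maxsize=4096)
    def rendered_file_page(repo_path: str, commit: str, path: str) -> str:
        """Render a file-at-commit page once, then serve it from the cache.
        Commits are immutable, so the page never needs re-rendering."""
        blob = subprocess.run(
            ["git", "-C", repo_path, "show", f"{commit}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return "<pre>" + blob + "</pre>"  # stand-in for the real HTML renderer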
Comment by PeterStuer 3 days ago
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in terms of information per query. Wastefulness of the described type not only loads your servers, but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near 0, as they have 0 interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.
Comment by lgeek 2 days ago
VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc. Most companies would use separate ASes for residential vs hosting use, but I guess they don't, which would make them very attractive to someone deploying crawlers.
And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.
Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].
[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia... [2] https://bunnycommunications.com/ [3] https://bgp.tools/as/5065#upstreams
Comment by GoblinSlayer 3 days ago
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
Comment by nitwit005 3 days ago
Comment by oconnore 3 days ago
Comment by Artoooooor 2 days ago
Comment by grayhatter 2 days ago
Comment by zoobab 3 days ago
Comment by toastal 3 days ago
Comment by benlivengood 3 days ago
Comment by ccgreg 2 days ago
Our public web dataset goes back to 2008, and is widely used by academia and startups.
Comment by pdimitar 2 days ago
- How often is that updated?
- How current is it at any point in time?
- Does it have historical / temporal access i.e. be able to check the history of a page a la The Internet Archive?
Comment by ccgreg 1 day ago
- it's a historical archive, the concept of "current" is hard to turn into a metric
- not only is our archive historical, it is included in the Internet Archive's wayback machine.
Comment by hurturue 3 days ago
do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?
Comment by johneth 3 days ago
When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of their content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
Comment by komali2 3 days ago
But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/
Comment by wrxd 3 days ago
Comment by microtherion 3 days ago
b) They have a complete lack of respect for robots.txt
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
Comment by BenjiWiebe 3 days ago
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
Comment by voidUpdate 3 days ago
Comment by charcircuit 3 days ago
Comment by WhyOhWhyQ 3 days ago
Comment by dns_snek 3 days ago
I can't speak for everyone, but the web should be free and scraping should be allowed insofar as it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you try to pervert the original intent behind those beliefs.
I see this as a clear example of the paradox of tolerance.
Comment by pelotron 3 days ago
Comment by stevetron 3 days ago
Comment by captn3m0 3 days ago
Comment by xyzal 3 days ago
Comment by moooo99 3 days ago
Comment by tpxl 3 days ago
Comment by jepj57 3 days ago
Comment by adastra22 3 days ago
Comment by evgpbfhnr 3 days ago
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Comment by ArcHound 3 days ago
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
Comment by krupan 3 days ago
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Comment by pabs3 3 days ago
Comment by klaussilveira 3 days ago
Comment by wrxd 3 days ago
Comment by eddyg 3 days ago
Comment by ronsor 3 days ago
Comment by immibis 2 days ago
Comment by yunnpp 3 days ago
Comment by reactordev 3 days ago
Comment by frogperson 3 days ago
Is this viable?
Comment by hamdingers 3 days ago
Comment by kstrauser 3 days ago
Comment by grayhatter 3 days ago
no
for many reasons
Comment by frozenseven 2 days ago
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
Sell it to someone else.
Comment by bavent 2 days ago
Comment by frozenseven 2 days ago
>Maybe you should grow the f*ck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription. But that's not happening. Eat sh*t.