I made my own Git
Posted by TonyStr 2 days ago
Comments
Comment by nasretdinov 2 days ago
[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...
Comment by arunix 2 days ago
Comment by nasretdinov 2 days ago
Comment by Guvante 1 day ago
If there was a better way to handle "I needed to merge in the middle of my PR work" without introducing reverse merges permanently into the history, I wouldn't mind merge commits.
But tools will sometimes skip over others' work if you `git pull` a change into your local repo, due to getting confused about which leg of the merge to follow.
Comment by nasretdinov 1 day ago
Comment by direwolf20 2 days ago
Comment by pyrolistical 2 days ago
Comment by lmm 2 days ago
Comment by seba_dos1 2 days ago
Comment by ezst 2 days ago
https://www.mercurial-scm.org/pipermail/mercurial/2012-Janua...
Comment by nasretdinov 1 day ago
Comment by ezst 1 day ago
I really don't see any downside to recommending mercurial in 2026. Git isn't just inferior as a VCS in the subjective sense of "oh… I don't like this or that inconsistent aspect of its UI", but in very practical and meaningful ways (on technical merit) that are increasingly forgotten about the more it solidifies as a monopoly:
- still no support for branches (in the traditional sense, as a commit-level marker, to delineate series of related commits) means that a branchy DAG is borderline useless, and tools like bisect can't use that info to land you at the series boundaries
- still no support for phasing (to mark which commits have been exchanged or are local-only and safe to edit)
- still no support for evolve (to record history rewrites in a side-storage, making concurrent/distributed history rewrites safe and mostly automatic)
Comment by pwdisswordfishy 2 days ago
Comment by valleyer 1 day ago
This resolves any number of heads, but the resulting tree of the merge is always
that of the current branch head, effectively ignoring all changes from all other
branches. It is meant to be used to supersede old development history of side
branches. Note that this is different from the -Xours option to the ort merge strategy.
Comment by Brian_K_White 2 days ago
Comment by kbolino 2 days ago
[1]: https://git-scm.com/docs/merge-strategies#Documentation/merg...
Comment by pwdisswordfishy 2 days ago
> There already is reset hard.
That's not... remotely relevant? What does that have to do with merging? We're talking about merging.
Comment by Brian_K_White 1 day ago
I also "mean what I wrote". Man that was sure easy to say. It's almost like saying nothing at all. Which is anyone's righ to do, but it's not an argument, nor a definition of terms, nor communication at all. Well, it does communicate one thing.
Comment by pwdisswordfishy 1 day ago
> don't try to resolve any merge conflicts ... Don't try to "help". Don't fuck with the index or the worktree.
... certainly is "nothing" in the literal sense--that that's what is desired of git-merge to do, but it's not "nothing" in the sense that you're saying.
git reset --hard has nothing to do with merging. Nothing. They're not even in the same class of operations. It's absolutely irrelevant to this use case. And saying so isn't "not an argument" or not communicating anything at all. git reset --hard does not in any sense effect a merge. What more needs to be (or can be) said?
If you want someone to help explain something to you, it's up to you to give them an anchor point that they can use to bridge the gap in understanding. As it stands, it's you who's given nothing at all, so one can only repeat what has already been described--
A resolution strategy for merge conflicts that involves doing nothing: nothing to the files in the current directory, staging nothing to be committed, and in fact not even bothering to check for conflicts in the first place. Just notate that it's going to be a merge between two parents X and Y, and wait for the human so they have an opportunity to resolve the conflicts by hand (if they haven't already), for them to add the changes to the staging area, and for them to issue the git-commit command that completes the merge between X and Y. What's unclear about this?
Comment by kbolino 1 day ago
git merge -s ours --no-ff --no-commit <branch>
This will initiate a merge, take nothing from the incoming branch, and allow you to decide how to proceed. This leaves git waiting for your next commit, and the two branches will be considered merged when that commit happens. What you may want to do next is:
git checkout -p <branch>
This will interactively review each incoming change, giving you the power to decide how each one should be handled. Once you've completed that process, commit the result and the merge is done.
Comment by mkleczek 2 days ago
Comment by jcgl 2 days ago
> Jujutsu keeps track of conflicts as first-class objects in its model; they are first-class in the same way commits are, while alternatives like Git simply think of conflicts as textual diffs. While not as rigorous as systems like Darcs (which is based on a formalized theory of patches, as opposed to snapshots), the effect is that many forms of conflict resolution can be performed and propagated automatically.
Comment by PunchyHamster 2 days ago
Comment by zaphar 2 days ago
Comment by storystarling 2 days ago
Comment by 3eb7988a1663 2 days ago
Comment by theLiminator 2 days ago
Comment by speed_spread 2 days ago
Comment by rob74 2 days ago
Comment by theLiminator 2 days ago
Comment by giancarlostoro 2 days ago
Give me normal boring git merges over git squash merges.
Comment by p0w3n3d 2 days ago
Comment by iberator 2 days ago
I always forget all the flags and I work with literally just: clone, branch, checkout, push.
(Each feature is a fresh branch tho)
Comment by chungy 2 days ago
Take out the last "/timeline" component of the URL to clone via Fossil: https://chiselapp.com/user/chungy/repository/test/timeline
See also, the upstream documentation on branches and merging: https://fossil-scm.org/home/doc/trunk/www/branching.wiki
Comment by darkryder 2 days ago
For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.
Comment by MarsIronPI 2 days ago
[0]: https://tom.preston-werner.com/2009/05/19/the-git-parable
Comment by spuz 2 days ago
Comment by sanufar 2 days ago
Comment by teiferer 2 days ago
Ends up being circular if the author used LLM help for this writeup, though there are no obvious signs of that.
Comment by TonyStr 2 days ago
Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).
Comment by tonnydourado 2 days ago
Comment by Phelinofist 2 days ago
Comment by Phelinofist 2 days ago
(block_ai) {
    @ai_bots {
        header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
    }
    abort @ai_bots
}
Then, in a specific app block, include it via: import block_ai
Comment by seba_dos1 2 days ago
Comment by zaphar 2 days ago
Comment by Zambyte 2 days ago
blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're the only ones i had to block to stay sane
Comment by MarsIronPI 2 days ago
Comment by Zambyte 2 days ago
Comment by MarsIronPI 2 days ago
Comment by nerdponx 2 days ago
Comment by teiferer 2 days ago
Comment by adastra22 2 days ago
Comment by below43 2 days ago
Comment by program_whiz 2 days ago
A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(
Comment by nerdponx 2 days ago
Comment by wredcoll 2 days ago
Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.
There's a reason we have attractive nuisance laws and you aren't allowed to put a slide in your yard that electrocutes anyone who touches it.
None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.
Comment by program_whiz 2 days ago
Comment by teo_zero 2 days ago
Comment by 0x696C6961 2 days ago
Comment by teiferer 2 days ago
> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)
The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.
Comment by storystarling 2 days ago
Comment by teiferer 1 day ago
Comment by adastra22 2 days ago
Comment by wasmainiac 2 days ago
Comment by jdiff 2 days ago
Comment by falcor84 2 days ago
Comment by whstl 2 days ago
Comment by prmoustache 2 days ago
Comment by stevekemp 2 days ago
Comment by sailfast 2 days ago
For most people throughout history, whatever is presented to you is what you believe is the right answer. AI just brings them source information faster, so what you're seeing is mostly the usual behavior, just accelerated. Before AI, people would not have bothered trying to figure out an answer to some of these questions. It would've been too much work.
Comment by topaz0 2 days ago
Comment by keybored 2 days ago
Comment by andy_ppp 2 days ago
Comment by jama211 2 days ago
Comment by jama211 2 days ago
Comment by mexicocitinluez 2 days ago
Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.
Comment by jama211 2 days ago
Comment by mexicocitinluez 2 days ago
Comment by jama211 1 day ago
Comment by anu7df 2 days ago
Comment by prodigycorp 2 days ago
One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks in English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.
Comment by evntdrvn 2 days ago
I wish I could find it again, if someone else knows the link please post it!
Comment by gxnxcxcx 2 days ago
Comment by tverbeure 2 days ago
This part made me laugh though:
> These detectors, as I understand them, often work by measuring two key things: ‘Perplexity’ and ‘burstiness’. Perplexity gauges how predictable a text is. If I start a sentence, "The cat sat on the...", your brain, and the AI, will predict the word "floor."
I can't be the only one whose brain predicted "mat"?
Comment by evntdrvn 13 hours ago
Comment by awesome_dude 2 days ago
I do know that LLMs generate content heavy with those constructs, but they didn't create the ideas out of thin air; it was in the training set, and existed strongly enough that LLMs saw it as commonplace/best practice.
Comment by blenderob 2 days ago
Comment by prodigycorp 2 days ago
Comment by prodigycorp 2 days ago
Comment by gkbrk 2 days ago
Comment by brendoncarroll 2 days ago
Notable differences: E2E encryption, parallel imports (Got will light up all your cores), and a data structure that supports large files and directories.
Comment by rtkwe 2 days ago
Comment by brendoncarroll 2 days ago
Yeah, totally agree. Got has not solved conflict resolution for arbitrary files. However, we can tell the user where the files differ, and that the file has changed.
There is still value in being able to import files and directories of arbitrary sizes, and having the data encrypted. This is the necessary infrastructure to be able to do distributed version control on large amounts of private data. You can't do that easily with Git. It's very clunky even with remote helpers and LFS.
I talk about that in the Why Got? section of the docs.
Comment by DASD 2 days ago
Comment by brendoncarroll 2 days ago
Comment by p4bl0 2 days ago
Comment by UltraSane 2 days ago
Comment by TonyStr 2 days ago
Bookmarked for later
Comment by mfashby 2 days ago
Comment by sluongng 2 days ago
I think theoretically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharded storage is required, path-based delta dictionary compression does much better. Git recently (in the last year) got something called "path-walk" which is fairly similar though.
Comment by sublinear 2 days ago
I know this is only meant to be an educational project, but please avoid yaml (especially for anything generated). It may be a superset of json, but that should strongly suggest that json is enough.
I am aware I'm making a decade old complaint now, but we already have such an absurd mess with every tool that decided to prefer yaml (docker/k8s, swagger, etc.) and it never got any better. Let's not make that mistake again.
People just learned to cope or avoid yaml where they can, and luckily these are such widely used tools that we have plenty of boilerplate examples to cheat from. A new tool lacking docs or examples that only accepts yaml would be anywhere from mildly frustrating to borderline unusable.
Comment by oldestofsports 2 days ago
I had a go at it as well a while back, I call it "shit" https://github.com/emanueldonalds/shit
Comment by hahahahhaah 2 days ago
Comment by tpoacher 2 days ago
Comment by temporallobe 2 days ago
Comment by alsetmusic 2 days ago
It's only in the context of recreating Git that this comment makes sense.
Comment by igorw 2 days ago
Comment by nasretdinov 2 days ago
P.S. Didn't know that plain '@' can be used instead of HEAD, but I guess it makes sense since you can omit both left and right parts of the expressions separated by '@'
Comment by sneela 2 days ago
Why not tvc-hub :P
Jokes aside, great write up!
Comment by TonyStr 2 days ago
Comment by KolmogorovComp 2 days ago
Content-based chunking like Xethub uses really should become the default. It’s not like it’s new either, rsync is based on it.
Comment by h1fra 2 days ago
And this way of versioning can be reused in other fields: as soon as you have some kind of graph of data that can be modified independently but read all together, it makes sense.
Comment by kgeist 2 days ago
How about using sqlite for this? Then you wouldn't need to parse anything, just read/update tables. Fast indexing out of the box, too.
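For illustration, a minimal sketch of what that could look like (assuming the rusqlite crate; the table layout and function names here are made up for the example, not anything tvc or git actually does):
use rusqlite::{params, Connection, Result};

// Content-addressed object store in a single SQLite file.
fn open_store(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects (
             hash TEXT PRIMARY KEY,
             kind TEXT NOT NULL,   -- 'blob', 'tree' or 'commit'
             data BLOB NOT NULL
         )",
        params![],
    )?;
    Ok(conn)
}

fn put_object(conn: &Connection, hash: &str, kind: &str, data: &[u8]) -> Result<()> {
    // INSERT OR IGNORE: storing identical content twice is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO objects (hash, kind, data) VALUES (?1, ?2, ?3)",
        params![hash, kind, data],
    )?;
    Ok(())
}

fn get_object(conn: &Connection, hash: &str) -> Result<Vec<u8>> {
    conn.query_row(
        "SELECT data FROM objects WHERE hash = ?1",
        params![hash],
        |row| row.get(0),
    )
}
Fossil, mentioned elsewhere in the thread, goes all the way down this road: its repository format is a single SQLite database.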
Comment by grenran 2 days ago
Comment by dchest 2 days ago
It's basically plaintext. Even deltas are plaintext for text files.
Reason: "The global state of a fossil repository is kept simple so that it can endure in useful form for decades or centuries. A fossil repository is intended to be readable, searchable, and extensible by people not yet born."
Comment by TonyStr 2 days ago
[0] https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki#...
Comment by smartmic 2 days ago
[0]: https://fossil-scm.org/home/doc/trunk/www/rebaseharm.md
Comment by jact 2 days ago
I also use Fossil for lots of weird things. I created a forum game using Fossil’s ticket and forum features because it’s so easy to spin up and for my friends to sign in to.
At work we ended up using Fossil in production to manage configuration and deployment in a highly locked down customer environment where its ability to run as a single static binary, talk over HTTP without external dependencies, etc. was essential. It was a poor man’s deployment tool, but it performed admirably.
Fossil even works well as a blogging platform.
Comment by embedding-shape 2 days ago
I really enjoy how local-first it is, as someone who sometimes works without an internet connection. That the data around "work" is part of the SCM as well, not just the code, makes a lot of sense to me at a high level, and many times I wish git worked the same...
Comment by usrbinbash 2 days ago
But yeah, fossil is interesting, and it's a crying shame it's not better known, for the exact reasons you point out.
Comment by embedding-shape 2 days ago
It isn't though, Fossil integrates all the data around the code too in the "repository", so issues, wiki, documentation, notes and so on are all together, not like in git where most commonly you have those things on another platform, or you use something like `git notes` which has maybe 10% of the features of the respective Fossil feature.
It might be useful to scan through the list of features of Fossil and dig into it, because it does a lot more than you seem to think :) https://fossil-scm.org/home/doc/trunk/www/index.wiki
Comment by adastra22 2 days ago
Comment by embedding-shape 2 days ago
If you don't trust me, read the list of features and give it a try yourself: https://fossil-scm.org/home/doc/trunk/www/index.wiki
Comment by adastra22 2 days ago
Comment by embedding-shape 1 day ago
Comment by graemep 2 days ago
It is very easy to self host.
Not having staging is awkward at first but works well once you get used to it.
I prefer it for personal projects. I think it's better for small teams if people are willing to adjust, but I have not had enough opportunities to try it.
Comment by TonyStr 2 days ago
Comment by graemep 2 days ago
I think the ethos is to discourage it.
It does not seem to be possible to commit just specific lines.
Comment by jact 2 days ago
Comment by justabrowser 2 days ago
Comment by adzm 2 days ago
Comment by storystarling 2 days ago
Comment by SQLite 2 days ago
Comment by eru 2 days ago
That's a weird thing to put so close to the start. Compression is about the least interesting aspect of Git's design.
Comment by alphabetag675 2 days ago
Comment by eru 1 day ago
It's just that git does a much more interesting job with compression, actually. Lots more to learn. They don't compress the snapshots via something like zstd directly; that comes much later, after a delta step. (Interestingly, that delta compression step doesn't use the diffs that `git show` shows you for your commits.)
Comment by astinashler 2 days ago
Comment by lucasoshiro 2 days ago
Comment by TonyStr 2 days ago
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`
=== args ===
decompress
f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5
=== tvcignore ===
./target
./.git
./.tvc
=== subcommand ===
decompress
------------------
tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97
Comment by woodrowbarlow 2 days ago
Comment by heckelson 2 days ago
Comment by TonyStr 2 days ago
Comment by mg794613 2 days ago
Hmm, don't be so hard on yourself!
proceeds to call ls from rust
Ok nevermind, although I don't think rust is the issue here.
(Tony I'm joking, thanks for the article)
Comment by bryan2 2 days ago
I wonder if signing sha-1 mitigates the threat of using an outdated hash.
Comment by athrowaway3z 2 days ago
Comment by aabbcc1241 2 days ago
Comment by direwolf20 2 days ago
Comment by jrockway 2 days ago
Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/
It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.
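A rough way to try it for yourself (sketch only; assumes the sha2 and blake3 crates, and just reads the whole file into memory):
use std::{env, fs, time::Instant};
use sha2::{Digest, Sha256};

fn main() {
    // Usage: <binary> <some-big-file>
    let path = env::args().nth(1).expect("pass a file path");
    let data = fs::read(&path).expect("read file");
    let mb = data.len() as f64 / 1e6;

    let t = Instant::now();
    let sha = Sha256::digest(&data);
    let hex: String = sha.iter().map(|b| format!("{:02x}", b)).collect();
    println!("sha256  {:7.1} MB/s  {}", mb / t.elapsed().as_secs_f64(), hex);

    let t = Instant::now();
    let b3 = blake3::hash(&data);
    println!("blake3  {:7.1} MB/s  {}", mb / t.elapsed().as_secs_f64(), b3.to_hex());
}
Build it with --release, otherwise every hash will look slow regardless of algorithm.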
Comment by EdSchouten 2 days ago
Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.
For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.
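A toy version of the chunking idea (illustrative sketch only; real implementations use a proper Gear/FastCDC-style rolling hash with minimum and maximum chunk sizes):
// Content-defined chunking: cut points depend only on nearby bytes, so
// editing the start of a file only changes the chunks around the edit,
// and identical regions dedupe across files and versions.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    const MIN_CHUNK: usize = 2 * 1024;   // avoid degenerate tiny chunks
    const AVG_BITS: u32 = 13;            // aim for roughly 8 KiB average chunks
    let mut cuts = Vec::new();
    let mut start = 0usize;
    let mut h: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        // Gear-style rolling hash: each byte's influence shifts out over time.
        h = (h << 1).wrapping_add((b as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15));
        if i + 1 - start >= MIN_CHUNK && h >> (64 - AVG_BITS) == 0 {
            cuts.push(i + 1);
            start = i + 1;
            h = 0;
        }
    }
    if !data.is_empty() && cuts.last() != Some(&data.len()) {
        cuts.push(data.len());
    }
    cuts
}
Each chunk is then hashed and stored as its own object, and the large file becomes a small manifest of chunk hashes; that per-chunk hashing is what parallelizes trivially.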
Comment by oconnor663 2 days ago
Comment by EdSchouten 1 day ago
Comment by grumbelbart2 2 days ago
Comment by oconnor663 2 days ago
Comment by holoduke 2 days ago
Comment by prakhar1144 2 days ago
"What's inside .git ?" - https://prakharpratyush.com/blog/7/
Comment by lasgawe 2 days ago
Comment by ofou 2 days ago
Comment by smangold 2 days ago
Comment by b1temy 2 days ago
The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.
I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git` directory, and only updates it if it senses the file has changed (Hm... If it can't detect this by re-hashing the file and comparing it with a known value, perhaps by the timestamp the file was last edited?).
> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets
This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/
I've not personally heard the process of hashing also being called digesting, though I don't doubt that it is the case. I'm mostly familiar with the resulting hash being referred to as the message digest. Perhaps it's to differentiate the verb 'hash' (the process of hashing) from the output 'hash' (the result of hashing). And naming the function `sha256::try_digest` makes it more explicit that it is returning the hash/digest. But it is a bit of a reach; perhaps they are just synonyms to be used interchangeably, as you said.
On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.
--
Also, obligatory mention of jujutsu/jj since it seems to always be mentioned when talking of a VCS in HN.
Comment by TonyStr 2 days ago
In my lazy implementation, I don't even check if the hashes match; the program reads, compresses and tries to write even the unchanged files. This is an obvious area to improve performance on. I've noticed that git speeds up object lookups by generating two-letter directories from the first two letters in hashes, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.
> why were TOML files not considered at the end
I'm just not that familiar with toml. Maybe that would be a better choice! I saw another commenter who complained about yaml. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you'll have to add the `-p` flag (--pretty) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in yaml/toml/json/xml/special format.
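The fan-out layout is easy to bolt on; a rough sketch of the path scheme (illustrative only, not the actual tvc or git code):
use std::path::PathBuf;

// git-style fan-out: the first two hex characters of the hash become a
// directory, the rest becomes the file name, so no single directory ends
// up holding every object in the repository. Assumes a full hex hash.
fn object_path(objects_dir: &str, hash: &str) -> PathBuf {
    let (dir, file) = hash.split_at(2);
    PathBuf::from(objects_dir).join(dir).join(file)
}

fn store_object(objects_dir: &str, hash: &str, compressed: &[u8]) -> std::io::Result<()> {
    // e.g. ".git/objects/as/df12ha89k9fhs98..." for hash "asdf12ha89k9fhs98..."
    let path = object_path(objects_dir, hash);
    std::fs::create_dir_all(path.parent().expect("path has a parent"))?;
    std::fs::write(path, compressed)
}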
Comment by b1temy 2 days ago
I agree, but I'm still iffy on reading all files (already an expensive operation) in the repository, then hashing every one of them, every time you do an ls or a commit. I took a quick look and git seems to check whether it needs to recalculate the hash based on a combination of the modification timestamp and whether the filesize has changed, which is not foolproof either, since the timestamp can be modified and the filesize can remain the same while the contents differ.
I'm not too sure how to solve this myself. Apparently this is a known thing in git, called the "racy git" problem: https://git-scm.com/docs/racy-git/ To be honest, perhaps I'm biased from working in a large repository, but I'd rather take the tradeoff of not rehashing often than suffer the rare case of a file being changed without modifying its timestamp while remaining the same size. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, them having access to my filesystem is a far larger problem...)
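Roughly the shape of that check, as a sketch (the struct and function names here are made up; git's real index entry stores more fields, e.g. ctime, inode and mode):
use std::collections::HashMap;
use std::fs;
use std::time::SystemTime;

// Minimal per-file cache entry: enough to decide whether to re-hash.
struct CachedEntry {
    size: u64,
    mtime: SystemTime,
    hash: String, // content hash from the last time the file was actually read
}

// Re-read and re-hash a file only when its size or mtime no longer match
// what the index remembers; otherwise trust the cached hash.
fn needs_rehash(index: &HashMap<String, CachedEntry>, path: &str) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(match index.get(path) {
        Some(entry) => meta.len() != entry.size || meta.modified()? != entry.mtime,
        None => true, // file not in the index yet
    })
}
The racy-git workaround described in that doc sits on top of this: entries whose mtime is not older than the index file's own timestamp are treated as possibly racy and have their contents compared anyway.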
> I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...
Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the toml syntax, but ultimately, it does not matter much what format it is in. I mainly asked out of curiosity since your first thoughts were to use yaml or json, when I see (completely empirically) most Rust devs prefer toml, probably because of familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (as is to be expected with most Rust projects), so I suppose you must be at least a little bit familiar with it, at least from a user perspective. But I suppose you likely have even more experience with yaml and json, which is why it came to mind first.
Comment by TonyStr 2 days ago
Oh that is interesting. I feel like the only way to get a better and more reliable solution to this would be to have the OS generate a hash each time the file changes, and store that in file metadata. This seems like a reasonable feature for an OS to me, but I don't think any OS does this. Also, it would force programs to rely on whichever hashing algorithm the OS uses.
Comment by b1temy 2 days ago
I'm not sure I would want this either tbh. If I have a 10GB file on my filesystem, and I want to fseek to a specific position in the file and just change a single byte, I would probably not want it to re-hash the entire file, which will probably take a minute longer compared to not hashing the file. (Or maybe it's fine and it's fast enough on modern systems to do this every time a file is modified by any program; I don't know how much this would impact the performance.)
Perhaps a higher-resolution timestamp from the OS might help though, for decreasing the chance of a file having the exact same timestamp (unless it was specifically crafted to be so).
Comment by quijoteuniv 2 days ago
Comment by smekta 2 days ago
Comment by jonny_eh 2 days ago
Comment by black_13 2 days ago