A type-safe, realtime collaborative Graph Database in a CRDT

Posted by phpnode 12 hours ago

Comments

Comment by AlotOfReading 8 hours ago

I'm not terribly familiar with graph databases, but perhaps someone who is can explain the advantage of this awfully complicated seeming design. There's gremlin, cypher, yjs, and zod, all of which I understand are different languages for different problems.

What's the advantage of using all these different things in one system? You can do all of this in datalog. You get strong eventual consistency naturally. LLMs know how to write it. It's type safe. JS implementations exist [0].

[0] https://github.com/tonsky/datascript

Comment by phpnode 8 hours ago

Gremlin-like API gives end to end type safety if you're querying the database from TypeScript. This was the original motivation for the library.

Zod/Valibot/ArkType/Standard Schema support because you need a way to define your schema and this allows for that at runtime and compile time.

Y.js as a backing store because I needed to support offline sync, branching/forking, and I use Y.js for collaborative editing in my product, so I needed to be able to store the various CRDT types as properties within the graph. e.g. you can have a `description` property on your vertices or edges that is backed by a Y.Text or Y.XmlElement

Cypher because until the arrival of codemode it wasn't feasible to have LLMs write queries using the Gremlin-like API and LLMs already know Cypher.

Most of all though, this was an experiment that ended up being useful.

Comment by UltraSane 8 hours ago

The advantage for property graph databases using Cypher query language is that the queries for things like "show me all systems connected to this system by links greater than 10Gbps up to n hops away" are vastly easier to write and faster to complete compared to SQL and relational databases. Cypher lets you easily search for arbitrary graph patters and the result is also a graph, not a denormalized table.

Comment by cmrdporcupine 3 hours ago

Parent commenter was asking compare to datalog (not SQL) which eats recursive graph transitions like this for lunch, making the queries very elegant to read ... while still staying relational.

I'm personally of the opinion that "graph databases" should be relational databases; the relational model can subsume "graph" queries, but for all the reasons Codd laid out back in the 60s... network (aka connected graph) databases cannot do the latter.

Let the query planner figure out the connectivity story, not a hardcoded data model.

  % 1. Base case: Directly connected systems (1 hop) with   bandwidth > 10
  fast_path(StartSys, EndSys, 1) :- 
      link(StartSys, EndSys, Bandwidth), 
      Bandwidth > 10.

  % 2. Recursive case: N-hop connections via an intermediate system
  fast_path(StartSys, EndSys, Hops) :- 
      fast_path(StartSys, IntermediateSys, PrevHops), 
      link(IntermediateSys, EndSys, Bandwidth), 
      Bandwidth > 10,
      Hops = PrevHops + 1.

  % 3. The Query: Find all systems connected to 'System_A' within 5 hops
  ?- fast_path('System_A', TargetSystem, Hops), Hops <= 5.

or in RelationalAI's "Rel" language, such as I remember it, this is AI assisted it could be wrong:

  // 1. Base case: Directly connected systems (1 hop)
  def fast_path(start_sys, end_sys, hops) =
    exists(bw: link(start_sys, end_sys, bw) and bw > 10 and hops = 1)

  // 2. Recursive case: Traverse to the next system
  def fast_path(start_sys, end_sys, hops) =
    exists(mid_sys, prev_hops, bw:
      fast_path(start_sys, mid_sys, prev_hops) and
      link(mid_sys, end_sys, bw) and bw > 10 and hops = prev_hops + 1)

  // 3. The Query: Select targets connected to "System_A" within 5 hops
  def output(target_sys, hops) =
    fast_path("System_A", target_sys, hops) and hops <= 5

https://www.relational.ai/post/graph-normal-form

https://www.dataversity.net/articles/say-hello-to-graph-norm...

...

That said, modern SQL can do this just fine, just... much harder to read.

  WITH RECURSIVE fast_path AS (
    -- 1. Base case: Directly connected systems from our starting node
    SELECT
      start_sys,
      end_sys,
      1 AS hops
    FROM link
    WHERE start_sys = 'System_A' AND bandwidth > 10
    UNION ALL

    -- 2. Recursive case: Traverse to the next system
    SELECT 
      fp.start_sys, 
      l.end_sys, 
      fp.hops + 1
    FROM fast_path fp
    JOIN link l ON fp.end_sys = l.start_sys
    WHERE l.bandwidth > 10 AND fp.hops < 5
  )

  -- 3. The Query: Select the generated graph paths
  SELECT * FROM fast_path;

Comment by ghc 15 minutes ago

> the relational model can subsume "graph" queries, but for all the reasons Codd laid out back in the 60s... network (aka connected graph) databases cannot do the latter.

Except network databases have little in common with graph databases...they're much more closely related to hierarchical databases.

I would say that Graph databases are now a strict superset of relational databases, not the other way around. In a graph database a node's named edge can naturally point to a node of any type or having any property schema. Doing this in a relational model requires one of several approaches that could only be classified as fighting against the model (or torturing it, as my PI liked to say).

Comment by UltraSane 2 hours ago

JOINS make these kinds of queries get slower as the number of hops gets larger. And property graph databases have the big advantage of not having to mutilate their query results to fit into a flat table. A path query returns a path object of connected nodes. Property graphs are superior for applications with deep, variable-length connections, such as social networks, recommendation engines, fraud detection, and IT network mapping. Property graph databases work well with object oriented programming where objects map to nodes very well.

RelationalAI's model is very cool but it is cloud only software.

Comment by cmrdporcupine 2 hours ago

It'd going to get slower as the number of hops get bigger regardless of whether you accomplish that via relational join or pointer traversal. And doing it the former gives great opportunity for optimization during execution through vectorization or more appropriate index data structure usage.

Modern computers are also much more efficient at batch / vector processing than they are pointer hops. By either CPU or GPU, the whole system is much more tuned for working with data as sets / tensors rather than "hopping" through the data via pointer.

How you materialize your results for processing is your own business. The advantage of the relational model is its consistent throughout; the source data, the manipulation format for the operators, and the output are all relations. You can visualize it as "flat table" but that's an unimaginative visualization. You can just as easily twist that into a nested hierarchical structure in an object oriented language if you're so unfortunate as to be stuck working that way.

A "table" is only a visualization of the data, not the "form" the data actually takes and it's unfortunate that SQL chose this word.

It's better to conceive of a relation as a series of facts or propositions about the world. Each "row" is a statement. When read that way, it's a lot more elegant.

Comment by UltraSane 2 hours ago

Neo4j-style index-free adjacency gives you O(1) per-hop neighbor lookup per node; cost is proportional to the subgraph actually visited. A k-way self-join on a relational edge table has to produce and filter intermediate tuples at each join step, and intermediate result size can blow up multiplicatively even if the final result is small. A k-way self-join on a relational edge table has to produce and filter intermediate tuples at each join step, and intermediate result size can blow up multiplicatively even if the final result is small. For deep traversals with selective endpoints, the relational plan's intermediate materialization is exactly the problem native graph engines were built to avoid.

You can just as easily see nodes and edges in a property graph as propositions about the world. The nice thing is that you can model relationships between entities as first class entities. nodes have the implicit property of being non-fungible.

Do you know of any relational database that returns a query result as normalized tables the way neo4j returns a sub-graph?

Comment by cmrdporcupine 1 hour ago

Honestly, I won't convince you, so I'll leave you to the pleasure of your pointer chasing, L1 cache misses, multiplicative blowups, and fixed / inflexibly structured taxonomies.

Comment by UltraSane 1 hour ago

In practice if you index the right properties and have the right relationships in Neo4j then queries are very fast. And neo4j doesn't have taxonomies. You might be thinking of RDF. One of the best things about Neo4j compared to relational databases is the lack of a rigid schema.

Comment by 40four 8 hours ago

I’ve recently gotten obsessed with local first app architecture, so I’m really digging into CRDT and trying to get familiar with it. So this looks very interesting to me. Thanks for posting!

Comment by 2ndorderthought 11 hours ago

Can anyone explain why it is a good idea to make a graphdb in typescript? This not a language flamewar question, more of an implementation details question.

Though typescript is pretty fast, and the language is flexible, we all know how demanding graph databases are. How hard they are to shard, etc. It seems like this could be a performance trap. Are there successful rbdms or nosql databases out there written in typescript?

Also why is everything about LLMs now? Can't we discuss technologies for their face value anymore. It's getting kind of old to me personally.

Comment by phpnode 11 hours ago

I needed it to be possible to run the graph in the browser and cloudflare workers, so TS was a natural fit here. It was built as an experiment into end to end type safety - nothing to do with LLMs, but it ended up being useful in the product I'm building. It's not designed for large data sets.

Comment by rglullis 10 hours ago

> It's not designed for large data sets.

How large is large, here? Tens of thousands of triples? Hundreds? Millions?

I'm working on a local-first browser extension for ActivityPub, and currently I am parsing the JSON-LD and storing the triples in specialized tables on pglite to be able to make fast queries on that data.

It would be amazing to ditch the whole thing and just deal with triples based on the expanded JSON-LD, but I wonder how the performance would be. While using the browser extension for a week, the store accumulated ~90k thousand JSON-lD documents, which would probably mean 5 times as many triples. Storage wise is okay (~300MB), but I think that a graph database would only be useful to manage "hot data", not a whole archive of user activity.

Comment by phpnode 10 hours ago

it depends on the backing store (which is pluggable). I would not want to exceed let's say 50Mb in a Y.js doc, but i've tested the in memory storage with graphs approaching 1Gb and it's been fine - like any graph it really depends on how you query it. Most of the docs I'm dealing with in production are less than 10Mb, so this is fine for my use cases, but... buyer beware!

Comment by 2ndorderthought 11 hours ago

Makes sense thanks for explaining the use case. The LLM question was only because of the comments at the time of the post.

The query syntax looks nice by the way.

Comment by phpnode 11 hours ago

thanks, it was as close to Gremlin[0] as I could get without losing type safety (Gremlin is untyped)

[0] https://tinkerpop.apache.org/

Comment by rapnie 8 hours ago

Gleam might be a great choice perhaps. Compiles to typescript and Erlang/BEAM.

Comment by ForHackernews 9 hours ago

You can use webassembly in both places https://developers.cloudflare.com/workers/runtime-apis/webas...

Comment by rigonkulous 6 hours ago

>Also why is everything about LLMs now? Can't we discuss technologies for their face value anymore. It's getting kind of old to me personally.

Fully agree with you, LLM's everywhere is getting churlsome, and trite. Sure, one can generate the code, but can one talk about the code?

Can one move, easily, beyond "but why should I?" into "I should because ..", whenever it comes to some realm, one is supposdely to have "conquered" with AI/ML/etc.?

Sure, we can all write great specs, sit back and watch the prompts rolling in off the horizon...

But seriously: the human reasoning of it is the most important part!

However everyone is busy building things secretly that they don't want anyone to know about with AI/ML (except the operators of course, duh, lol, kthxbai...) because, after all, then anyone else can do it .. and in this secrecy, human literature is transforming - imho - in non-positive ways, for the future.

Kids need to read books, and quick!

>old to me personally

I kind of regret seeing it in one of my favourite realms, retro- computing .. but .. what can we do, it is here and now and the kids are using it even if some don't.

I very much concur with you on the desire to continue being able to discuss technologies at face value, personally. We have to be reviewing the code; the point at which we don't review AI's output, is where we become subjects of it.

This is, probably, like the very early days of porno, or commercial tobacco, or indeed industrialized combustion, wherein it kind of took society a few big leaps and dark tumbles before the tech sort of stabilized. I'm not sure we'll get to the "Toyota" stages of AI, or indeed we're just going to blast right past into having AI control literally every device under the sun, whether we like it or not (and/or, have an impact on the technologically-obsolete curve, -/+'vely...)

Ain't no easily-accessible AI-hookery Typescript in 8-bit land .. I will have my retro- computing without that filthy stinking AI/ML please, mmkay, thanks!!!

Comment by voodooEntity 10 hours ago

I kinda feel ya.

I wrote my own in-Memory Graph (i'd rather call it storage than a DB) some years ago in golang, even there i was wondering if golang actually is the optimal technology for something like a database especially due to the garbage collection/stop the world/etc. Its just there are certain levels of optimization i will never be able to properly reach (lets ignore possible hacks). Looking at a solution in typescript, no matter how "nice" it looks, this just doesnt seem to be the correct "tool/technology" for the target.

And inb4, there are use cases for everything, and same as i wouldn't write a website in C, i also wouldn't write a database in javascript/typescript.

I just would argue this is the wrong pick.

@llms : im not even getting into this because if you dont wanne read "llm" you basically can't read 99% of news nowadays. ¯\_(ツ)_/¯

edit: im a big fan of graph databases so im happy about every public attention they get ^

Comment by rs545837 6 hours ago

Oh this is cool. The Yjs as storage backend trick is clever, you basically get CRDT sync for free without having to build your own replication layer. And the pluggable storage means you can develop against in-memory and then flip to YGraph for collab mode without touching your queries. That's a nice developer experience.

The live queries also caught my eye. Having traversals auto reexecute when data changes sounds straightforward until you realize the underlying data is being merged from multiple peers concurrently. Getting that right without stale reads or phantom edges is genuinely hard.

I've been researching on something like this in a similar space but for source code, therefore built a tool called Weave(https://github.com/Ataraxy-Labs/weave) for entity level merges for git. Instead of merging lines of text, it extracts functions, classes, and methods, builds a dependency graph between them, and merges at that level.

Seeing codemix makes me think there might be something interesting here. Right now our entity graph and our CRDT state are two separate things. The graph lives our analysis engine and the CRDT lives in different crate. If something like @codemix/graph could unify those, you'd have a single data structure where the entity dependency graph is the CRDT.

Comment by paul_h 3 hours ago

Semantic merge. PlasticSCM had that a feature many years back

Comment by lo1tuma 11 hours ago

15 years ago I was a big fan of this chaining methods pattern. These days I don’t like it anymore. Especially when it comes to unit-testing and implementing fake objects it becomes quite cumbersome to setup the exact same interface.

Comment by phpnode 11 hours ago

unfortunately it's unavoidable if you want to preserve type safety. I did consider parsing Cypher in typescript types, but it's not worth the effort and it's not possible to do safely.

Comment by rounce 10 hours ago

Why not with a pipe that returns a function, the type of which is determined by the args of the pipe? That is possible to make typesafe in TS. That way you can have both APIs where the chained version is just wrapping successive pipe calls.

Comment by brianbcarter 11 hours ago

Cypher-over-Gremlin is a smart call — LLMs can write Cypher, makes the MCP angle viable in a new way.

How dos Yjs handle schema migrations? If I add a property to a vertex type that existing peers have cached, does it conflict or drop the unknown field?

Comment by tevon 3 hours ago

The CRDT enables eventual consistency on these schema updates, so a new field will be eventually consistent

Comment by phpnode 9 hours ago

Instances that have not updated to the latest version of the schema will ignore the additional property but will not break or conflict.

Comment by cush 8 hours ago

The page keeps crashing on safari

Comment by esafak 8 hours ago

Got benchmarks?

Comment by cyanydeez 12 hours ago

Eventually someone will figure out how to use a graph database to allow an agent to efficiency build & cull context to achieve near determinant activities. Seems like one needs a sufficiently powerful schema and a harness that properly builds the graph of agent knowledge, like how ants naturally figure how where sugar is, when that stockpile depletes and shifts to other sources.

This looks neat, but if you want it to be used for AI purposes, you might want to show a schema more complicated than a twitter network.

Comment by j-pb 11 hours ago

Working on exactly that! We're local first, but do distributed sync with iroh. Written in rust and fully open source.

Imho having a graph database that is really easy to use and write new cli applications on top of works much better. You don't need strong schema validation so long as you can gracefully ignore what your schema doesn't expect by viewing queries as type/schema declarations.

https://github.com/magic-locker/faculties

Comment by local_surfer 7 hours ago

Interesting! Triblespace seems similar to TerminusDB and the solution presented here - would you mind stating the differences ?

Comment by j-pb 5 hours ago

Spot on!

In one word. Simplicity!

TerminusDB is a RDF database with build-in prolog, a heavy focus on succinct data structure indexes, and has a client server model.

Triblespace is purposefully not RDF, because RDF has horrible DX, and it does not have a good reconciliation and distributed consistency story.

It's virtually impossible to get RDF data into a canonical form [https://www.w3.org/TR/rdf-canon/#how-to-read] and trivial stuff like equality of two datasets is NP-Hard.

Triblespace, while also a data exchange standard like RDF, is closer to Datascript or Datomic. It's a Rust library, and great care has been taken to give it extremely nice DX.

In-memory datasets are cheaply clonable, and support efficient set operations. There's macros that integrate fully into the type system to perform data generation and queries.

    // The entity! macro returns a rooted fragment; merge its facts into
    // a TribleSet via `+=`.
    let herbert = ufoid();
    let dune = ufoid();
    let mut library = TribleSet::new();

    library += entity! { &herbert @
        literature::firstname: "Frank",
        literature::lastname: "Herbert",
    };

    library += entity! { &dune @
        literature::title: "Dune",
        literature::author: &herbert,
        literature::quote: ws.put(
            "I must not fear. Fear is the mind-killer."
        ),
    };

    ws.commit(library, "import dune");

    // `checkout(..)` returns a Checkout — a TribleSet paired with the
    // commits that produced it, usable for incremental delta queries.
    let catalog = ws.checkout(..)?;
    let title = "Dune";

    // Multi-entity join: find quotes by authors of a given title.
    // `_?author` is a pattern-local variable that joins without projecting.
    for (f, l, quote) in find!(
        (first: String, last: String, quote),
        pattern!(&catalog, [
            { _?author @
                literature::firstname: ?first,
                literature::lastname: ?last
            },
            { _?book @
                literature::title: title,
                literature::author: _?author,
                literature::quote: ?quote
            }
        ])
    ) {
        let quote: View<str> = ws.get(quote)?;
        let quote = quote.as_ref();
        println!("'{quote}'\n - from {title} by {f} {l}.");
    }

Data has a fully tracked history like in terminus, but we are overall more CRDT-like with multiple scopes of transactionality.

You can store stuff in either S3 or a single local file (for the local file you can union two databases by concatenating them with `cat`).

We also have just recently added sync through Iroh.

The core idea and main difference between RDF is that RDF is text based and weakly typed, we are binary and strongly typed.

We split everything into two basic structures: - the tribles (a pun on binary triple), 64byte units that are split into [16byte entity id | 16byte attribute id | 32byte Value] where the first two are basically high entropy identifiers like UUIDs, and the last is either a Blake3 hash, or an inlined <32byte value, with the type being disambiguated by metadata on the attribute ID (itself represented as more tribles) - blobs, content addressed, arbitrary length

It's pretty easy to see why canonical representations are pretty easy for us, we just take all of the tribles, sort them lexicographically, dedup them, store the resulting array in a blob. Done.

Everythign else is build up from that. Oh and we also have succinct datastructures, but because those are dense but slower, and immutable, we have a custom 256-ary radix trie to do all of the immutable set operations.

The query engine is also custom, we don't have a query planner which gives us 0.5-2.5microseconds of latency for queries depending on the number of joins, with a query engine that is fully extensible via traits in rust.

Comment by mentalgear 2 hours ago

This sounds all great - I just wish there was a JS/TS port to be able to use it in the browser or from node/deno/bun!

Comment by j-pb 2 hours ago

Fair, it actually started out in JS, moved to Deno, then Zig and ended in Rust.

If I ever find the time I'd like to back port what I have now, up the chain.

It is supposed to be a RDF replacement so it will eventually have to happen, but it's hard work to make everything extremely idiomatically integrated into the host language.

Comment by embedding-shape 12 hours ago

I'd wager the problem is on the side of "LLMs can't value/rank information good enough" rather than "The graph database wasn't flexible/good enough", but I'd be happy to be shown counter-examples.

I'm sure once that problem been solved, you can use the built-in map/object of whatever language, and it'll be good enough. Add save/load to disk via JSON and you have long-term persistence too. But since LLMs still aren't clever enough, I don't think the underlying implementation matters too much.

Comment by lmeyerov 11 hours ago

It's interesting to think of where the value comes from. Afaict 2 interesting areas:

A: One of the main lessons of the RAG era of LLMs was reranked multiretrieval is a great balance of test time, test compute, and quality at the expense of maintaining a few costly index types. Graph ended up a nice little lift when put alongside text, vector, and relational indexing by solving some n-hop use cases.

I'm unsure if the juice is worth the squeeze, but it does make some sense as infra. Making and using these flows isn't that conceptually complicated and most pieces have good, simple OSS around them.

B: There is another universe of richer KG extraction with even heavier indexing work. I'm less clear on the ROI here in typical benchmarks relative to case A. Imagine going full RDF, vs the simpler property graph queries & ontologies here, and investing in heavy entity resolution etc preprocessing during writes. I don't know how well these improve scores vs regular multiretrieval above, and how easy it is to do at any reasonable scale.

In practice, a lot of KG work lives out of the DB and agent, and in a much fancier kg pipeline. So there is a missing layer with less clear proof and a value burden.

Seperately, we have been thinking about these internally. We have been building gfql , oss gpu cypher queries on dataframes etc without needing a DB -- reuse existing storage tiers by moving into embedded compute tier -- and powering our own LLM usage has been a primary internal use case for us. Our experiences have led us to prioritizing case A as a next step for what the graph engine needs to support inside, and viewing case B as something that should live outside of it in a separate library . This post does make me wonder if case B should move closer into the engine to help streamline things for typical users, akin how solr/lucene/etc helped make elastic into something useful early on for search.

Comment by alansaber 11 hours ago

I'm conceptually very bullish on B (entity resolution and hierarchy pre-processing during writes). I'm less certain than A and B need to be merged into a single library. Obviously, a search agent should know the properties of the KG being searched, but as the previous poster mentioned, these graph dbs are inherently inaccurate and only form part of the retrieval pattern anyway.

Comment by lmeyerov 9 hours ago

Maybe it's useful to split out B1) KG pipelines from the choice of B2) simple property graph ontologies & queries vs advanced rdf ontologies and sparql queries

It sounds like you are thinking about KG pipelines, but I'm unclear on whether typed property graphs, vs more advanced RDF/SPARQL, is needed in your view on the graph engine side?

Comment by plaguuuuuu 9 hours ago

im pretty sure gastown (the Beads part) stores tasks/memories/whatever in a DAG but I haven't looked into it in detail

Comment by phpnode 12 hours ago

the airline graph is more complex, I can show the schema for that if you think it's useful?

Comment by _alphageek 1 hour ago

[dead]

Comment by llmradar 11 hours ago

[dead]