Distributed DuckDB Instance

Posted by citguru 7 days ago

Comments

Comment by herpderperator 7 days ago

Does this help with DuckDB concurrency? My main gripe with DuckDB is that you can't write to it from multiple processes at the same time. If you open the database in write mode with one process, you cannot modify it at all from another process without the first process completely releasing it. In fact, you cannot even read from it from another process in this scenario.

So if you typically use a file-backed DuckDB database in one process and want to quickly modify something in that database using the DuckDB CLI (like you might connect SequelPro or DBeaver to make changes to a DB while your main application is 'using' it), then it complains that it's locked by another process and doesn't let you connect to it at all.

This is unlike SQLite, which supports and handles this in a thread-safe manner out of the box. I know it's DuckDB's explicit design decision[0], but it would be amazing if DuckDB could behave more like SQLite when it comes to this sort of thing. DuckDB has incredible quality-of-life improvements with many extra types and functions supported, not to mention all the SQL dialect enhancements allowing you to type much more concise SQL (they call it "Friendly SQL"), which executes super efficiently too.

[0] https://duckdb.org/docs/current/connect/concurrency

Comment by szarnyasg 7 days ago

Hi, DuckDB DevRel here. To have concurrent read-write access to a database, you can use our DuckLake lakehouse format and coordinate concurrent access through a shared Postgres catalog. We released v1.0 yesterday: https://ducklake.select/2026/04/13/ducklake-10/

I updated your reference [0] with this information.

Comment by nrjames 6 days ago

Regarding documentation, I think the DuckLake docs would benefit from a relatively simple “When should I consider using DuckLake?” type FAQ entry. You have sections for what, how, and why, essentially, and a few simple use cases and/or case studies could help provide the aha moment to people in data jobs who are inundated with marketing from other companies. It would help folks like me understand under which circumstances I would stand to benefit most from using DuckLake.

Comment by szarnyasg 5 days ago

DuckDB devrel here. You are right. This was in the FAQ but I also added it to the DuckLake documentation's main page at https://ducklake.select/docs/stable/

Comment by nrjames 4 days ago

My employer is in the midst of migrating petabytes of data from Snowflake to DataBricks. They’re sold on the “all in one” nature of the platform and believe they’ll save significant money through a contract locking them into DataBricks running on Azure. It is a wildly disruptive process in an environment where the “Snowflake police” (as we call them) have been hounding everybody to reduce credit usage. Now the IT platform team is trying to explain units of work to non-technical VPs, for example, and there’s mass confusion. All signs point to them ending up in the same situation with expensive DataBricks bills, vendor lock in, and a future migration to try to reduce costs.

I guess what I was trying to say is that DuckLake isn’t even a blip on their radar. Should it be? Could you explain it to a non-technical marketing VP as part of a cost savings measure? What’s the DuckLake equivalent to a Unit of Work on DataBricks or a Snowflake Warehouse? If I needed to join multiple tables with billions of rows, where does the compute happen in DuckLake? Can you run your own cluster like with Clickhouse or StarRocks? How does it scale horizontally with storage and compute? How do I update it? What if there’s a security flaw? How well does it stand up to 500 people querying it simultaneously and what type of setup would I need to achieve that?

The PMs that manage the IT platform team aren’t necessarily deeply familiar with all of the technical details. A compelling introduction to DuckLake would provide the answer to some of these questions in a way that the VPs or PMs could digest it easily while providing the technical details the data workers require. For better or worse, “data lakehouse” and data warehouse and data lake all are industry jargon that is pretty impenetrable to people who don’t spend a lot of time working with the tools but who cut checks and make decisions.

Comment by citguru 6 days ago

Hi,

DuckLake is great for the lakehouse layer and it's what we use in production. But there's a gap and thats what I'm trying to address with OpenDuck. DuckLake do solve concurrent access at the lakehouse/catalog level and table management.

But the moment you need to fall back to DuckDB's own compute for things DuckLake doesn't support yet, you're back to a single .duckdb file with exclusive locking. One process writes, nobody else reads.

OpenDuck sits at a different layer. It intercepts DuckDB's file I/O and replaces it with a differential storage engine which is append-only layers with snapshot isolation.

Comment by citguru 6 days ago

Yes, this is actually one of the core problems OpenDuck's architecture addresses.

The short version: OpenDuck interposes a differential storage layer between DuckDB and the underlying file. DuckDB still sees a normal file (via FUSE on Linux or an in-process FileSystem on any platform), but underneath, writes go to append-only layers and reads are resolved by overlaying those layers newest-first. Sealing a layer creates an immutable snapshot.

This gives you:

Many concurrent readers: each reader opens a snapshot, which is a frozen, consistent view of the database. They don't touch the writer's active layer at all. No locks contended.

One serialized write path: multiple clients can submit writes, but they're ordered through a single gateway/primary rather than racing on the same file. This is intentional: DuckDB's storage engine was never designed for multi-process byte-level writes, and pretending otherwise leads to corruption. Instead, OpenDuck serializes mutations at a higher level and gives you safe concurrency via snapshots.

So for your specific scenario — one process writing while you want to quickly inspect or query the DB from the CLI — you'd be able to open a read-only snapshot mount (or attach with ?snapshot=<uuid>) from a second process and query freely. The writer keeps going, new snapshots appear as checkpoints seal, and readers can pick up the latest snapshot whenever they're ready.

It's not unconstrained multi-writer OLTP (that's an explicit non-goal), but it does solve the "I literally cannot even read the database while another process has it open" problem that makes DuckDB painful in practice.

Comment by jeadie 6 days ago

This is exactly what we found. Ingest rates were tough. We partitioned and ran over multiple duckdb instances too (and wrangled the complexity).

We ending up building a Sqlite + vortex file alternative for our use case: https://spice.ai/blog/introducing-spice-cayenne-data-acceler...

Comment by 6 days ago

Comment by wenc 6 days ago

Try DuckLake. They just released a prod version.

You can do read/write of a parquet folder on your local drive, but managed by DuckLake. Supports schema evolution and versioning too.

Basically SQLite for parquet.

Comment by nehalem 7 days ago

I have a deep appreciation for DuckDB, but I am afraid the confluence of brilliant ideas makes it ever more complicated to adopt —- and DuckLake is another example for this trend.

When I look at SQLite I see a clear message: a database in a file. I think DuckDb is that, too. But it’s also an analytics engine like Polars, works with other DB engines, supports Parquet, comes with a UI, has two separate warehouse ideas which both deviate from DuckDB‘s core ideas.

Yes, DuckLake and Motherduck are separate entities, but they are still part of the ecosystem.

Comment by samansmink 6 days ago

that's a valid concern!

However I'd like to point out that that is exactly the reason why DuckDB relies so heavily on its extension mechanism, even for features that some may consider to be "essential" for an analytical system. Take for example the parquet, json, and httpfs extensions. Also features like the UI you mention are isolated from core DuckDB by living in an extension.

I'd argue that core DuckDB is still very much the same lightweight, portable, no-dependency system that it started out as (and which was very much inspired by how effective SQLite is by being so).

Maybe some interesting behind-the-scenes: to further solidify core DuckDB and guard it from the complexity of its ever growing extension ecosystem, one of the big items currently on our roadmap (see https://duckdb.org/roadmap) is to make significant improvements to DuckDB's stable C extension API.

disclaimer: I work at DuckDB Labs ;)

Comment by nehalem 6 days ago

Thank you for your thoughtful reply. The extension system makes great sense.

But it's also stuff like `"SELECT * FROM my_df"` – It's super cool but why is my database connecting to an in-memory pandas data frame? On the other hand, DuckDB can connect to remote Parquet files and interact with them without (explicitly) importing them.

In these examples, DuckDB feels more like an ephemeral SQL-esque Pandas/Polars alternative rather than a database.

Probably it's just me losing track of what a database is and we've evolved from "a monolithic and permanent thing that you store data on and read data from via queries".

Comment by swasheck 6 days ago

i think "SELECT * FROM my_df" is a convenience from the python module and how tightly integrated it is, but i can’t get this to replicate using the cli or dbeaver or datagrip.

and yes, being able to layer analytical sql on top of your csv/json/parquet/gpx/arrow (but not xml?) is the massive appeal of duckdb for a variety of reasons. it’s a paradigm shift for me as an old timer but it’s also suited my needs quite well over the past few years

Comment by atombender 6 days ago

How does this (or DuckLake for that matter) handle sparseness and fragmentation of the differential storage? My experience with B+trees, at least, is that pages get spread all over the place, so if you run a normal query, page 537 may be in layer 1, page 8374 in layer 2, and so on, and a single query might need hundreds or thousands of pages, too scattered to read efficiently in large sequential reads, but requiring a lot of random ones, which in turn means latency is very poor unless you aggressively cache. Neon deals with this through compaction and prewarming, I believe. Maybe DuckDB avoids this because column data tends to be more sequential, and something batches up bigger layers? Or maybe aggressive layer compaction?

Comment by pepperoni_pizza 6 days ago

I think the answer is "all of the above".

Columnar storage is very effectively compressed so one "page" actually contains a lot of data (Parquet rowgroups default to 100k records IIRC). Writing usually means replacing the whole table once a day or appending a large block, not many small updates. And reading usually would be full scans with smart skipping based on predicate pushdown, not following indexes around.

So the same two million row table that in a traditional db would be scattered across many pages might be four files on S3, each with data for one month or whatnot.

But also in this space people are more tolerant of latency. The whole design is not "make operations over thousands of rows fast" but "make operations over billions of rows possible and not slow as a second priority".

Comment by atombender 6 days ago

Good points. I don't have a lot of experience with DuckDB in a production setting, but my team uses ClickHouse, where we ingest log and instrumentation data into materialized views at high volume. What I think saves the segmented/layered architecture there (ClickHouse calls them parts, but it's fundamentally the same thing) is that it's append-only, which means the "layers" don't go backwards, and a single row will never appear in more than one layer. But with a B+tree, the entire tree is mutable.

Comment by szarnyasg 6 days ago

DuckLake does not use B+ trees and it handles fragmentation with techniques like partial files and compaction upon checkpointing. The developers of DuckLake talks about this here: https://youtu.be/7Su0aVzbb-U?t=689

(Disclaimer: I work at DuckDB Labs)

Comment by citguru 6 days ago

[dead]

Comment by citguru 7 days ago

This is an attempt to replicate MotherDucks differential storage and implement hybrid query execution on DuckDB

Comment by zurfer 7 days ago

As someone working in the field I have to admit that I'm not familiar with the terms differential storage nor do I really understand what hybrid execution means. Maybe you could describe it both from a simple technical point of view and what benefits it has to me as a user?

Comment by skeeter2020 6 days ago

from the post:

Differential storage

Append-only layers with PostgreSQL metadata. DuckDB sees a normal file; OpenDuck persists data as immutable sealed layers addressable from object storage. Snapshots give you consistent reads. One serialized write path, many concurrent readers.

Hybrid (dual) execution

A single query can run partly on your machine and partly on a remote worker. The gateway splits the plan, labels each operator LOCAL or REMOTE, and inserts bridge operators at the boundaries. Only intermediate results cross the wire.

Comment by decide1000 6 days ago

I built a distributed DuckDB setup using OpenRaft for state replication. Every node holds a full copy of the database. Writes go through Raft consensus, reads are local. It's more like etcd-with-DuckDB than MotherDuck-lite.

OpenDuck takes a different approach with query federation with a gateway that splits execution across local and remote workers. My use case requires every node to serve reads independently with zero network latency, and to keep running if other nodes go down.

The PostgreSQL dependency for metadata feels heavy. Now you're operating two database systems instead of one. In my setup DuckDB stores both the Raft log and the application data, so there's a single storage engine to reason about.

Not saying my approach is universally better. If you need to query across datasets that don't fit on a single machine, OpenDuck's architecture makes more sense. But if you want replicated state with strong consistency, Raft + DuckDB works very well.

Comment by citguru 6 days ago

[dead]

Comment by oulipo2 7 days ago

Seems cool! But would be nice to have some "real-world" use cases to see actual usage patterns...

In my case my systems can produce "warnings" when there are some small system warning/errors, that I want to aggregate and review (drill-down) from time to time

I was hesitating between using something like OpenTelemetry to send logs/metrics for those, or just to add a "warnings" table to my Timescaledb and use some aggregates to drill them down and possibly display some chunks to review...

but another possibility, to avoid using Timescaledb/clickhouse and just rely on S3 would be to upload those in a parquet file on a bucket through duckdb, and then query them from time to time to have stats

Would you have a recommendation?

Comment by citguru 6 days ago

The project is still fairly new and not close to production tbh.

I'd actually recommend the simplest option: just write them to Parquet on S3 and query with plain DuckDB. Or you could use Ducklake - https://ducklake.select/

Comment by throwatdem12311 6 days ago

There is no “real world” use case because it’s vibe coded slop.

Comment by jeadie 6 days ago

You might find https://github.com/apache/datafusion and https://github.com/datafusion-contrib/datafusion-federation of interest

Comment by citguru 6 days ago

Thanks for this, really enjoyed reading this and helps validate some of my personal thoughts

Comment by MisterTea 6 days ago

OT but: You joined in 2019, barely post anything, then suddenly in 2026 your comments are copy pasted LLM output. Why? Why don't you use your own voice and type with your own hands? Notice how all those copy pasta posts were nuked - for good reason - we don't like being insulted.

Comment by michael-wang 6 days ago

You joined in 2017, barely post anything, then suddenly in 2025/2026 2/3 of your posts are copy pasted links, 1 of which is dead and another is 10 years old. Why? Why don’t you use your own voice and type with your own hands? Why don’t you post something new and relevant that you made instead of attacking people who are posting entire code repos of interesting technology?

Comment by MisterTea 6 days ago

I call it suspicious activity.

Comment by youngbum 5 days ago

touché

Comment by Lucasoato 7 days ago

Last week I’ve sent my first PR in duckdb to support iceberg views in catalogs like Polaris! Let’s hope for the best :)

Comment by arpinum 6 days ago

I read the code. It's a good case study of one-shot output from AI when you ask it to replicate a SaaS product. This is probably better than most because MotherDuck has been open about their techniques to build the product.

Obviously not a production implementation.

Comment by throwatdem12311 6 days ago

There’s a million of these per day and I would never even think about using a single one of them near production data.

Show HN style posts have become completely worthless to me, everything now is just vibe coded cloud chasing slop.

Comment by 0xnadr 6 days ago

Neat DuckDB is already fast enough for most single-node workloads, so distributing it opens up some interesting use cases for larger datasets.

Comment by esafak 6 days ago

Would you use this instead of Spark and Clickhouse, supposing it worked? Is the idea that it pools local compute with a remote cluster?

Comment by prpl 6 days ago

Can you build most of this with Ray?