SQLite JSON at full index speed using generated columns
Posted by upmostly 3 days ago
Comments
Comment by eliasdejong 3 days ago
Disclaimer: I am working on this.
Comment by c-cube 8 hours ago
[1]: https://lite3.io/design_and_limitations.html#autotoc_md31
[2]: https://github.com/couchbase/fleece/
Comment by conradev 3 days ago
Comment by eliasdejong 3 days ago
1) Rkyv uses a binary tree vs. Lite³'s B-tree (B-trees are more cache- and space-efficient).
2) Rkyv is immutable once serialized. Lite³ allows for arbitrary mutations on serialized data.
3) Rkyv is Rust only. Lite³ is a 9.3 kB C library free of dependencies.
4) Rkyv as a custom binary format is not directly compatible with other formats. Lite³ can be directly converted to/from JSON.
I have not benchmarked Lite³ against Rust libraries, though it would be an interesting experiment.
Comment by conradev 2 days ago
If you added support for running jq natively, that would be very cool. Lite³ brings the B-trees, jq brings the query parser and bytecode, combined, you get SQLite :P
Comment by eliasdejong 2 days ago
I thought about implementing something like jq or a JSON query language, and this is very possible. It would be like sending a mini-database that can be queried thousands of times faster than any JSON library can parse.
One interesting effect of being a zero-copy format is that the 'parsing speed' can exceed the memory bandwidth of the CPU, since to fulfill a query you do not actually need to parse the entire dataset. You only walk the branches of the tree that are actually required.
I've talked to some other people that have also shown interest in this idea. There doesn't really seem to exist a good schemaless single-file format that supports advanced queries. There is only SQLite and maybe HDF5.
Comment by conradev 1 day ago
- Microcontrollers. I find myself reaching for https://github.com/siara-cc/sqlite_micro_logger_c/tree/maste... because SQLite is just too big
- Shared memory regions/mapped files. Use it to share state between processes. Could you make mutations across processes/threads lock-free?
- Caching GPU-friendly data (i.e. image cache). I'm not sure if the current API surface/structure is page alignment friendly
Comment by eliasdejong 1 day ago
SQLite when compiled for size is 590 kB. But I think a full jq database implementation based on Lite³ would be possible under 100 kB.
Lock-free shared state relies on algorithms that make clever use of atomic instructions. But you should not have threads write to the same memory regions, because the hardware only allows one core to hold a cache line in a writeable state. If another core attempts a write to the same line, this immediately invalidates all the other copies. Under high contention the coherency penalty becomes so large that throughput falls through the floor. So basically the algorithms need to do most of the work in separate memory regions, then occasionally coordinate by 'committing' their work via a spinlock or similar.
Lite³ implements some settings for node alignment, but not for user data. It would be possible to create a bytes type with extra alignment guarantees.
Comment by cryptonector 2 days ago
Comment by namibj 2 days ago
2) No, you just don't get to move data freely.
3) I don't believe JSON has any place in a system that needs C because it can't handle Rust.
4) JSON can't handle non-tree structures, and it's also very limited in expressivity. Rkyv is more of a code generator, akin to ASN.1.
Happy benchmarking; feel free to use the rkyv benchmark tooling, and make sure you have enough link-time optimization going on.
Comment by cryptonector 2 days ago
How does Lite^3 compare to PG's JSONB? PG's JSONB is also a serialized, indexed data structure. One of the key things about JSONB is that for arrays (and hence objects) it encodes lengths first, then the values, but every so many elements (32 is the default, IIRC) it encodes an offset. The reason for this design is that when they encoded only offsets, the result did not compress well (and if you think about it, it will be obvious why). The price they pay is that finding the offset to the nth element's value requires first finding the last entry before n that has a stored offset, then adding the lengths of the entries in between (e.g., to reach element 70 you start from the stored offset at element 64 and sum six lengths). This way you get a tunable parameter for trading off speed against compressibility.
EDIT: Ok, I've looked at the format. Some comments:
- Updating in place is cool, but you need to clear replaced, unused data in case it's sensitive, and unless you re-encode you will use up more and more space -- once in a while you need a "vacuum". Though vacuuming a Lite^3 document is quite simple: just traverse the data structure and write out a new version, which will naturally come out vacuumed.
- On the whole I like Lite^3 quite a bit. Very clever.
- JSONB is also indexed as encoded, but IIUC it's not in-place updateable (unless the new items are the same length as the old) without re-encoding. Though I can imagine a way to tombstone old values and replace them with offsets into appended data, then the result would also need a "vacuum" once in a while.
- I'm curious about compressibility. I suspect not having long runs of pointers (offsets) helps, but still I suspect JSONB is more compressible.
I love the topic of serialization formats, and I've been thinking for some time about ASN.1 compilers (since I maintain one). I've wanted to implement a flatbuffers / JSONB style codec for ASN.1 borrowing ideas from OER. You've given me something to think about! When you have a schema (e.g., an ASN.1 module) you don't really need a B-tree -- the encoded data, if it's encoded in a convenient way, is the B-tree already, but accessing the encoded data by traversal path rather than decoding into nice in-memory structures sure would be a major improvement in codec performance!
Comment by eliasdejong 2 days ago
Another difference is that JSONB is immutable. Suppose you need to replace one specific value inside an object or array. With JSONB, you would rewrite the entire JSONB document as a result of this, even if it is several megabytes large. If you are performing frequent updates inside JSONB documents, this will cause severe write amplification. Despite the fact that offsets are grouped in chunks of 32, Postgres still rewrites the entire document. This is the case for all current Postgres versions.
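For example (a minimal sketch with a hypothetical events table): even a single-key update rewrites the entire stored value:
UPDATE events SET data = jsonb_set(data, '{status}', '"done"') WHERE id = 42;
-- Postgres re-encodes and rewrites the whole "data" value here, even if it is megabytes large.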
On the other hand, Lite³ supports replacing individual values, where ONLY the changed value needs updating. For this to work, you need separate offsets. Postgres makes a tradeoff where they gain some size benefits but as a result become completely read-only. This is the case in general for most types of compression.
Also JSONB is not suited to storing binary data. The user must use a separate bytea column. Lite³ directly implements a native bytes type.
JSONB was designed to sacrifice mutability in favor of read performance, but despite this, I still expect Lite³ to exceed it at read performance. Of course it is hard to back this up without benchmarks, but there are several reasons:
1) JSONB performs runtime string-comparison loops to find keys. Lite³ compares fixed-size hash digests, with the hashes computed at compile time.
2) JSONB must do 'walking back' because of the 32-grouped offset scheme.
3) Lite³ has none of the database overhead.
Again, the two formats serve different purposes; this is comparing just the raw byte layouts.
Comment by cryptonector 2 days ago
I agree that Lite³ is almost certainly better than JSONB on every score except compressibility, but when Lite³ is your database format then that doesn't matter (you can always compress large string/blob values if need be). Compressibility might matter for interchange however, but again, if your messages are huge chances are there are compressible strings in them, or if they're not huge then you probably don't care to compress.
Comment by nh2 1 day ago
It'd be nice to update postgres JSON values without the big write amplification.
Comment by namibj 2 days ago
Comment by the_duke 3 days ago
Comment by gritzko 2 days ago
Comment by srameshc 3 days ago
Loading data into DuckDB is super easy, I was surprised:
SELECT avg(sale_price), count(DISTINCT customer_id) FROM '/my-data-lake/sales/2024/*.json';
and you can also load into a JSON-type column and use Postgres-style syntax: col->>'$.key'
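For instance (a minimal sketch; table and key names are made up):
CREATE TABLE raw (doc JSON);
INSERT INTO raw VALUES ('{"customer_id": 42}');
SELECT doc->>'$.customer_id' FROM raw; -- extracts the value as text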
Comment by loa_observer 3 days ago
But I would say comparing DuckDB and SQLite is a little unfair. I would still use SQLite to build systems in most cases, and DuckDB only for analytics. You can hardly make a smooth deployment if your app contains DuckDB across a lot of platforms.
Comment by trueno 3 days ago
Someone should smush SQLite + DuckDB together and do that kind of switching depending on query type.
Comment by mikepurvis 3 days ago
Comment by NortySpock 3 days ago
That being said, it would be trivial to tweak the above script into two steps, one reading data into a DuckDB database table, and the second one reading from that table.
Comment by lame_lexem 3 days ago
Comment by hawk_ 3 days ago
Comment by jelder 3 days ago
Comment by craftkiller 3 days ago
CREATE INDEX idx_status_gin
ON my_table
USING gin ((data->'status'));
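A containment query can then use that index, e.g. (a sketch; assumes the status values are JSON strings):
SELECT * FROM my_table WHERE data->'status' @> '"active"';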
ref: https://www.crunchydata.com/blog/indexing-jsonb-in-postgres
Comment by jelder 3 days ago
Comment by craftkiller 3 days ago
Comment by cies 3 days ago
Comment by morshu9001 3 days ago
Comment by a-priori 3 days ago
If you can get away with a regular index on either a generated column or an expression, then you absolutely should.
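For example, in SQLite (a sketch; table and column names are made up):
ALTER TABLE events ADD COLUMN type TEXT GENERATED ALWAYS AS (json_extract(data, '$.type')) VIRTUAL;
CREATE INDEX idx_events_type ON events(type);
SELECT * FROM events WHERE type = 'click'; -- the planner just sees an ordinary indexed column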
Comment by ramon156 3 days ago
For example, if you want to store settings as JSON, you first have to parse it through e.g. Zod and hope it doesn't fail due to schema changes (or write migrations and hope those succeed).
Meanwhile, a simple key/value row just works fine, and you can even do partial fetches/updates.
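e.g. (a sketch with made-up names):
CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT);
SELECT value FROM settings WHERE key = 'theme'; -- partial fetch
UPDATE settings SET value = 'dark' WHERE key = 'theme'; -- partial update, no parse/validate cycle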
Comment by jelder 3 days ago
Comment by mickeyp 3 days ago
Comment by morshu9001 3 days ago
Comment by sigwinch 3 days ago
Comment by jasonthorsness 3 days ago
Comment by upmostly 3 days ago
Comment by kevinsync 3 days ago
Lesson learned: even if you know your tools well, periodically go check out updated docs and see what's new, you might be surprised at what you find!
Comment by daotoad 2 days ago
Comment by tracker1 3 days ago
That said, this is pretty much what you have to do with MS-SQL's limited support for JSON before 2025 (v17). Glad I double checked, since I wasn't even aware they had added the JSON type to 2025.
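The pre-2025 workaround looks roughly like this (a sketch with hypothetical names):
ALTER TABLE docs ADD status AS JSON_VALUE(data, '$.status');
CREATE INDEX idx_docs_status ON docs(status);
-- the computed column is indexable because JSON_VALUE is deterministic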
Comment by advisedwang 3 days ago
Comment by selimthegrim 2 days ago
Comment by tracker1 7 hours ago
Most of the JSON functions added in (IIRC) MS-SQL 2016 performed really poorly, which is a significant reason why denormalized JSON data was used very sparingly... with an actual JSON data type (assuming a binary deserialized form of storage), queries and operations against the underlying data structure can run significantly faster.
I've been pretty critical of it since I tried using it for a few things a few years ago... it still worked well enough for the needs of what it was doing, but I'm glad that it's doing better.
For reference, what it was being used for was to semi-normalize most stored procedures to receive two arguments and return two, all JSON... the first argument would be the claims portion of the JWT for the service, the second a serialized typed request object representing the request to the service, and the two results were the natural results of the sproc plus an error result if an error occurred. This allowed for a very simplified API surface (basically 4 utility methods being used for all API calls). In the project in question it was a requirement for data logic to live inside the database, of which I'm not a fan, but it worked out pretty well for what it was. Other issues notwithstanding.
Comment by Lex-2008 3 days ago
i.e. something like this: CREATE INDEX idx_events_type ON events(json_extract(data, '$.type'))?
I guess the caveat here is that a slight change in JSON path syntax (can't think of one right now) can cause SQLite not to use this index, while with explicitly specified virtual generated columns you're guaranteed to use the index.
Comment by pkhuong 3 days ago
Comment by fny 3 days ago
It's pretty fragile...
-- Just changing the quoting
select * from events where json_extract(data, "$.type") = 'click';
-- Changing the syntax
select * from events where data -> '$.type' = 'click';
Basically anything that alters the text of an expression within the WHERE clause.
Comment by johnmaguire 3 days ago
Comment by paulddraper 3 days ago
You need to ensure your queries match your index, but when isn’t that true :)
Comment by 0x457 3 days ago
When you write another query against that index a few weeks later and forget about the caveat, that slight change in the WHERE clause will bypass the index.
Comment by WilcoKruijer 3 days ago
> The ability to index expressions was added to SQLite with version 3.9.0 (2015-10-14).
So this is a relatively new addition to SQLite.
Comment by debugnik 3 days ago
Comment by Lex-2008 3 days ago
Comment by bambax 2 days ago
> So, thanks bambax!
You're most welcome! And yes, SQLite is awesome!!
Comment by kristianp 2 days ago
Comment by ellisv 3 days ago
It's much harder to set up proper indexes and enforce constraints, and it adds overhead every time you actually want to use the data.
Comment by nh2 3 days ago
* The data does not map well to database tables, e.g. when it's a tree structure (of course that could be represented as many table rows too, but it's complicated and may be slower when you always need to operate on the whole tree anyway; see the sketch after this list)
* your programming language has better types and programming facilities than SQL offers; for example in our Haskell+TypeScript code base, we can conveniently serialise large nested data structures with 100s of types into JSON, without having to think about how to represent those trees as tables.
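On the tree point, for example (a sketch; names are made up):
CREATE TABLE configs (id INTEGER PRIMARY KEY, tree TEXT); -- whole tree as one JSON value
SELECT json_extract(tree, '$.ui.theme.color') FROM configs WHERE id = 1;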
Comment by cies 3 days ago
I find this one of the hardest parts of using JSON, and the main reason why I'd rather put it in proper columns. Once I go JSON I need a fair bit of code to deal with migrations (either doing them as explicit migration steps, or some way to do them at read/write time).
Comment by nh2 3 days ago
Comment by kccqzy 3 days ago
Comment by tracker1 3 days ago
Another example is a classifieds website, where your extra details for a Dress are going to be quite a bit different than the details for a Car or Watch. But, again, you don't necessarily want to inflate the table structure for a fully normalized flow.
If you're using a concretely typed service language it can help. C# does a decent job here. But even then, mixing in Zod with Hono and OpenAPI isn't exactly difficult on the JS/TS front.
Comment by dzonga 3 days ago
trying to map everything in a relational way, etc. - you're in a world of pain
Comment by crazygringo 3 days ago
But the more complex it is, the more complex the relational representation becomes. JSON responses from some APIs could easily require 8 new tables to store the data, with lots of arbitrary new primary keys and lots of foreign key constraints, and your queries will be full of JOINs that need proper indexing set up...
Oftentimes it's just not worth it, especially if your queries are relatively simple, but you still need to store the full JSON in case you need the data in the future.
Obviously storing JSON in a relational database feels a bit like a Frankenstein monster. But at the end of the day, it's really just about what's simplest to maintain and provides the necessary performance.
And the whole point of the article is how easy it is to set up indexes on JSON.
Comment by jasonthorsness 3 days ago
Comment by whizzter 3 days ago
Typical example is a price-setting product I work on.. there's price ranges that are universal (and DB columns reflect that part) but they all have weird custom requests for pricing like rebates on the 3rd weekend after X-mas (but only if the customer is related to Uncle Rudolph who picks his nose).
Comment by fauigerzigerk 3 days ago
There's no reason to put all those extra fields in the same table that contains the universal pricing information.
Comment by whizzter 13 hours ago
I'll give a comparison.
JSON
- We have some frontend logic/views (which can be feature-flagged per customer) to manage updating the data, which otherwise mostly tags along as a dumb "blob" (auto-expanded to a regular part of the JSON object maps/arrays at the API boundary, making frontend work easier: objects on the frontend, "blobs" on the backend/db)
- Inspecting specific cases (most of the time it's just null data) is just a matter of copying out and formatting the special data.
- If push comes to shove, all modern databases support JSON queries, so you can pick out specifics IF needed (this has happened once or twice with larger customers over the years).
- We read and apply the rules when calculating prices with a "plugin system"
DB Schema (extra tables)
- Now you have to wade through lots of customer-specific tables just to find the tables that take most of the work-time (customer specifics are seldom what needs work once set up). We already have some older customer-specific stuff from the early days (I'm happy it hasn't happened much lately).
- Those _very_ few times you actually need to inspect the specific data by query, you might win here (but as mentioned above, JSON queries have always solved it).
- Loading the universal info now needs to query X extra tables (even though 90%-95% of the data has no special cases).
- Adding new operations on prices, like copying etc., now needs logic for each customer-specific table to make the data tag along properly.
- "properly" modelled this reaches the API layer as well
- Frontend specialization is still needed
- Calculating prices still needs its customization.
I don't really see how my life would have been better managing all the extra side-effects of bending the code to suit these weird customer requests (some from companies that aren't customers anymore) when 90-95% of the time the data isn't used, and it's seldom touched once customers mature.
I do believe in the rule of three: if the same thing pops up three times, I consider whether it needs to be graduated to more "systematic" code. Often, when you abstract after seeing something only twice, it never appears again, leaving you with an abstraction to maintain.
JSON columns, like entity-attribute-value tables or goto statements, have real downsides and shouldn't be plonked in without reason. But hell if I'd want to work with overly complex schemas/models because people started putting special cases into core pieces of code just because they heard a technique was bad.
Comment by konart 3 days ago
Comment by verytrivial 3 days ago
Comment by AlexErrant 3 days ago
Comment by garaetjjte 3 days ago
Comment by MyOutfitIsVague 3 days ago
Comment by rini17 3 days ago
Comment by simonw 3 days ago
Comment by hamburglar 3 days ago
Comment by upmostly 3 days ago
Comment by upmostly 3 days ago
Edit: This should now be fixed for you.
Comment by rrmdp 3 days ago
Comment by Seattle3503 3 days ago
Comment by pawelduda 3 days ago
Comment by dmezzetti 3 days ago
You can do the same with DuckDB and Postgres too.
Comment by focusgroup0 3 days ago
Comment by zffr 3 days ago
That migration would be making two changes: document-based -> relational, and server -> library.
Have you considered migrating to Postgres instead? By using another DB server you won't need to change your application as much.
Comment by focusgroup0 3 days ago
Comment by rglynn 2 days ago
Comment by mcluck 3 days ago
Comment by meindnoch 3 days ago
Why?
Comment by hamburglar 3 days ago
Comment by ralferoo 3 days ago
But I suspect with JSON the overhead of parsing it each time might make it more efficient to update all the indices with every insert.
Then again, it's probably quicker still to insert the raw rows into a temporary table in memory and then insert all of the new rows into the indexed table as a single query.
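Something like (a sketch; names are made up):
CREATE TEMP TABLE staging (data TEXT);
-- bulk-load the raw rows here, then hit the indexed table once:
INSERT INTO events(data) SELECT data FROM staging;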
Comment by hiccuphippo 3 days ago
Comment by meindnoch 3 days ago
No, in section 2 the table is created afresh. All 3 sections start with a CREATE TABLE.
Comment by hiccuphippo 3 days ago
Comment by upmostly 3 days ago
Comment by moregrist 3 days ago
Comment by oars 3 days ago
Comment by zackify 2 days ago
Particularly with Drizzle, it means I can use SQLite on-device with expo-sqlite, store our data format in a single field with very little syntax, and the schema and queries all become fully type-safe.
Also being able to use the same light orm abstraction server side with bun:sqlite is huge.
Comment by rcarmo 3 days ago
Comment by morshu9001 3 days ago
Comment by bushbaba 3 days ago
Comment by bilekas 3 days ago
It's a feature, not a replacement.
Comment by mring33621 3 days ago
Comment by kwillets 3 days ago
Comment by xp84 3 days ago
Comment by N_Lens 3 days ago
Comment by baq 3 days ago
Comment by javantanna 3 days ago
Comment by pipe01 3 days ago
Comment by stacktraceyo 3 days ago
Comment by maxpert 3 days ago
But I guess this technique is very common now.
Comment by jb_rad 3 days ago
Comment by groundzeros2015 3 days ago
What?
Comment by qwertox 3 days ago