The perils of UUID primary keys in SQLite
Posted by emschwartz 4 days ago
Comments
Comment by blopker 4 days ago
For a single database, bigints are smaller and faster, with less footguns.
UUIDs can be nice for an opaque public ID, however I'd still prefer something like a Sqid for space and usability.
Comment by Fabricio20 4 days ago
But be careful!! Javascript WILL interpret your bigints as Number() and round them down because they are too big without telling you!!!
Famously seen by every snowflake user that has interacted with Javascript, quite an annoying problem.
Comment by silvestrov 3 days ago
Then it will always be a string and you will be free to change the format/type of the key in the future to UUID or whatever you like.
Comment by zmj 3 days ago
Comment by throw-the-towel 3 days ago
Why, you ask? Let's you have a number like 10,000 and you want to replace it with 20,000. You delete the leading 1, and boom! The number is now zero, and three of the digits are gone, and you'll have to retype them like you got no other things to do with your life.
Comment by myko 3 days ago
Comment by klysm 3 days ago
Comment by Terr_ 2 days ago
A similar horror story from PHP, which I discovered by diagnosing a test failure. (Or maybe it was in production? Long ago, can't remember.)
I think the code in question was for some kind of web auth, comparing random 32-character hexadecimal strings. PHP has a "feature" where its == operator falls back to trying certain strings as numbers... and that includes a version with scientific notation. (12000 == "12000" == "12e3")
Such a collision through bad comparison may seem unlikely, but there are two islands of higher odds: 0*10^X is zero for any X, and X*10^0 is one for any X. Finally, leading zeros can be included. ("0e1234" == "00000e1" and "1234e0" == "9e0000")
The fix was simply going to stricter ===, but it definitely reinforced my dislike of "loose" languages.
Comment by paulddraper 4 days ago
Node.js drivers will correctly read int64 as string or bigint, not number.
E.g. pg for PostgreSQL
Maybe there’s a buggy driver but I don’t know it.
Comment by Fabricio20 3 days ago
You can of course, change the api such that it does {"id": "1324535222364012585"} instead and voila, it will no longer try parsing it as number. Or the many other workarounds people have recommended above (like appending a prefix, or using a different encoding), but why is it trying to parse a number thats too big and instead of throwing it just rounds down without telling you????!
Comment by paulddraper 3 days ago
You seem to be talking about JSON. (Which technically has no limit on number size or precision, but in practice is float64.)
Comment by Piezoid 3 days ago
Comment by sheept 3 days ago
const json = '{ "a": 9007199254740993 }'
JSON.parse(json, (_key, value, context) => /^\d+$/.test(context.source) ? BigInt(context.source) : value)Comment by spiffytech 4 days ago
Comment by shakna 4 days ago
In JS - BigInt is 64bit integer.
In anything else - BigInt is a arbitrarily large integer.
Comment by anematode 4 days ago
Comment by mort96 3 days ago
Comment by sheept 3 days ago
const obj = { a: 9007199254740993n }
JSON.stringify(obj, (_key, value) => typeof value === 'bigint' ? JSON.rawJSON(value.toString()) : value)Comment by mort96 3 days ago
Comment by sheept 3 days ago
{"a":9007199254740993}
not {"a":"9007199254740993"}Comment by mort96 3 days ago
Comment by sheept 3 days ago
const json = '{ "a": 9007199254740993 }'
JSON.parse(json, (_key, value, context) => /^\d+$/.test(context.source) ? BigInt(context.source) : value)Comment by mort96 3 days ago
Comment by marcosdumay 3 days ago
Comment by Etheryte 4 days ago
Comment by JamesSwift 4 days ago
Comment by chrismorgan 3 days ago
All sequences use step 16.
Type A has discriminant/offset 0, yielding IDs {0, 16, 32, 48, 64, …}.
Type B has discriminant/offset 1, mapping to IDs {1, 17, 33, 49, 65, …}.
All the way up to Type P with discriminant/offset 15 and IDs {15, 31, 47, 63, 79, …}.
This is also trivially invertible so that you can determine the type from the ID.
A more common approach is to make IDs opaque strings and put a type prefix—A0, B12, P34, that kind of thing. But this way you can keep it as a number, if you wish.
Comment by throwawayo2oe 3 days ago
Comment by sgarland 2 days ago
Comment by pyuser583 4 days ago
Comment by mamcx 4 days ago
Comment by nickpeterson 4 days ago
Comment by sudoshred 4 days ago
Comment by 1659447091 2 days ago
Comment by usrnm 3 days ago
Comment by echoangle 3 days ago
Comment by pavo-etc 2 days ago
Comment by ErroneousBosh 3 days ago
Comment by beagle3 2 days ago
Comment by masklinn 4 days ago
It does not actually make it impossible to query the wrong table it just tells you quickly when you’ve done so.
Comment by willtemperley 3 days ago
Comment by sgarland 2 days ago
IME, most of the arguments for why UUIDs make things better are due to developer ignorance of RDBMS features (or B+tree performance).
Comment by bob1029 4 days ago
Inappropriate aliasing of integer keys allows for silent errors in queries because it will actually return some result a lot of the time. A UUID is immune to this problem. The model recognizes its mistake a lot more reliably when previously non-empty tables start showing up empty after attempting a join.
Comment by andersmurphy 4 days ago
Comment by Fire-Dragon-DoL 3 days ago
Comment by PUSH_AX 3 days ago
Comment by JamesSwift 2 days ago
Comment by crubier 4 days ago
Comment by adityaathalye 3 days ago
I read this post more as an illustration of the *value* of UUIDv7 as primary key, over integer primary keys, in lieu of minimal loss of read/write performance, and marginally more data on disk bloat.
SQLite's automatic integer rowID primary key is a no-brainer, when the SQLite application is local-only, such as application storage format (mobile and desktop). Or is never intended to grow beyond a single server instance. Basically, where each SQLite file is private to a singular instance of the application.
However, if there is even an outside chance of needing to cooperate across application instances, e.g. the minimal limit case of a personal knowledge base that should seamlessly sync across a person's devices, as well as a hosted service, then a high-quality sequential random ID starts to make a lot more sense. (No-brainer arbitrary table merges / splits / remerges, de-duplication, etc.)
Random ID primary key is a bad idea period, whether it be the UU kind or the SQ kind, or any other kind. As far as my DB knowledge goes, this class of ID destroys all tree-algorithms, and we are stuck with the fact that there is no practically better way, than an appropriate tree-structure, to group and organise a meaningful amount of data, efficiently and effectively.
Comment by andersmurphy 3 days ago
Comment by adityaathalye 3 days ago
cf. https://sqlite.org/withoutrowid.html
> The WITHOUT ROWID syntax is an optimization. It provides no new capabilities. Anything that can be done using a WITHOUT ROWID table can also be done in exactly the same way, and exactly the same syntax, using an ordinary rowid table. The only advantage of a WITHOUT ROWID table is that it can sometimes use less disk space and/or perform a little faster than an ordinary rowid table.
As of now, I am doing the following in my (Bitemporal data system) experiment (When will it see the light of day? Nobody knows.).
All data are globally uniquely identified by a UUIDv7. However all tables have `rowid` integer primary key asc (which is just an alias for SQLite's autoincrement int id). The `rowid` is the basis for joins, and is the foreign key reference. This lets me offload some useful disambiguation work to the DB as well as have it enforce global (across data systems) record uniqueness guarantees, while retaining local (within process) query efficiency by retaining the ability to use integer rowids.
While the idealised insert performance in your bench is indeed mind-boggling, the DB Schema isn't doing anything CPU-intensive during inserts (checks, constraints, triggers etc.). My schema / query pattern yields comparatively meagre throughput, but I am happy with the ballpark it has landed in, given all the work I'm making SQLite do for me on each `assert!` and `redact!`.
cf. my dirty-but-useful-enough bench, with production-like record content:
A poor man's napkin-mathy, append-only SQLite write/read benchmark
https://gist.github.com/adityaathalye/3c8195dc70626b33c23867...
Summary:
;; Okay, I think I can live with this...
;; - "facts" table: 12M+ records
;; - single process writes to it
;; - ~ 400 transactions/second
;; - append-only table, enforced via SQLite "before" triggers
;; - "now" table:
;; - updates on every assert/redact on "facts" table, via triggers
;; - currently at "limit case": for each read it is empty, or very small, because writes do back-to-back assert/redact of the same fact
;; - gets reads from two reader threads (evenly split)
;; - ~41,000 reads/second
;; - all reads are concurrent with writes (poor man's futures)Comment by adityaathalye 3 days ago
Thanks to its oh so convenient automatic integer rowIDs, I believe one can amortise some of the other overheads of UUIDv7s for "in-between" queries, viz. indices, joins, ctes, virtual tables etc., with appropriate schema / query design.
Comment by jdthedisciple 3 days ago
UUID v7 so far seems like the best solution if you want UUID benefits and ordering.
Comment by scotty79 3 days ago
Why would you force database to order rows on the drive according to random id?
Comment by chromatin 3 days ago
Comment by scotty79 2 days ago
I never wished, gee, why didn't I use integer key. But so many times I wished I used uuid because eventually your data rows are going to need to have identity that is not local to this specific database instance.
Comment by yepyoukno 4 days ago
Comment by themafia 4 days ago
I disagree. I tried this once. Now you need a client access layer to touch the DB in any context. All your console tools no longer work well or at all. If they show up in URLs you need to deoptimize them for transport.
You give up a lot of convenience for this optimization. You should be absolutely sure your design requires it before using it.
Comment by JSR_FDED 4 days ago
Comment by yepyoukno 4 days ago
Comment by dexterdog 4 days ago
Comment by littlecranky67 3 days ago
Comment by voakbasda 3 days ago
Comment by Volundr 3 days ago
Comment by dexterdog 3 days ago
Comment by everforward 3 days ago
Eg I save a date 18 months in the future in US Western time at 0600. 3 months from now the US gets rid of Daylight Savings. You saved it as 1400 UTC (which was correct), which is now an hour off because the local time zone is different.
Encoding it in local time fixes that, because it delays the TZ conversion until you need it and your times stay correct as long as you update tzdb like once a month.
Comment by littlecranky67 3 days ago
Comment by cenamus 4 days ago
Comment by antihero 3 days ago
Why would you store it as as str column and not the inbuilt type for this?
https://www.postgresql.org/docs/current/datatype-uuid.html
If you are using SQLite well I guess that doesn't work.
Comment by sedatk 3 days ago
So, UUIDv4 as a PK on a clustered index can be perfectly feasible for cases where you want to avoid disclosing stuff and row insertion performance isn’t that important.
Comment by andersmurphy 4 days ago
Comment by ysleepy 3 days ago
I get saving 8 bytes per row seems attractive, but the tradeoff is not explained.
Comment by andersmurphy 3 days ago
Comment by keynha 3 days ago
The tradeoff is what the benchmark is hitting. Once the table is physically ordered by the key, a random v4 scatters every insert across the tree and you pay for the page splits. A plain rowid table keeps that churn in the secondary index, which is just the key plus a rowid, while the table itself stays append-ordered. So it only really pays off when the key is something you look up directly and is roughly sequential, which is why v7 comes back near baseline.
Comment by bambax 3 days ago
Comment by elcomet 3 days ago
Comment by victorbjorklund 3 days ago
Comment by pyuser583 4 days ago
Comment by ItsBob 3 days ago
The Integer id is used for joins and looks ups and such but that's it. If I need to send anything to the frontend or outside of the app/DB then that's the UUID.
Comment by gvkhna 2 days ago
Comment by ItsBob 1 day ago
The Guid is purely for an external system to grab onto something that I can tie back to an actual row in the database but the external system does not need to know anything about the backend other than <guid>.
Comment by w10-1 4 days ago
How much trouble does SQLite reysing rowid's actually cause?
Comment by andersmurphy 4 days ago
Regular rowids are definitely the way to go if you can use them.
Comment by dumbledorf 4 days ago
Comment by kg 4 days ago
Comment by andersmurphy 4 days ago
Comment by JSR_FDED 4 days ago
Comment by smitty1e 4 days ago
Comment by KPGv2 4 days ago
(defonce db
(d/init-db! "db/db.db"
{:pool-size 4 :pragma {:synchronous "FULL"}}))
That's writing to disk.Comment by andersmurphy 4 days ago
There's only one index so there's no real write amplification. The numbers will go down as you add more data and indexes.
Comment by ac50hz 3 days ago
Comment by gvkhna 2 days ago
But a Url62 as a url safe public id from the pk is simple and straightforward to use and comes with few risks of leak issues. Wish postgres had native base62 encoding for url62 now that it has uuidv7 native.
Comment by kjgkjhfkjf 3 days ago
Comment by andersmurphy 3 days ago
Comment by cropcirclbureau 3 days ago
Comment by andersmurphy 3 days ago
Contention and coordination are real killers, concurrent writes (that require coordination like postgres) often underdeliver.
Comment by michaelcampbell 3 days ago
Comment by andersmurphy 3 days ago
Comment by u1hcw9nx 3 days ago
Also INSERT speed instead of SELECT? Typically most time is spend in SELECT or UPDATE.
Comment by andersmurphy 3 days ago
Comment by knightops_dev 3 days ago
Comment by wood_spirit 4 days ago