Avoid UUIDv4 Primary Keys
Posted by pil0u 4 hours ago
Comments
Comment by vintermann 2 hours ago
Permanent identifiers should not carry data. This is like the cardinal sin of data management. You always run into situations where the thing you thought, "surely this never changes, so it's safe to squeeze into the ID to save a lookup". Then people suddenly find out they have a new gender identity, and they need a last final digit in their ID numbers too.
Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!
Comment by barrkel 1 hour ago
Random vs time biased uuids are not a decision to shave off ms that you will regret.
Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.
Comment by bri3d 51 minutes ago
What? The first 48 bits of an UUID7 are a UNIX timestamp.
Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.
I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).
Comment by mkleczek 2 hours ago
The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
Comment by vrighter 43 minutes ago
Comment by tacone 2 hours ago
I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want to have a way to reconciliate, you have space or speed constraints, so you may accept the trade off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
Comment by mckirk 2 hours ago
Comment by hyperpape 2 hours ago
Comment by vintermann 1 hour ago
I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
Comment by mcny 1 hour ago
So, the fact that there is a date in the uuidv7 does not extend any meaning or significance to the record outside of the database. To infer such a relationship where none exists is the error.
Comment by vintermann 44 minutes ago
I bet people will extract that date and use it, and it's hard to imagine use which wouldn't be abuse. To take the example of a PN/SSN and the usual gender bit: do you really want anyone to be able to tell that you got a new ID at that time? What could you suspect if a person born in 1987 got a new PN/SSN around 2022?
Leaks like that, bypassing whatever access control you have in your database, is just one reason to use real random IDs. But it's even a pretty good one in itself.
Comment by mcny 36 minutes ago
Thank you for spelling it for me. For the readers, It leaks information that the person is likely not a natural born citizen. The assumption doesn't have to be a hundred percent accurate, There is a way to make that assumption And possibly hold it against you.
And there are probably a million ways that a record created date could be held against you If they don't put it in writing, how will you prove They discriminated against you.
Thinking... I don't have a good answer to this. If data exists, people will extract meaning from it whether rightly or not.
Comment by anamexis 29 minutes ago
And when working with very large datasets, there are very significant downsides to large, completely random IDs (which is of course what the OP is about).
Comment by hyperpape 38 minutes ago
Comment by barrkel 1 hour ago
Comment by zamadatix 35 minutes ago
Comment by anamexis 50 minutes ago
Comment by oncallthrow 2 hours ago
Comment by vintermann 1 hour ago
Comment by sgarland 33 minutes ago
Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.
> A prime example of premature optimization.
Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.
> Don't make decisions you will regret just to shave off a couple of milliseconds!
A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.
Postgres suffers a different but similar problem with its visibility map lookups.
Comment by hnfong 1 hour ago
Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.
Comment by oblio 1 hour ago
Someone should have told Julius Caesar and Gregory XIII that :-p
Comment by gwbas1c 21 minutes ago
In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.
Comment by benterix 3 hours ago
Comment by mort96 3 hours ago
Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).
Comment by throw0101a 2 hours ago
See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":
* https://github.com/stateless-me/uuidv47
* Sept 2025: https://news.ycombinator.com/item?id=45275973
Comment by wongarsu 1 hour ago
UUIDv47 might have a space if you need keys generated on multiple backend servers without synchronization. But it feels very niche to me.
Comment by nbadg 3 hours ago
I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only require the index once, when constructing the query from the user-supplied data, and everything else operates with the better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.
Although, come to think of it, another advantage of a user-visible v4 with v7 Pk is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.
Comment by cdmckay 3 hours ago
Comment by mort96 3 hours ago
Comment by nish__ 2 hours ago
Comment by mort96 2 hours ago
Comment by nish__ 2 hours ago
Comment by 0x3f 2 hours ago
Comment by bruce511 2 hours ago
For one example, say you were making voting-booth software. You really don't want a (hidden) timestamp attached to each vote (much less an incrementing id) because that would break voter confidentiality.
More generally, it's more a underlying principle of data management. Not leaking ancillary data is easier to justify than "sure we leak the date and time of the record creation, but we can't think of a reason why that matters."
Personally I think the biggest issue are "clever" programmers who treat the uuid as data and start displaying the date and time. This leads to complications ("that which is displayed, the customer wants to change"). It's only a matter of time before someone declares the date "wrong" and it must be "fixed". Not to mention time zone or daylight savings conversions.
Comment by natch 1 hour ago
This doesn’t even rely on your system’s built-in RNG being low quality. It could be audited and known to avoid such issues but you could have a compromised compiler or OS that injects a doctored RNG.
Comment by kreetx 3 hours ago
Comment by saaspirant 2 hours ago
Comment by JetSetIlly 1 hour ago
However, because the online ordering system assigned order numbers sequentially, it would have been trivial for that company to determine how important their business was.
For example, over the course of a month, they could order something at the start of the month and something at the end of the month. That would give them the total number of orders in that period. They already know how many orders they have placed during the month, so company_orders / total_orders = percentage_of_business
It doesn't even have to be accurate, just an approximation. I don't know if they figured out that they could do that but it wouldn't surprise me if they had.
Comment by 0x3f 1 hour ago
Comment by robertlagrant 3 hours ago
Comment by lwhi 3 hours ago
Comment by Bombthecat 2 hours ago
Comment by dboreham 57 minutes ago
Comment by formerly_proven 2 hours ago
(What do you think Youtube video IDs are?)
Comment by enz 1 hour ago
I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/
I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?
Comment by benterix 1 hour ago
Comment by conradfr 53 minutes ago
Comment by Retr0id 1 hour ago
Comment by pdimitar 2 hours ago
I actually haven no idea. What are they?
(Also what is the format of their `si=...` thing?)
Comment by dotancohen 3 hours ago
> Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.
Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered - and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.Comment by kreetx 3 hours ago
Comment by dotancohen 3 hours ago
But the author does not say timestamp ordering, he says ordering. I think he actually means and believes that there is some problem ordering UUIDv4.
Comment by kreetx 3 hours ago
But if you need UUID-based lookup, then you might as well have it as a primary key, as that will save you an extra index on the actual primary key. If you also need a date and the remaining bits in UUIDv7 suffice for randomness, then that is a good option too (though this does essentially amount to having a composite column made up of datetime and randomness).
Comment by torginus 2 hours ago
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed dbs probably make this hard, and they can't generate a key on client.
There must be something in Postgres that wants to store the records in PK order, which while could be an okay default, I'm pretty sure you can this behavior, as this isn't great for write-heavy workloads.
Comment by torginus 2 hours ago
Comment by K0nserv 2 hours ago
Comment by sgarland 22 minutes ago
Comment by dagss 2 hours ago
Accessing data in totally random locations can be a performance issue.
Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.
Comment by dev_l1x_be 2 hours ago
Comment by sagarm 47 minutes ago
Comment by hans_castorp 1 hour ago
Comment by sgarland 24 minutes ago
Comment by crest 2 hours ago
Comment by xandrius 3 hours ago
If you're using latest version of PG, there is a plugin for it.
That's it.
Comment by sbuttgereit 3 hours ago
"Recommendation: Stick with sequences, integers, and big integers"
After that then, yes, UUIDv7 over UUIDv4.
This article is a little older. PostgreSQL didn't have native support so, yeah, you needed an extension. Today, PostgreSQL 18 is released with UUIDv7 support... so the extension isn't necessary, though the extension does make the claim:
"[!NOTE] As of Postgres 18, there is a built in uuidv7() function, however it does not include all of the functionality below."
What those features are and if this extension adds more cruft in PostgreSQL 18 than value, I can't tell. But I expect that the vast majority of users just won't need it any more.
Comment by tmountain 3 hours ago
Comment by sgarland 19 minutes ago
Comment by zwnow 2 hours ago
Comment by sbuttgereit 2 hours ago
And for those using ints as keys... you'd be surprised how many databases in the wild won't come close to consuming that many IDs or are for workloads where that sort of volume isn't even aspirational.
Now, to be fair, I'm usually in the UUID camp and am using UUIDv7 in my current designs. I think the parent article makes good points, but I'm after a different set of trade-offs where UUIDs are worth their overhead. Your mileage and use-cases may vary.
Comment by zwnow 1 hour ago
Comment by hans_castorp 1 hour ago
Comment by ivan_gammel 2 hours ago
Comment by sgarland 15 minutes ago
Integer PKs were seen as fine for years - decades, even - before the rise of UUIDs.
Comment by sagarm 44 minutes ago
Comment by ivan_gammel 42 minutes ago
Comment by K0nserv 2 hours ago
0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g
1: https://github.com/postgres/postgres/blob/master/src/backend...
Comment by grugdev42 2 hours ago
id => 123, public_id => 202cb962ac59075b964b07152d234b70
There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.
Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!
Why do we developers like to overcomplicate things? ;)
Comment by Retr0id 1 hour ago
Comment by Denvercoder9 2 hours ago
Comment by sgarland 12 minutes ago
For Aurora MySQL, it just makes it worse either way, since there’s no change buffer.
Comment by hnfong 1 hour ago
Comment by stevefan1999 54 minutes ago
Comment by mcny 3 hours ago
Comment by dimitrisnl 3 hours ago
Comment by alerighi 3 hours ago
Of course this is not always the case that is bad, for example if you have a lot of relations you can have only one table where you have the UUID field (and thus expensive index), and then the relations could use the more efficient int key for relations (for example you have an user entity with both int and uuid keys, and user attribute references the user with the int key, of course at the expense of a join if you need to retrieve one user attribute when retrieving the user is not needed).
Comment by torginus 2 hours ago
Comment by edding4500 3 hours ago
original answer: because if you dont come up with these ints randomly they are sequential which can cause many unwanted situations where people can guess valid IDs and deduce things from that data. See https://en.wikipedia.org/wiki/German_tank_problem
Comment by javawizard 3 hours ago
Edit: just saw your edit, sounds like we're on the same page!
Comment by javaunsafe2019 3 hours ago
Comment by jcims 3 hours ago
Comment by dsego 2 hours ago
Comment by grim_io 3 hours ago
One more reason to stay away from microservices, if possible.
Comment by bardsore 3 hours ago
Comment by grim_io 2 hours ago
Comment by mrkeen 2 hours ago
If you separate them (i.e. microservices) the they no longer try to use one db.
Comment by grim_io 2 hours ago
Sometimes it might even be for a good reason.
Comment by BartjeD 2 hours ago
But that also adds complexity.
Comment by parpfish 5 minutes ago
- If you use uuids as foreign keys to another table, it’s obvious when you screw up a join condition by specifying the wrong indices. With int indices you can easily get plausible looking results because your join will still return a bunch of data
- if you’re debugging and need to search logs, having a simple uuid string is nice for searching
Comment by bux93 3 hours ago
Aren't people using (big)ints are primary keys, and using UUIDs as logical keys for import/export, solving portability across different machines?
Comment by Sayrus 3 hours ago
Comment by bruce511 2 hours ago
Consider say weather hardware. 5 stations all feeding into a central database. They're all creating rows and uploading them. Using sequential integers for that is unnecessarily complex (if even possible.)
Given the amount of data created on phones and tablets, this affects more situations than first assumed.
It's also very helpful in export / edit / update situations. If I export a subset of the data (let's say to Excel), the user can edit all the other columns and I can safely import the result. With integer they might change the ID field (which would be bad). With uuid they can change it, but I can ignore that row (or the whole file) because what they changed it to will be invalid.
Comment by nrhrjrjrjtntbt 2 hours ago
Comment by jwr 3 hours ago
(in the scientific reporting world this would be the perennial "in mice")
Comment by hyperpape 3 hours ago
It would be the equivalent of "if you're a middle-aged man" or "you're an American".
P.S. I think some of the considerations may be true for any system that uses B-Tree indexes, but several will be Postgres specific.
Comment by orthoxerox 3 hours ago
If you use UUIDv7, you can partition your table by the key prefix. Then the bulk of your data can be efficiently skipped when applying updates.
Comment by kijin 3 hours ago
Just the other day I delivered significant performance gains to a client by converting ~150 million UUIDv4 PKs to good old BIGINT. They were using a fairly recent version of MariaDB.
Comment by splix 2 hours ago
Comment by zelphirkalt 3 hours ago
if they can live with MariaDB, OK, but I wouldn't choose that in the first place these days. Likely Postgres will also perform better in most scenarios.
Comment by kijin 2 hours ago
Had the requirements been different, UUIDv7 would have worked well, too, because fragmentation is the biggest problem here.
Comment by reactordev 2 hours ago
The UUID is 128bits. The first 64bits are a java long. The last 64bits are a java long. Let's just combine the Tenant ID long with a Resource ID long to generate a unique id for this on our platform. (worked until it didn't).
Comment by nrhrjrjrjtntbt 2 hours ago
Comment by reactordev 24 minutes ago
Comment by hk1337 2 hours ago
Comment by mkleczek 1 hour ago
Hash index is ideally suited for UUIDs but for some reason Postgres hash indexes cannot be unique.
Comment by Lucasoato 3 hours ago
Also is it necessary to show uuid at all to customers of an API? Or could it be a valid pattern to hide all the querying complexity behind named identifiers, even if it could cost a bit in terms of joining and indexing?
The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!
Comment by dfox 3 hours ago
While that is often neat solution, do not do that by simply XORing the numbers with constant. Use a block cipher in ECB mode (If you want the ID to be short then something like NSA's Speck comes handy here as it can be instantiated with 32 or 48 bit block).
And do not even think about using RC4 for that (I've seen that multiple times), because that is completely equivalent to XORing with constant.
Comment by kaladin_1 2 hours ago
I would like to hear from others using, for example, Google Spanner, do you have issues with UUID. I don't for now, most optimizations happen at the Controller level, data transformation can be slow due to validations. Try to keep service logic as straightforward as possible.
Comment by p2detar 1 hour ago
> We need to read blocks from the disk when they are not in the PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to inspect the contents of the buffer cache. This is where the big difference between uuidv4 and uuidv7 becomes clear. Because of the lack of data locality in uuidv4 data, the primary key index is consuming a huge amount of the buffer cache in order to support new data being inserted – and this cache space is no longer available for other indexes and tables, and this significantly slows down the entire workload.
Comment by scary-size 3 hours ago
Comment by raxxorraxor 2 hours ago
The issue is that is true for more or less all capability URLs. I wouldn't recommend UUIDs per se here, probably better to just use a random number. I have seen UUIDs for this in practice though and these systems weren't compromised because of that.
I hate the tendency that password recovery flows for example leave the URL valid for 5 minutes. Of course these URLs need to have a limited life time, but mail isn't a real time communication medium. There is very little security benefit from reducing it from 30 minutes to 5 minutes for example. You are not getting "securer" this way.
Comment by cebert 3 hours ago
Comment by andatki 48 minutes ago
Comment by mexicocitinluez 1 hour ago
The ability to know ahead of time what a primary key will be (in lieu of persisting it first, then returning) opened up a whole new world of architecting work in my app. It made a lot of previous awkward things feel natural.
Comment by sgarland 2 minutes ago
Comment by socketcluster 3 hours ago
I'm tired of midwit arguments like "Tech X is N% faster than tech Y at performing operation Z. Since your system (sometimes) performs operation Z, it implies that Tech X is the only logical choice in all situations!"
It's an infuriatingly silly argument because operation Z may only represent about 10% of the total CPU usage of the whole system (averaged out)... So what is promoted as a 50% gain may in fact be a 5% gain when you consider it in the grand scheme of things... Negligible. If everyone was looking at this performance 'advantage' rationally; nobody would think it's worth sacrificing important security or operational properties.
I don't know what happened to our industry; we're supposed to be intelligent people but I see developers falling for these obvious logical fallacies over and over.
I remember back in my day, one of the senior engineers was discussing upgrading a python system and stated openly that the new version of the engine was something like 40% slower than the old version but he didn't even have to explain himself why upgrading was still a good decision; everybody in the company knew he was only talking about the code execution speed and everybody knew that this was a small fraction of the total.
Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for a lot of situations but you don't have to start a cult preaching the gospel of The One True UUID to justify your favorite project's decisions.
I do find it kind of sly though how the community decided to make this UUIDv7 instead of creating a new standard for it.
The whole point of UUID was to leverage the properties of randomness to generate unique IDs without requiring coordination. UUIDv7 seems to take things in a philosophically different path. People chose UUID for scalability and simplicity (both of which you get as a result of doing away with the coordination overhead), not for raw performance...
That's the other thing which drives me nuts; people who don't understand the difference between performance and scalability. People foolishly equate scalability with parallelism or concurrency; whereas that's just one aspect of it; scalability is a much broader topic. It's the difference between a theoretical system which is fast given a certain artificially small input size and one which actually performs better as the input size grows.
Lastly; no mention is made about the complex logic which has to take place behind the scenes to generate UUIDv7 IDs... People take it for granted that all computers have a clock which can produce accurate timestamps where all computers in the world are magically in-sync... UUIDv7 is not simple; it's very complicated. It has a lot of additional complexity and dependencies compared to UUIDv4. Just because that complexity is very well hidden from most developers, doesn't mean it's not there and that it's not a dependency... This may become especially obvious as we move to a world of robotics and embedded systems where cheap microchips may not have enough Flash memory to hold the code for the kinds of programs required to compute such elaborate IDs.
Comment by thraxil 1 hour ago
Comment by kunley 3 hours ago
Comment by christophilus 2 hours ago
Comment by danparsonson 1 hour ago
To your original point, these are heuristics; there isn't always time to dig into every little architectural decision, so having a set of rules of thumb on hand helps to preempt problems at minimal cognitive cost. "Avoid using a GUID as a primary key if you can" is one of mine.
Comment by kunley 2 hours ago
Comment by socketcluster 2 hours ago
If the client POSTs a new object to insert it into the database; if there is a connection failure and the client does not receive a success response from the server, the client cannot know whether the record was inserted or not without making an expensive and cumbersome additional read call to check... The client cannot simply assume that the insertion did not happen purely on the basis that they did not receive a success response. It could very well be that the insertion succeeded but the connection failed shortly after so response was not received. If the IDs are auto-incremented on the server and the client posts the same object again without any ID on it, the server will create a duplicate record in the database table (same object with a different ID).
On the other hand, if the client generates a UUID for the object it wants to create on the front-end, then it can safely resend that exact object any number of times and there is no risk of double-insertion; the object will be rejected the second time and you can show the user a meaningful error "Record was already created" instead of creating two of the same resource; leading to potential bugs and confusion.
Comment by kunley 58 minutes ago
Comment by socketcluster 44 minutes ago
Comment by mkleczek 1 hour ago
Using idempotency identifier is the last resort in my book.
Comment by socketcluster 49 minutes ago
Comment by old8man 1 hour ago
Comment by sobakistodor 2 hours ago
Comment by oldpersonintx2 4 hours ago
Comment by waynenilsen 3 hours ago
Comment by mrits 2 hours ago