Questioning the impact of UUIDs

https://softwareengineering.stackexchange.com/questions/350129

13-01-2021

Question

When I first played with a NoSQL database, I became aware of the impact of UUIDs in a distributed system.

MongoDB defaults to ObjectIDs, but I've always questioned in which cases UUID (RFC4122) would be a better choice.

I found out that ObjectIDs are great. Not only are they smaller than UUIDs, which saves disk space, but they are also more efficient overall.

Once I was told:

contrary to UUIDs, ObjectIds are monotonic ... Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.
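To illustrate the "easy-to-extract timestamp" mentioned in the quote, here is a minimal sketch that reads the creation time from an ObjectId's hex form using only the Python standard library. The function name `objectid_timestamp` and the sample value are made up for illustration; the only assumption about the format is that the first 4 bytes hold big-endian seconds since the Unix epoch.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex: str) -> datetime:
    """Return the creation time embedded in a 24-character hex ObjectId."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    # Assumption: the leading 4 bytes are big-endian seconds since the epoch.
    seconds = int.from_bytes(raw[:4], "big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical sample value: a timestamp prefix followed by zeroed bytes.
sample = "5f000000" + "00" * 8
print(objectid_timestamp(sample))
```

Because the timestamp comes first, sorting IDs by their bytes also sorts them roughly by creation time, which is what makes paging and a default sort by ID possible.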

Now I've been thinking that, while UUIDs might be a YAGNI case, MongoDB ObjectIDs might be premature optimization.

If I default to UUIDs, someone might claim that I am wasting performance and disk space. However, if I choose ObjectIDs first, I might find out later that I need IDs with a lower collision risk. The former is about performance, the latter about design limitations.

Because of my lack of experience with UUIDs, I am not sure whether I should worry more about performance or freedom.

Which one should be the default ID strategy in projects where the requirements are not clear yet?

UPDATE

I am concerned about a database vendor feature (IDs) leaking into my application layer. Am I being too paranoid by sacrificing efficiency for the sake of independence and abstraction?

Solution

On the question of whether to avoid UUIDs because 'You Ain't Gonna Need It':

Given the wide acceptance of the UUID standard and the many standard libraries that can generate them, a UUID is usually the easiest way to generate a unique ID.

For this reason, UUID should be the default option for any ID. You can make that decision at the design stage without considering database technology or architecture.

Any other ID format should be a choice forced by performance or cost.

Given the distributed nature of NoSQL and the generally low cost of disk space, these are problems you are unlikely to encounter.
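As a rough illustration of how little is involved, the sketch below generates a random (version 4) UUID with Python's standard `uuid` module, with no database driver involved. The helper name `new_id` is hypothetical. It also prints the binary size, which is the disk-space difference under discussion: 16 bytes against the 12 bytes of an ObjectId.

```python
import uuid


def new_id() -> uuid.UUID:
    """Generate a random (version 4) UUID as defined by RFC 4122."""
    return uuid.uuid4()


record_id = new_id()
print(record_id)             # canonical 36-character text form
print(len(record_id.bytes))  # size of the binary form in bytes
```

Since the ID is created in the application layer, the same code keeps working if the storage technology changes later.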

Other tips

So a MongoDB ObjectId is a semi-random number generated on the client side:

4 bytes: seconds since the Unix epoch
3 bytes: machine ID
2 bytes: process ID
3 bytes: counter
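The layout listed above can be sketched as follows. This is not driver code: the function name `make_legacy_style_id` is invented, the machine ID is a random stand-in chosen once per process (an assumption, since real drivers may derive it differently), and the counter simply starts at a random value and increments.

```python
import itertools
import os
import time

# Assumption: a random 3-byte value stands in for the machine ID.
_MACHINE_ID = os.urandom(3)
# Counter starts at a random 3-byte value and increments per generated ID.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))


def make_legacy_style_id(now=None) -> bytes:
    """Build a 12-byte ID: 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    pid = os.getpid() & 0xFFFF           # keep only 2 bytes of the process ID
    count = next(_counter) & 0xFFFFFF    # wrap the counter at 3 bytes
    return (
        seconds.to_bytes(4, "big")
        + _MACHINE_ID
        + pid.to_bytes(2, "big")
        + count.to_bytes(3, "big")
    )


print(make_legacy_style_id().hex())
```

Two IDs created in the same second by the same process share their first 9 bytes and differ only in the counter, which is why the uniqueness across clients rests on the machine ID and process ID fields.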

But consider the case where each client creates a record every second and the counter is implemented as an incrementing counter rather than a random value. The timestamp and counter bytes then match across clients, so 7 of the 12 bytes are guaranteed to collide.

That leaves only 5 bytes of random data: 3 for the machine ID (about 16 million values), which presumably stays the same for each client, and 2 for the process ID (about 65 thousand values), which is presumably 'rerolled' each time the client starts up.

So you have a low, predictable chance of collision between clients. But if you don't own the client machines, it's outside your control.

The process ID is almost certainly going to collide at some point, depending on how often it's regenerated.

So, in this scenario, you are really down to seeing intermittent collisions, and hence errors, wherever machine IDs match.

If you are deploying to many machines, say a mobile phone app with a customer base in the thousands, I think you would have to consider that risk.
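To put a rough number on that risk, the sketch below uses the standard birthday-problem calculation for the chance that at least two clients end up with the same 3-byte machine ID. It assumes the machine IDs are independent and uniformly random over 24 bits, which the answer does not state and which may not hold for real drivers; the function name `prob_shared_machine_id` is made up. Note that a shared machine ID is only a precondition for an ID collision, not a collision in itself.

```python
def prob_shared_machine_id(clients: int, id_bits: int = 24) -> float:
    """Probability that at least two of `clients` share a machine ID.

    Assumption: each client draws its ID uniformly and independently
    from 2**id_bits possible values.
    """
    space = 2 ** id_bits
    if clients > space:
        return 1.0  # pigeonhole: more clients than possible IDs
    p_all_unique = 1.0
    for already_taken in range(clients):
        p_all_unique *= (space - already_taken) / space
    return 1.0 - p_all_unique


for n in (100, 1_000, 5_000, 10_000):
    print(n, round(prob_shared_machine_id(n), 4))
```

Under these assumptions the probability grows roughly with the square of the number of clients, so a fleet in the thousands is where a shared machine ID stops being a rare event.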

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange