Question

Say I have 500MB of memory to work with, but 100 terabytes of IDs I want to generate. I want these IDs (GUIDs) to be randomly selected and applied to records, so they shouldn't appear in order when selected. Also, the space of possible IDs is enormous (1 billion petabytes in this example), which is far too many to actually generate, so I only select 100 terabytes' worth out of the possible space. I'm not sure exactly how many IDs that is, but say it's on the order of 10^32 or some similarly large number. So basically, this is the situation:

  1. A gigantic space of possible IDs (10^32, or more precisely I am focusing on the number being 1 billion petabytes, an unreasonably large amount of data).
  2. A subset of these IDs which are randomly selected. In this case, 100 terabytes of IDs.
  3. A computer that is limited to 500MB or so of memory, so the IDs needed can't possibly all fit into memory.
  4. The computer only has 10GB of disk space.

The question is how to architect a fast system for generating these IDs.

After some thought my approach was to consider building a trie. Say the IDs are 32 characters matching /[0-9]/. Then we would generate the equivalent of 100 terabytes of IDs in the trie, one character at a time. Sharing common prefixes removes the need to store duplicate characters, probably saving a few orders of magnitude of memory. But it would still require on the order of 100 terabytes of memory to construct the trie, so that doesn't work.

So basically, in the extreme case where we can't possibly store 100TB on the computer, or even on a few external hard drives, we need the "cloud" to solve this. That is, we need to somehow use the cloud to check whether an ID has already been used, and generate it only if it hasn't.

The question is how to optimally do this so it takes the least amount of time to generate all the IDs.

Not looking for answers like "don't worry about duplicates", or "that's too many IDs to generate". I would like to know how to specifically solve this problem.

What I have resorted to is basically:

  1. Check in "cloud" database if record exists.
  2. If so, try a different value and repeat.
  3. If not, then save it to the database.

But this is extremely slow and would take weeks of computer time to run and generate all the IDs. It seems like there should be a data structure and algorithm that make this faster.


Solution

If you just need the numbers to appear to be random, I have an idea.

The following is just a concept. It is by no means the only (or correct!) way of doing this, and you'll need to tweak it a bit to fit your needs. If you want to use this, use it as a starting point only.


You start with a single arbitrarily-sized number-storing variable set to an arbitrary number. That will be your seed.

Let's use "123456789" as an example.

If you consider our seed as an array of characters s = {1,2,3,4,5,6,7,8,9}, we can apply a fixed permutation to get back a seemingly random number by just swapping those digits around in a fixed order:

 ID1 = s[2] + s[6] + s[0] + s[1] + s[8] + s[3] + s[5] + s[4] + s[7];

You'll end up with the number "371294658", which appears to be random. After you generate this ID, just add 7 (or any large odd number of your liking) to your seed:

Seed = "123456796"

And swap those around again to get your next ID, using the same position swaps as before:

ID2 = "371264659"

And so on.

Seed = "123456803"
ID3 = "381234650"

If you look at the IDs at a glance, they appear similar, but the order in which they were generated isn't evident at all without further examination.

ID1 = 371294658
ID2 = 371264659
ID3 = 381234650

The numbers go up and down, there are gaps between them, and so on. While it is not impossible (nor hard) for a clever attacker to figure out what we are doing here, for most purposes this provides a non-sequential, unique number for you to play with, and it is a pretty fast way to generate one. More importantly, it allows your backend to "unpack" an ID to restore the original sequential number if the need ever arises. It also creates the possibility of detecting tampering, by looking for IDs that don't follow the generator's rules.
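The scheme above fits in a few lines. Here is a toy Python sketch; the swap order and the +7 step are taken straight from the example, and fixing the width at 9 digits is my assumption (a real version would handle seeds that outgrow the width):

```python
# Sketch of the permute-and-increment ID scheme described above.
SWAP = [2, 6, 0, 1, 8, 3, 5, 4, 7]  # fixed permutation of digit positions

def make_ids(seed: int, count: int, step: int = 7):
    ids = []
    for _ in range(count):
        s = str(seed).zfill(len(SWAP))       # seed as a fixed-width digit array
        ids.append("".join(s[i] for i in SWAP))  # apply the position swaps
        seed += step                         # advance the hidden sequential seed
    return ids

print(make_ids(123456789, 3))
# → ['371294658', '371264659', '381234650']
```

Note that "unpacking" an ID back to its seed is just the inverse permutation, which is why the backend can always recover the original sequential number.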


I made a small prototype of this on LinqPad, using a rather large number as a "salt", and the generated IDs came out looking suitably scrambled.

OTHER TIPS

This smells of homework or an interview question, especially the "10 GB of disk space" part. But here goes:

  1. Generate your IDs using a cryptographically-strong random number generator.
  2. Use your 10GB of disk space as an on-disk bit array (in effect a one-hash Bloom filter):

    • Hash the generated ID and take it mod 80G (10GB at 8 bits per byte) to get a bit index.
    • Check the bit at that index.
    • If the bit is clear, set it and use the ID.
    • If the bit is already set, go back to step 1.
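A hedged sketch of those steps in Python, shrunk to an in-memory bytearray so it runs anywhere (the real setup would memory-map or seek into the 10GB on-disk bitmap; the names and sizes here are mine):

```python
import hashlib
import secrets

# Demo-sized bit array standing in for the 10GB (80G-bit) on-disk bitmap.
bits = bytearray(1 << 20)   # 1 MiB here; 10 GB in the real setup
NBITS = len(bits) * 8

def issue_id() -> str:
    while True:
        candidate = secrets.token_hex(16)            # step 1: crypto-strong 128-bit ID
        digest = hashlib.sha256(candidate.encode()).digest()
        idx = int.from_bytes(digest, "big") % NBITS  # step 2: hash and mod the bit count
        byte, bit = divmod(idx, 8)
        if not (bits[byte] >> bit) & 1:              # step 3: bit clear → unseen, claim it
            bits[byte] |= 1 << bit
            return candidate
        # step 4: bit already set → possible duplicate, draw a fresh ID
```

With a single hash function the false-positive rate is just (bits set)/(total bits), so retries stay rare until the bitmap fills up.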

What you can do is generate an ordered list of GUIDs and then read from that list at random positions, marking each ID you use. Whenever you pick an already-used ID, skip to the next.

Building the ordered list is an O(n) operation, and picking a random item is O(1).

If you instead generate IDs randomly and check each one for uniqueness, it's probably going to be an O(n log n) operation, which will take quite a long time considering your amounts of data.
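A toy illustration of this answer in Python; shuffling the ordered list up front gives the same "random picks without repeats" effect while avoiding the skip loop entirely (the sizes here are made up, and at 100TB the shuffle would have to happen on external storage):

```python
import random

# Ordered list of candidate IDs -- an O(n) build (toy size here).
ids = [f"{i:032d}" for i in range(1_000)]
random.shuffle(ids)  # pre-randomize so every pick is O(1) with no skipping

def take() -> str:
    return ids.pop()  # each call returns a distinct ID in random order
```

The trade-off versus the skip-and-retry version is that the shuffle pays the randomization cost once, instead of paying it on every collision.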

If the ID's only need to appear random to a casual observer, then I would use an algorithm like this:

  1. Sequentially generate numbers (remember where you left off in a persistent manner, so a reboot doesn't restart the sequence)
  2. Use the last generated number to create a Name-Based UUID according to RFC4122. The hashing that takes place as part of the UUID creation will make the UUID appear to be randomly generated.
  3. Present the UUID to the user as a 32-character string with symbols chosen from an alphabet with 16 symbols of your choice. There should be a 1-to-1 correspondence between the symbols in your alphabet and the standard hexadecimal alphabet, but that does not have to be apparent.
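Those three steps could look like this in Python; the namespace string and the 16-symbol alphabet are arbitrary placeholders, and the counter would need to be persisted as step 1 requires:

```python
import uuid

NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ids.example")  # assumed namespace
ALPHABET = "BCDFGHJKMNPQRSTV"  # any 16 distinct symbols, one per hex digit

counter = 0  # step 1: in a real system, persist this across reboots

def next_id() -> str:
    global counter
    counter += 1
    u = uuid.uuid5(NAMESPACE, str(counter))  # step 2: RFC 4122 name-based UUID
    # step 3: 1-to-1 remap of the 32 hex digits onto the custom alphabet
    return "".join(ALPHABET[int(c, 16)] for c in u.hex)
```

Because the mapping is 1-to-1, the backend can reverse it and (by keeping the counter) reproduce any issued ID, without the IDs themselves revealing the sequence.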

Use a UUID:

UUID id = UUID.randomUUID();

This is almost certainly unique by itself. Note, though, that java.util.UUID.timestamp() only works for time-based (version 1) UUIDs; it throws for the version 4 UUIDs that randomUUID() produces. If you generate time-based UUIDs instead, you can hold the most recent ones by timestamp:

long timestamp = id.timestamp();

and check those for uniqueness. Estimate your generation throughput to decide how large a window of recent UUIDs to keep; the latest timestamp must be persistent.

Another approach would be to have a file with one bit for every ID and do random access into it, perhaps using one or more memory-mapped buffers.
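One way to sketch that bit file in Python, scaled down so it runs as-is (a real file would be sized at one bit per candidate ID, and the temp file is just for the demo):

```python
import mmap
import os
import tempfile

SIZE = 1 << 20                 # 1 MiB demo file; scale up to one bit per ID
fd, path = tempfile.mkstemp()  # throwaway backing file for the demo
os.ftruncate(fd, SIZE)         # pre-size it; the OS zero-fills the bytes
bm = mmap.mmap(fd, SIZE)       # random bit access goes through the page cache

def mark_used(idx: int) -> bool:
    """Set bit idx; return True if it was previously clear (ID is fresh)."""
    byte, bit = divmod(idx, 8)
    was_clear = not (bm[byte] >> bit) & 1
    bm[byte] |= 1 << bit
    return was_clear
```

The memory map means only the touched pages occupy RAM, which is what makes random access into a file far larger than the 500MB of memory feasible.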

These requirements seem arbitrary and your goals are unclear, but one approach could be to use a secure PRNG and a Bloom filter. A Bloom filter will use a very small fraction of the space required for a trie. The main concern here is the false positive rate. There's a bit of homework to do to figure out the size of the filter and your hash functions.

A very simple solution: keep a count of the number of hashes issued so far, n. Generate the high-order bits of each hash as a pseudo-random number, and set the low-order bits to n.

All hashes will be guaranteed unique, and their ordering will appear random.
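A minimal sketch of that construction; the 128-bit total and the 64-bit counter field are assumptions, not part of the original answer:

```python
import secrets

COUNTER_BITS = 64  # low bits carry n, so every ID is provably unique

def make_id(n: int, total_bits: int = 128) -> int:
    high = secrets.randbits(total_bits - COUNTER_BITS)  # random-looking top half
    return (high << COUNTER_BITS) | n                   # unique bottom half
```

Uniqueness never depends on the random half, so no duplicate check is needed at all; the cost is that anyone who knows the layout can read the issue order out of the low bits.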

Licensed under: CC-BY-SA with attribution