Minimal perfect hash for N number of unknown keys

https://stackoverflow.com/questions/19825366

05-07-2022
|

Question

I have two unsorted arrays of 32-bit unsigned integers, size N1 and N2, respectively. Each array may contain duplicates. I would like to map each value (2^32 possible keys) to a spot in a byte-array of size (N1 + N2) to record frequencies of each key. Duplicate key values should map to the same position in this array. Additionally, the frequency of each integer won't go above 100 (which is why I chose a byte-array to record each key's frequency to save space); if the max possible frequency were to go above this, I would simply change the byte-array to an array of shorts or something.

In the end, I need an array of size N1 + N2 -- not necessarily all entries will be used, as duplicates may have been encountered -- with frequencies of each unique key value. Worst case scenario, only one byte entry will be used (e.g. all values in both arrays are the same) leaving ((N1 + N2) - 1) entries unused. Best case scenario, all byte-entries are used.

From what I understand, I need to find a minimally perfect hashing function to map a known number of unknown keys (N1 + N2; all ranging from 0 - 2^32) to a known number of spots (N1 + N2). I was able to find a few other posts, but both answers basically said use gperf:

Is it possible to make a minimal perfect hash function in this situation?

Minimal perfect hash function

The second one (Minimal perfect hash function) is exactly what I'm attempting to do.

Rather than expecting source code from an answer (I'm using C by the way), I'd much prefer an explanation of how to go about creating a minimally perfect hashing function for N-number of any possible positive integers to N buckets. I could easily do this with a 4 GB array of direct mappings for every possible integer with lots of unused space, but I'd rather try to reduce this massive inefficiency of space. I'm also hoping to not use any external libraries, mostly for educational purposes to learn more about hashing, itself.

Solution

This is clearly impossible. If you have N numbers, there's no way to come up with a function which will hash them all to distinct values in the range [0, N) unless you know what those numbers are going to be beforehand. Otherwise, given any such function (with N < 2^32, of course), there will be at least one pair of integers such that both of those integers hash to the same value, so that function won't be perfect if those integers both show up in the input.

If you relax the conditions to allow the function to be created on the fly, this becomes possible, but only in a really trivial and useless way. Namely, a hash function could build itself up as it goes by recording each number that's fed into it and generating a new unique output for each one (say, counting up from 0). But such a function would need a hash table (or something equivalent) as part of its implementation, so it'd certainly be no use in implementing a hash table!

OTHER TIPS

According to the Pigeonhole Principle, you will have "hash slots" occupied by more than one number. In other words: different numbers will "hash" to the same value.

Now, I wonder if you could benefit from a Bloom Filter. From Wikipedia:

False positive matches are possible, but false negatives are not; i.e. a query returns either "possibly in set" or "definitely not in set".

If something is "definitely" not in the set of keys, you can move on (its frequency is one), and if it possibly is in the set, then you process it further to accumulate its actual statistic.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow