Building a “sparse” lookup array minimizing memory footprint

https://stackoverflow.com/questions/9200732

18-05-2021
|

Question

let's say I want to build an array to perform a lookup to parse network protocols (like an ethertype). Since such an identifier is 2-byte long, I would end up with a 2^16 cells array if I use direct indexing: this is a real waste, because it is very likely that the array is sparse - i.e. lots of gaps into the array.

In order to reduce memory usage to the maximum, I would use a perfect hashing function generator like CMPH, so that I can map my "n" identifiers to a n-sized array without any collision. The downside of this approach is that I have to rely on an external "exoteric" library.

I am wondering whether - in my case - there are smarter ways to have a constant time lookup while keeping at bay memory usage; bear in mind that I am interested in indexing 16-bit unsigned numbers and the set size is quite limited.

Thanks

Solution

Since you know for a fact that you're dealing with 16-bit values, any lookup algorithm will be a constant-time algorithm, since there are only O(1) different possible values. Consequently, algorithms that on the surface might be slower (for example, linear search, which runs in O(n) for n elements) might actually be useful here.

Barring a perfect hashing function, if you want to guarantee fast lookup, I would suggest looking into cuckoo hashing, which guarantees worst-case O(1) lookup times and has expected O(1)-time insertion (though you have to be a bit clever with your hash functions). It's really easy to generate hash functions for 16-bit values; if you compute two 16-bit multipliers and multiply the high and low bits of the 16-bit value by these values, then add them together, I believe that you get a good hash function mod any prime number.

Alternatively, if you don't absolutely have to have O(1) lookup and are okay with good expected lookup times, you could also use a standard hash table with open addressing, such as a linear probing hash table or double hashing hash table. Using a smaller array with this sort of hashing scheme could be extremely fast and should be very simple to implement.

For an entirely different approach, if you're storing sparse data and want fast lookup times, an option that might work well for you is to use a simple balanced binary search tree. For example, the treap data structure is easy to implement and gives expected O(log n) lookups for values. Since you're dealing with 16-bit values, here log n is about 16 (I think the base of the logarithm is actually a bit different), so lookups should be quite fast. This does introduce a bit of overhead per element, but if you have only a few elements it should be simple to implement. For even less overhead, you might want to look into splay trees, which require only two pointers per element.

Hope this helps!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow