Fast hashing algorithm for mapping/quantizing a collection of floats onto a small set of sorted floats
29-09-2020
Question
With my application, I have
- Collection X: thousands of floating-point numbers with the value range [0, 1], not sorted.
- Collection Y: 11 floats ranging from [0, 1], sorted.
- The size of X is known. Let it be L.
The goal is to quantize X by mapping it onto Y, so that we get an array of indices into Y, one per element of X. Eventually each index will be used to select one of the 10 discrete values that Y points to.
An example to make it a bit clearer,
Y = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
X = [0.678, 0.124, ..., 1.0, ..., 0.013, 0.475]
I want the algorithm output to be 0-based indices like these:
H(X[0]) = H(0.678) = 6
H(X[1]) = H(0.124) = 1
H(X[L-2]) = H(0.013) = 0
H(X[L-1]) = H(0.475) = 4
Attempts
Naively, I've tried linear and binary search to position each element of X between an adjacent pair of elements in Y.
However, the performance is not good enough for my application: the quantization happens in a realtime thread, where slow computation is undesirable.
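For reference, the binary-search baseline described above can be sketched as follows, using the example Y from the question. The names `Y`, `NY`, and `quantize_bsearch` are illustrative, not from the post:

```c
#include <assert.h>

/* The example Y from the question: 11 sorted floats in [0, 1]. */
static const double Y[] = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                           0.6, 0.7, 0.8, 0.9, 1.0};
#define NY ((int)(sizeof Y / sizeof Y[0]))

/* Return the largest index i such that Y[i] <= x,
   i.e. the bucket that x falls into. */
static int quantize_bsearch(double x)
{
    int lo = 0, hi = NY - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;  /* upper middle so the loop terminates */
        if (Y[mid] <= x)
            lo = mid;                 /* Y[mid] is still a valid candidate */
        else
            hi = mid - 1;
    }
    return lo;
}
```

This costs O(log |Y|) comparisons per element of X, which matches the expected indices in the example (0.678 → 6, 0.475 → 4) but was still too slow for the asker's realtime thread.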
Question
What is the best way to do this kind of hashing/quantization, given that X is not sorted?
Thanks!
Solution
Take x, multiply it by say 10,000, and round down to the nearest integer. In plain C: z = (int)(x * 10000.0).
There are 10,000 possible values of z. For most values of z, you can determine the index from z alone. So create a table with 10,000 entries. In each entry, store an index i if you can prove that every x producing that z must map to i, and store -1 if you can't prove this.
As a result, you get the correct index directly for probably 9,980 of the 10,000 entries, and you use whatever slow algorithm you have for the remaining 1 in 500 values.
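A minimal sketch of this table approach, assuming the 11-element Y from the question; `build_table`, `quantize_fast`, and `quantize_slow` are hypothetical names, and the epsilon-based interval check below is a simplification of the "prove" step (any cell it can't decide safely falls back to the slow path):

```c
#include <assert.h>

#define TABLE_SIZE 10000

static const double Y[] = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                           0.6, 0.7, 0.8, 0.9, 1.0};
#define NY ((int)(sizeof Y / sizeof Y[0]))

static int table[TABLE_SIZE];

/* Slow fallback: linear scan for the largest i with Y[i] <= x. */
static int quantize_slow(double x)
{
    int i = NY - 1;
    while (i > 0 && Y[i] > x)
        i--;
    return i;
}

/* Precompute once: entry z gets index i if the whole cell
   [z/TABLE_SIZE, (z+1)/TABLE_SIZE) maps to the same i, else -1.
   The 1e-12 nudge keeps the upper probe inside the half-open cell;
   a rigorous version would reason about exact double boundaries. */
static void build_table(void)
{
    for (int z = 0; z < TABLE_SIZE; z++) {
        double lo = (double)z / TABLE_SIZE;
        double hi = (double)(z + 1) / TABLE_SIZE;
        int ilo = quantize_slow(lo);
        int ihi = quantize_slow(hi - 1e-12);
        table[z] = (ilo == ihi) ? ilo : -1;
    }
}

/* Fast path: one multiply, one truncation, one table load. */
static int quantize_fast(double x)
{
    int z = (int)(x * TABLE_SIZE);
    if (z >= TABLE_SIZE)
        return NY - 1;        /* x >= 1.0 maps to the top entry of Y */
    int i = table[z];
    return (i >= 0) ? i : quantize_slow(x);
}
```

The table is built once outside the realtime thread; afterwards each lookup is O(1) with no searching, except for the few ambiguous cells marked -1.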
PS. The same table size works for double-precision numbers. Whatever the table size, only a few values of X cannot be mapped directly with this method, maybe 10 or 20. With a table of size 10,000, 99.8% or 99.9% of values are mapped directly, and 0.1% or 0.2% need the slow algorithm; the exact same thing happens with double. If you used 1,000 entries instead, those 10 or 20 failing cells would be 1% or 2%.
And a nice property is that this method works however the Y values are distributed. Only if the number of Y values grows much larger might you want to increase the table size.