سؤال

I have a set of 1,000,000 unique numbers. The numbers are in the interval between 0 an 50,000,000. Consider, the numbers are random. I need a data structure which would hold them all. The data-structure should require as little memory as possible. It should be possible to find quickly whether the number is in the set with no errors.

I found a solution with bloom filter. Yes, bloom filter has a probability of false positives, but since there are "just" 50,000,000 possible numbers, I can find all the errors and keep them in the std::set. By this method, I'm able to store all the numbers in 2.3MB of memory.

Can you find a better method?

هل كانت مفيدة؟

المحلول

Rather than a range of 0 to 50,000,000, how about 1,024 separate ranges of 65,536? That'd give you a 64 MB range. I suppose you can make it 763 rather than 1,024, which will give you 50,003,968.

Something like ushort[763][];

Now you're storing 1,000,000 16-bit values rather than 32-bit values.

The values in the rows are sorted. So to determine if a number is in the set, you divide the number by 763 to figure out which array to look in, and then do a binary search on number % 65536.

Storage for the numbers themselves is 2,000,000 bytes. Plus a small amount of overhead for the arrays.

This will be faster in execution, smaller than your Bloom filter approach, no false positives, and a whole lot easier to implement.

نصائح أخرى

The minimum space to store such a vector in general is 884,002 bytes. That stores an integer index (a very large integer) into the list of all possible choices of 1,000,000 out of 50,000,000.

You can get close to that with a simple, fast byte encoding. Given the sorted list of numbers, replace each number with the difference from the last number. (Assume that -1 precedes the first number.) The differences are all one or more, so subtract one. If the result is 254 or less, code it as a byte. Otherwise, write 255, and follow with two bytes with a larger difference minus 255. If it doesn't fit in that, then write three 255's, and follow with three bytes with the difference. This will almost always code the vector in less than 1,012,000 bytes.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top