Question

I'm aware there may not be a good solution to this problem. I'm looking for a fast algorithm that determines whether or not a value (let's say an integer) exists in a set of values; however, I also need the lookup to be fast when the value is not in the set. Most methods I can think of right now take longer to determine that a value is absent than that it is present, because they cannot exit early.

I've been experimenting with using the keys of an unordered set, but found lookups to actually be rather slow, and I'm not really sure what's going on behind the scenes with this data type. I'm aware that a straightforward array search might be faster than looking up a key, but the size of the array is not properly known yet, only that it will be very large.

Properties and assumptions:

  • Assume all the values are integers between 0 and N, where N is likely to be in the millions.
  • The list will need to be checked very quickly, several hundred times per millisecond, but usually in roughly the same place each time, with only a gradual deviation in the target area.
  • There is no constraint on the preparation time available for setting up a data structure, i.e. the list could be organised specially beforehand.
  • There will be no duplicate values in the data.

Solution

If there is a known maximum N, you can use a Bit array for really fast lookup time.

Simply keep an array of size N/8 (rounded up), with each bit corresponding to a number: 1 if it is in the set, 0 if it isn't. Look up the relevant bit to check whether a number is in the set.
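A minimal sketch of such a bit array in C++ (the 64-bit word layout and the names are illustrative, not prescribed by the answer):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: a bit array tracking membership of integers in [0, N).
struct BitArraySet {
    std::vector<uint64_t> words;  // one bit per possible value

    explicit BitArraySet(std::size_t n) : words((n + 63) / 64, 0) {}

    void insert(std::size_t x) {
        words[x >> 6] |= (uint64_t{1} << (x & 63));
    }

    bool contains(std::size_t x) const {
        return ((words[x >> 6] >> (x & 63)) & 1) != 0;
    }
};

// Usage: BitArraySet s(N); s.insert(42); bool hit = s.contains(42);
```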

If this is too slow, and you have the megabytes ("millions" doesn't sound very high), then just keep a boolean array of size N.

Edit: for much larger values of N this won't fit into memory, and you probably need some sort of hashing. It'd still be nice to use the fact that subsequent calls are likely to be close to each other, though.

So a sketch of an idea: use the number divided by 256 as the key, and store a 256-bit bit array as the value (for all the numbers that divide to this key). Store the most recently retrieved key and bit array. If the next call divides to the same key, you can look it up in the cached bit array immediately. The number 256 can be tweaked a lot, of course. I'm not sure this is actually faster than just using a map without extra gimmicks, so profile it.
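A minimal sketch of that caching idea, assuming C++ and an unordered map keyed by value / 256 (the names and details here are mine, just to illustrate the suggestion):

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

// Sketch: 256-bit blocks keyed by (value / 256), with the most recently
// used block cached so repeated nearby lookups skip the hash lookup.
struct CachedBlockSet {
    std::unordered_map<uint64_t, std::bitset<256>> blocks;
    uint64_t cachedKey = UINT64_MAX;
    const std::bitset<256>* cachedBlock = nullptr;

    void insert(uint64_t x) {
        blocks[x / 256].set(x % 256);
        cachedBlock = nullptr;  // conservatively drop the cache after a modification
    }

    bool contains(uint64_t x) {
        uint64_t key = x / 256;
        if (key != cachedKey || cachedBlock == nullptr) {
            auto it = blocks.find(key);
            if (it == blocks.end()) return false;  // whole 256-wide block is absent
            cachedKey = key;
            cachedBlock = &it->second;
        }
        return cachedBlock->test(x % 256);
    }
};
```

As the answer says, whether this beats a plain map or a flat bit array depends heavily on the access pattern, so it's worth profiling.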

For large values of N it's going to depend on lot on the sparseness of the data; clearly if you have max N 10^12 and a lot of the values are in the set, then it's not going to fit into memory anyway.

OTHER TIPS

You could generate a binary tree (a radix tree, aka trie). To prepare for the search, take each number in your array and, starting with the least significant bit, go to the left node if the bit is 0 and to the right node if it is 1. Follow existing nodes as far as possible; once there are no more nodes, create the rest of the path one node at a time.

To search, take each bit, starting with the least significant bit: if the bit is 0, go left; if it is 1, go right. No node found? Your number isn't in the tree, and you're done. Otherwise, continue with the next bit from that node. If you run out of bits while on a node, you've found your number.
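A rough sketch of such a bit trie, assuming unsigned 32-bit values (the node layout and names are mine, not the answerer's):

```cpp
#include <cstdint>
#include <memory>

// Sketch: a binary (bitwise) trie keyed from the least significant bit upwards.
struct BitTrie {
    struct Node {
        bool terminal = false;           // a stored number ends at this node
        std::unique_ptr<Node> child[2];  // child[0] for bit 0, child[1] for bit 1
    };
    Node root;

    void insert(uint32_t x) {
        Node* n = &root;
        for (int i = 0; i < 32; ++i) {
            int bit = (x >> i) & 1;
            if (!n->child[bit]) n->child[bit] = std::make_unique<Node>();
            n = n->child[bit].get();
        }
        n->terminal = true;
    }

    bool contains(uint32_t x) const {
        const Node* n = &root;
        for (int i = 0; i < 32; ++i) {
            int bit = (x >> i) & 1;
            if (!n->child[bit]) return false;  // no node: the number isn't in the tree
            n = n->child[bit].get();
        }
        return n->terminal;
    }
};
```

A production version would likely pool the nodes or use a compressed radix layout to cut memory use and pointer chasing.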

It takes O(n log n) to construct the tree and O(log n) to determine whether a number exists. It should be faster than a traditional hash map, but it is also more restrictive, so this implementation would only work with numbers.

The reason you'd start with the least significant bit is to keep the tree somewhat balanced (otherwise a good many of the numbers would likely start with 0). Hope that helps!

I would try to go for the bitfield too, but if the data is sparse and the range too great to make this work, a splay tree might be a good data structure to use.

Splay trees are modified on each access (so they are not thread-safe, which could rule them out for you) to optimize repeated access to the same element: often-used elements bubble to the top. If your access pattern really stays in the same area for a while, this could make for fast lookups in your case.

But this depends on the usage pattern. You still have to traverse the entire height of the tree for non-existent elements.
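For illustration, here is a minimal top-down splay sketch for integer membership (this is the classic splaying scheme written by me as an example, not code from this answer; node cleanup is omitted):

```cpp
#include <cstdint>

// Sketch: a top-down splay tree used as an integer set.
struct SplaySet {
    struct Node {
        int64_t key;
        Node* left;
        Node* right;
    };
    Node* root = nullptr;

    // Splay the node with the given key (or the last node on its search path) to the root.
    static Node* splay(Node* t, int64_t key) {
        if (!t) return nullptr;
        Node header{0, nullptr, nullptr};
        Node* l = &header;  // right spine of the assembled "less than key" tree
        Node* r = &header;  // left spine of the assembled "greater than key" tree
        for (;;) {
            if (key < t->key) {
                if (!t->left) break;
                if (key < t->left->key) {        // zig-zig: rotate right
                    Node* y = t->left;
                    t->left = y->right;
                    y->right = t;
                    t = y;
                    if (!t->left) break;
                }
                r->left = t;                     // link right
                r = t;
                t = t->left;
            } else if (key > t->key) {
                if (!t->right) break;
                if (key > t->right->key) {       // zag-zag: rotate left
                    Node* y = t->right;
                    t->right = y->left;
                    y->left = t;
                    t = y;
                    if (!t->right) break;
                }
                l->right = t;                    // link left
                l = t;
                t = t->right;
            } else {
                break;
            }
        }
        l->right = t->left;                      // reassemble the three trees
        r->left = t->right;
        t->left = header.right;
        t->right = header.left;
        return t;
    }

    bool contains(int64_t key) {
        root = splay(root, key);                 // recently queried keys end up at the root
        return root && root->key == key;
    }

    void insert(int64_t key) {
        root = splay(root, key);
        if (root && root->key == key) return;    // already present
        Node* n = new Node{key, nullptr, nullptr};
        if (!root) { root = n; return; }
        if (key < root->key) {
            n->left = root->left;
            n->right = root;
            root->left = nullptr;
        } else {
            n->right = root->right;
            n->left = root;
            root->right = nullptr;
        }
        root = n;
    }
};
```

Note that contains() mutates the tree, which is exactly the property that makes splay trees unsuitable for concurrent readers.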

We were told numbers could be up to about 10^12 and about 20% of numbers would be used. That is 2 * 10^11 numbers actually used. With these numbers, the best you can do is a bitmap of about 125 GBytes, and if you can't afford the RAM, an SSD drive and virtual memory will have to do. You should probably measure whether "normal" file access is faster than relying on virtual memory. Obviously a bitmap that exceeds physical memory will be slooooooow. It's still the fastest that you can do.

If the map is less dense, then you can do it quicker. Say you have a billion numbers from 0 to 10^12, so on average a range of a thousand numbers contains only one number. You could have a bitmap that represents the fact that "the range 128·k to 128·k + 127 contains at least one number". That takes about 8 billion bits, or one GB, fitting into RAM, so you check this map first; the chance is roughly 87% that you find there is no such number, so whatever slower algorithm you wanted to use will only be needed for about one in eight lookups.
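A sketch of that two-level idea, with a coarse bitmap in front of whatever exact (but slower) structure holds the actual values (the 128-wide buckets and the names are just illustrative):

```cpp
#include <cstdint>
#include <vector>

// Sketch: a coarse "does the 128-wide range around x contain anything?" bitmap
// placed in front of a slower exact membership structure. ExactSet is a
// placeholder type that must provide insert() and contains().
template <typename ExactSet>
struct CoarseFilteredSet {
    std::vector<uint64_t> coarse;  // bit k set => some value in [128*k, 128*k + 127] is present
    ExactSet exact;                // the slower, exact structure (e.g. on-disk or hashed)

    explicit CoarseFilteredSet(uint64_t maxValue)
        : coarse(maxValue / 128 / 64 + 1, 0) {}

    void insert(uint64_t x) {
        uint64_t bucket = x / 128;
        coarse[bucket >> 6] |= (uint64_t{1} << (bucket & 63));
        exact.insert(x);
    }

    bool contains(uint64_t x) const {
        uint64_t bucket = x / 128;
        if (((coarse[bucket >> 6] >> (bucket & 63)) & 1) == 0)
            return false;           // the cheap filter rejects most absent values
        return exact.contains(x);   // fall through to the slower structure
    }
};
```

For the numbers above (a 10^12 range with 10^9 values), the coarse bitmap is roughly a gigabyte and answers most negative queries without ever touching the slow path.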

Obviously your access pattern will be incredibly important. If 99% of the time you are searching for numbers that are actually present then the method described above will be useless.

The most efficient and general-purpose solution I've come up with for this problem, and one that works specifically for integers (though still widely applicable in my case, since I refer to just about everything by index when using auxiliary data structures to accelerate searches), is what I call a "sparse bit tree". I don't know if there's a formal name for this data structure. It can perform set operations in better than linear time without having to inspect all elements (bits) of both sets: for example, if it finds a set bit (1) at the root level of both sets, it can skip over a million elements and know immediately, from a single bit test, that both sets have those million elements in common.

As a result it has a best-case performance of something like log(N) / log(64): it can find the intersection of two dense sets containing a million elements each in 3-4 iterations, with each iteration only having to check one bit. The worst case is N * log(N) / log(64), where it might have to perform around 3 to 4 million iterations to determine the intersection, but that's still close to linear time, and it doesn't happen unless the data is very awkwardly sparse with no contiguous indices (say, a case where every integer in the set is an even number).

I wouldn't be surprised if someone else has already come up with the same idea long before me (if anyone knows the name for it, I'd much appreciate it). It's just something I made up and have been using for ages (the original version used 16-bit words and has since evolved towards 64-bit).

[Diagram of the sparse bit tree structure; the rapid linear traversal discussed below is noted in its bottom-right corner.]

It is not the most trivial data structure to implement, but it has served me well as the backbone for many projects. The diagram should give enough of an idea to implement it for those who are interested. It can be used as an integer/index set with set operations. Unlike most sets, it can also rapidly find an unused integer/index (the first integer not in the set) using ffs instructions on the complement, and so figure out what index to insert an element at in an array with "holes" in it, allowing rapid removal without invalidating indices to other elements, etc.
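Here is one possible reading of the structure as a two-level sketch in C++ (the real thing presumably repeats the layering to more levels; the "any"/"all" split and all names are my interpretation, not the author's code):

```cpp
#include <bit>       // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: 64-bit leaf words plus two summary bitsets over them. A set bit in
// "all" means the corresponding 64-index block is completely full, which is
// what lets set operations and loops skip whole ranges with a single bit test.
struct HierarchicalBitSet {
    std::vector<uint64_t> leaves;  // one bit per index
    std::vector<uint64_t> any;     // bit w set => leaves[w] has at least one bit set
    std::vector<uint64_t> all;     // bit w set => leaves[w] is completely set

    explicit HierarchicalBitSet(std::size_t n)
        : leaves((n + 63) / 64, 0),
          any((leaves.size() + 63) / 64, 0),
          all((leaves.size() + 63) / 64, 0) {}

    void insert(std::size_t x) {
        std::size_t w = x >> 6;
        leaves[w] |= (uint64_t{1} << (x & 63));
        any[w >> 6] |= (uint64_t{1} << (w & 63));
        if (leaves[w] == ~uint64_t{0})
            all[w >> 6] |= (uint64_t{1} << (w & 63));
    }

    bool contains(std::size_t x) const {
        std::size_t w = x >> 6;
        if ((all[w >> 6] >> (w & 63)) & 1) return true;     // whole block known full
        if (!((any[w >> 6] >> (w & 63)) & 1)) return false; // whole block known empty
        return ((leaves[w] >> (x & 63)) & 1) != 0;
    }

    // The "ffs on the complement" trick: smallest index not in the set.
    std::size_t firstMissing() const {
        for (std::size_t aw = 0; aw < all.size(); ++aw) {
            uint64_t notFull = ~all[aw];                    // leaf words with room left
            while (notFull) {
                std::size_t w = (aw << 6) + std::countr_zero(notFull);
                if (w >= leaves.size()) return leaves.size() << 6;
                uint64_t free = ~leaves[w];
                if (free) return (w << 6) + std::countr_zero(free);
                notFull &= notFull - 1;                     // clear lowest bit, keep going
            }
        }
        return leaves.size() << 6;                          // every tracked index is in use
    }
};
```

Removal would also have to clear the summary bits, which this sketch omits.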

I actually found it so useful that I think it has shaped the way I design codebases against it to the point where I now want to store indices to everything stored in central arrays and avoid pointers when possible. There's that whole tendency to see everything as a nail when you're wielding a hammer, but in this case it's an extremely efficient hammer that trivializes getting a performant solution. I ended up finding even more use for it than ever before after working my way towards entity-component systems which want to perform set intersections all the time ("find all entities that implement X,Y,Z components").

Loop Acceleration

The rapid linear traversal noted in the bottom right of the diagram is a huge one for me, since many loops in my projects inevitably can't do better than linear ("for each element in the set, do some light work" type loops). This often makes the loop cheaper than iterating through an array of indices, because it touches very little memory: a single set bit towards the top of the hierarchy can indicate that the indices in the range [8388608, 16777216) are in the set, so inspecting one bit replaces inspecting over 8 million bits or, worse, over 8 million integers.

As a result, I often use this structure even when I don't need a set, as a replacement for, say, an array of integers (int[] or vector<int>), if I'm going to be looping over the result many times. It serves as a loop accelerator when the indices consist predominantly of contiguous ranges, since in those cases it's much faster to iterate through than an array of integers. Even disregarding that, it also means the loop will access the original elements in sorted order, which is typically the most cache-friendly access pattern, as opposed to the sporadic random access you might get by looping through an array of unsorted indices.
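As a sketch of the iteration side, reusing the HierarchicalBitSet above (again just an illustration of the idea): the "any" summary lets the loop skip empty 64-index blocks entirely.

```cpp
// Sketch: visit every index stored in a HierarchicalBitSet in ascending order.
// Empty 64-index blocks are skipped by scanning the "any" summary words.
template <typename Visit>
void forEachIndex(const HierarchicalBitSet& s, Visit visit) {
    for (std::size_t aw = 0; aw < s.any.size(); ++aw) {
        uint64_t blocks = s.any[aw];                  // non-empty leaf words in this group
        while (blocks) {
            std::size_t w = (aw << 6) + std::countr_zero(blocks);
            uint64_t bits = s.leaves[w];
            while (bits) {
                visit((w << 6) + std::countr_zero(bits));
                bits &= bits - 1;                     // clear the lowest set bit
            }
            blocks &= blocks - 1;
        }
    }
}

// Usage: forEachIndex(set, [](std::size_t i) { /* light work per element */ });
```

With deeper layering, a single fully-set summary bit likewise lets you emit an entire contiguous range without reading any leaf words, which is where the big wins over a plain index array come from.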

Sparse Data

And because it's reasonably well-suited for sparse data, you don't have to worry too much about hogging a boatload of memory if the indices/integers are very sparse and have a huge range (as long as it's not too huge: a range in the millions is perfectly fine, while billions or trillions might call for something better suited to extremely sparse representations).

Temporal Locality

Now, funnily enough, for testing the presence of a single integer it has a worse algorithmic complexity, with a worst case of something like log(N) / log(64) iterations (or bitwise operations), than a simple huge array of bits (bitset), where you can test a specific bit in constant time with a bit shift and a bitwise AND. However, it tends to still be faster because of the temporal locality that comes from often being able to determine whether the nth bit is set without drilling down to the leaves, especially when the indices are reasonably dense (for super-sparse cases, you might go straight to the leaf nodes and treat it like a gigantic array of bits instead of a tree).

Just to add a suggestion to the other solutions.

In the biomedical world we often search for specific strings of data (e.g. genetic patterns).

Research has shown that searching the parent string in the reverse direction, from the last entry to the first, is faster than going forward through the string.

I guess this is somehow related to slurping all the data into memory, it then being quicker to pop the last entry off the end.

I remember being really chuffed when I had this idea; I thought I was thinking well outside the box.

I expect the same principle would also apply to simple arrays, where the search time is proportional to the size.

Licensed under: CC-BY-SA with attribution