Frage

I have a problem I'd like to code. I have a process which generates numbers 0 through n-1 and I want to stop it when it generates the first repeated element.* I'm looking for a data structure that makes this fast. In particular, adding a new element and testing if an element is in the structure need to be fast. The expected number of insertions is around sqrt(n) (birthday problem) or actually a bit worse (say sqrt(2n)) because the process slightly favors unique values. In other words, it is rather sparse -- working with the numbers up to a billion only about 30 or 50 thousand values will be used.

A hash set or some kind of self-balancing binary tree seems like the right approach, but maybe there's a better way? For small n I think a bit array would be superior but I'm looking at n around 10^9 which is too large for that to be practical I think.

* Actually, it doesn't need to stop right away -- if it's more efficient you can generate elements in blocks and check every now and then.


Note: This was originally posted on math.se but they recommended that I repost here. It's not research-level and so not suitable for cstheory.se.

War es hilfreich?

Lösung

A hash table is indeed the way to go. A properly optimized hash set of integers can be almost (can't quite ignore the load factor) as space efficient as a plain array while retaining the high performance you'd expect. Use the key as hash value, don't store the hash value twice, keep the table size a power of two (and hence use a bit mask instead of modulo). If you use open addressing and need deletion, you can borrow a bit from the key to mark tombstones.

For 50k items, these optimizations are probably not worth writing your own hash table (though it's a fun exercise in its own right!). If you can use the existing hash set in your language of choice, use it. Otherwise, see Fast and Compact Hash Tables for Integer Keys for a survey and benchmark of various approaches, and consider Robin Hood Hashing which is very easy to implement, has decent worst-case guarantees, and although it's not mentioned in the aforementioned paper, it's quite fast in my experience (mostly because it's a simple modificationof linear probing and inherits its advantages). My C implementation — unfortunately not public yet — is not even 250 lines of code including blanks and comments, none of which are tricky (in comparision to other hash tables). This is without any micro optimizations.

Andere Tipps

I think the best data structure is hashTable. And the most important part is hash function either you can create your own or you can use MurmurHash/ CityHash for uniform distribution.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top