Question

From the answers to (When) is hash table lookup O(1)?, I gather that hash tables have $O(1)$ worst-case behavior, at least amortized, when the data satisfies certain statistical conditions, and there are techniques to help make these conditions broad.

However, from a programmer's perspective, I don't know in advance what my data will be: it often comes from some external source. And I rarely have all the data at once: often insertions and deletions happen at a rate that's not far below the rate of lookups, so preprocessing the data to fine-tune the hash function is out.

So, taking a step back: given some knowledge about the data source, how can I determine whether a hash table has a chance of having $O(1)$ operations, and possibly which techniques to use on my hash function?


Solution

There are several techniques that guarantee that lookups will always require O(1) operations, even in the worst case.

How can I determine whether a hash table has a chance of having O(1) operations, and possibly which techniques to use on my hash function?

The worst case happens when some malicious attacker (Mallory) deliberately gives you data that Mallory has specifically selected to make the system run slow.

Once you have picked some particular hash function, it's probably over-optimistic to assume Mallory will never find out which hash function you picked. And once Mallory discovers which hash function you picked, if you allow Mallory to give you lots of data to be inserted into your hash table using that hash function, then you are doomed: Mallory can rapidly generate billions of candidate data items offline, hash them with your hash function to find which ones are likely to collide, and then feed you millions of those (perhaps one-in-a-thousand) colliding items, leading to lookups that run much slower than O(1).

All the techniques that guarantee "O(1) lookups even in the worst case" avoid this problem by doing a little bit of extra work on each insertion to guarantee that, in the future, every possible lookup can succeed in O(1) time. In particular, we assume (worst case) that Mallory will sooner or later discover which hash function we are using; but he only gets a chance to insert a few data items before we pick a different hash function -- tabulation hashing or some other universal hashing -- one that we specially select such that all the data we have so far can be looked up in 2 or 3 probes, i.e., O(1). Because we randomly select this function, we can be fairly sure that Mallory won't know what function we picked for a while. Even if Mallory immediately gives us data that, even with this new hash function, collides with previous data, we can then pick yet another fresh hash function such that, after rehashing, all the previous data he and everyone else has fed us can now be looked up in 2 or 3 probes in the worst case -- i.e., O(1) lookups in the worst case.
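
As a concrete illustration (a minimal Python sketch under my own parameter choices, not anyone's production design), simple tabulation hashing just fills per-byte lookup tables with fresh random words; "picking a new hash function" amounts to regenerating those tables and rehashing:

```python
import secrets

class TabulationHash:
    """Simple tabulation hashing for fixed-width integer keys (toy sketch)."""

    def __init__(self, key_bytes=8, out_bits=64):
        # One table of 256 random words per key byte; the tables are the secret.
        self.key_bytes = key_bytes
        self.mask = (1 << out_bits) - 1
        self.tables = [
            [secrets.randbits(out_bits) for _ in range(256)]
            for _ in range(key_bytes)
        ]

    def __call__(self, key: int) -> int:
        # XOR together one random word per byte of the key.
        h = 0
        for i in range(self.key_bytes):
            byte = (key >> (8 * i)) & 0xFF
            h ^= self.tables[i][byte]
        return h & self.mask

# Usage: construct a fresh, secretly-seeded function whenever the table is rebuilt.
h = TabulationHash()
bucket = h(123456789) % 1024
```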

It's fairly easy to randomly select a new hash function and rehash the entire table often enough to guarantee that each lookup is always O(1). However, these techniques, when inserting the Nth item into a hash table that already contains N-1 items, can occasionally require O(N) time for that insert. Still, it is possible to design the system such that, even when Mallory deliberately gives you new data that collides under the new hash function, the system can accept lots of items from Mallory and others before it needs to do a full O(N) rebuild. Hash table techniques that pick-a-new-function-and-rehash in order to guarantee O(1) lookups, even in the worst case, include:

  • cuckoo hashing guarantees that each key lookup succeeds with at most 2 hash calculations and 2 table lookups (a small sketch follows this list).
  • hopscotch hashing guarantees that each key lookup succeeds after inspecting at most a small number H (perhaps H = 32) of consecutive entries in the table.
  • dynamic perfect hashing -- the 1994 paper by Dietzfelbinger et al. is the first one I've read that pointed out that, even though it rehashes "frequently" in order to guarantee that each key lookup always succeeds with 2 hash calculations and 2 table lookups, it's possible to do a full rehash so rarely that, even though each full rehash uses O(n) time, the expected amortized cost of insertions and deletions is O(1).
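
To make the cuckoo-hashing idea concrete, here is a minimal Python sketch (a toy with assumed parameters, not a production implementation): lookups probe at most two slots, and an insertion that displaces too many keys triggers the grow-and-rehash-with-new-functions step described above:

```python
import random

class CuckooTable:
    """Toy cuckoo hash table; None is reserved as the empty-slot marker."""

    def __init__(self, capacity=16):
        self.cap = capacity
        self._new_functions()

    def _new_functions(self):
        # "Picking a new hash function" = drawing fresh random seeds.
        self.seeds = (random.getrandbits(64), random.getrandbits(64))
        self.slots = [[None] * self.cap, [None] * self.cap]

    def _pos(self, side, key):
        return hash((self.seeds[side], key)) % self.cap

    def lookup(self, key):
        # Worst case: 2 hash computations and 2 table probes.
        return any(self.slots[s][self._pos(s, key)] == key for s in (0, 1))

    def insert(self, key, max_kicks=32):
        if self.lookup(key):
            return
        cur, side = key, 0
        for _ in range(max_kicks):
            i = self._pos(side, cur)
            cur, self.slots[side][i] = self.slots[side][i], cur
            if cur is None:
                return
            side ^= 1  # re-insert the evicted key on its other side
        # Too many displacements: grow, pick new functions, rehash everything.
        self._rehash(pending=cur)

    def _rehash(self, pending):
        old = [k for table in self.slots for k in table if k is not None]
        self.cap *= 2
        self._new_functions()
        for k in old + [pending]:
            self.insert(k)

t = CuckooTable()
for k in range(100):
    t.insert(k)
assert t.lookup(42) and not t.lookup(1000)
```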

Data Structures/Hash Tables

Other tips

Hash table lookup can always be $O(1)$ for static sets; see the 2002 paper by Arne Andersson and Mikkel Thorup, Dynamic ordered sets with exponential search trees:

Firstly, we give the first deterministic polynomial-time (in n) algorithm for constructing a linear space static dictionary with $O(1)$ worst-case access cost (cf. perfect hashing). As mentioned earlier, a linear space data structure that supports member queries (neighbor queries are not supported) in constant time can be constructed at a worst-case cost $O(n^2 W)$ without division [30]. We show that the dependency of word size can be removed.
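
The quoted result is a deterministic construction; for intuition only, here is a minimal Python sketch of the classic randomized FKS two-level scheme (my illustration, not the paper's algorithm), which gives $O(1)$ worst-case lookups and linear space for a static set of distinct nonnegative integer keys:

```python
import random

P = (1 << 61) - 1  # a prime assumed larger than any key

def make_hash(m):
    # One random member of the universal family h(x) = ((a*x + b) mod P) mod m.
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    # First level: retry until the squared bucket sizes sum to O(n).
    while True:
        h1 = make_hash(n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Second level: a collision-free table of size |bucket|^2 per bucket.
    second = []
    for b in buckets:
        m = max(len(b) ** 2, 1)
        while True:
            h2 = make_hash(m)
            table = [None] * m
            ok = True
            for k in b:
                i = h2(k)
                if table[i] is not None:
                    ok = False
                    break
                table[i] = k
            if ok:
                break
        second.append((h2, table))
    return h1, second

def lookup(fks, key):
    # Always exactly two hash evaluations and two array probes: O(1) worst case.
    h1, second = fks
    h2, table = second[h1(key)]
    return table[h2(key)] == key

keys = [3, 17, 92, 1001, 4096, 65537]
fks = build_fks(keys)
assert all(lookup(fks, k) for k in keys) and not lookup(fks, 5)
```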

In the general case, Andersson et al. provide an algorithm for hash-indexed data structures that supports lookups and updates in $O(\sqrt{\log n / \log \log n})$ time. Furthermore, they prove that this bound is optimal, so we know exactly how close we can get to $O(1)$ in the general case.

I am not a data structures expert, but the usual theoretical approach to hashing is that one defines a family of functions (e.g., $h_{a,b}(x) = ax + b \mod p$) and then considers the behavior on a worst case input on a randomly chosen member of the family, where the adversary doesn't know the random choice in advance. This is similar to how randomized algorithms are analyzed as well: the expectation is taken over the algorithm's choices, not the input distribution.
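
For concreteness, a minimal Python sketch of that family, with the random choice of $(a, b)$ made when the table is created (the prime and names are my own choices):

```python
import random

p = (1 << 61) - 1  # a prime assumed larger than any key

def pick_hash(num_buckets):
    # Randomly choose one member h_{a,b}(x) = ((a*x + b) mod p) mod num_buckets.
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % num_buckets

h = pick_hash(1024)
# For any two distinct keys x != y, Pr[h(x) == h(y)] is roughly 1/1024,
# where the probability is over the choice of (a, b), not over the keys.
print(h(42), h(43))
```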

In the past, according to a Usenix paper by Crosby and Wallach, common programming languages didn't do anything like this, leaving lots of web apps (and other servers) open to a DoS attack based on manufacturing collisions. (The paper is from 2003, but it suggests that Dan Bernstein had discovered the same idea quite a bit earlier.)

A quick Google search turns up claims that the state of the art has improved in some implementations and not in others.

Another aside is that in a high-bandwidth world, timing attacks make it not-so-hard to find collisions online (as opposed to offline as the Crosby-Wallach link suggests). I seem to remember that Daniel Golovin had results some years ago on data structures that aren't vulnerable to timing attacks, but I don't know if they are widely used.

The average-case analysis of hash tables is done under the usual assumption that the inputs are uniformly distributed, an assumption one makes due to Occam's razor.

If you have additional knowledge about the domain and the distribution of the keys, you can take the same average-case analysis, replace the uniform distribution with your distribution, and recalculate the expectations, at least in theory.

Of course, the difficulty stems from the fact that non-uniform average-case analyses are hard to do, and your "knowledge" may not be conveniently expressible as a distribution that can be used easily in such an analysis.

Obviously, the easiest thing to do is to run simulations: implement the hash tables and observe how they perform on your typical set of inputs.
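
For example, a minimal Python sketch of such a simulation (the key sample and bucket count below are placeholders) hashes a sample of your real keys and reports the longest chain versus the average load:

```python
from collections import Counter

def chain_length_stats(keys, num_buckets, hash_fn=hash):
    # Bucket a sample of real keys and compare worst chain to average load.
    counts = Counter(hash_fn(k) % num_buckets for k in keys)
    longest = max(counts.values(), default=0)
    average = len(keys) / num_buckets
    return longest, average

# Synthetic "typical" keys for illustration; substitute a sample of real data.
sample = [f"user-{i}" for i in range(10_000)]
longest, average = chain_length_stats(sample, num_buckets=4096)
print(f"longest chain: {longest}, average load: {average:.2f}")
```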

Permutations (of a fixed length), as a specific case of finite known sets: it is relatively easy to assign unique numbers to permutations, as in this paper. I have used this (in a somewhat less ghastly implementation) to map permutations of length $n$ into an array of size $n!$. But I could do this because I was eventually going to need every permutation; if you are only using a subset, you would need a function tailored to that subset or an efficient sparse array.
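
For illustration, here is the standard Lehmer-code ranking in Python (the paper linked above may use a different construction): it maps each permutation of 0..n-1 to a distinct index in 0..n!-1, which can be used to index an array of size $n!$:

```python
from itertools import permutations
from math import factorial

def perm_rank(perm):
    """Map a permutation of 0..n-1 to a unique index in 0..n!-1 (Lehmer code)."""
    n = len(perm)
    rank = 0
    for i, x in enumerate(perm):
        # Number of elements to the right of position i that are smaller than x.
        smaller = sum(1 for y in perm[i + 1:] if y < x)
        rank += smaller * factorial(n - 1 - i)
    return rank

# All 3! = 6 permutations of (0, 1, 2) receive distinct indices 0..5.
assert sorted(perm_rank(p) for p in permutations(range(3))) == list(range(6))
```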

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange