If you misunderstand this aspect of linear probing, so do I. I agree that if the hash table is near full then performance degrades towards O(n) per insertion. See Don Knuth's 1963 analysis for all the details.
Parenthetically, I was amazed to see that the first analysis of this problem was actually done by the mathematician Ramanujan in 1913, whose results implied "that the total displacement of elements, i.e., the construction cost, for a linear probing hashing table that is full is about N^(3/2)." (see here)
In practice, however, I don't think slow insertion is the most important problem with a nearly-full hash table. The more important problem is that you eventually reach the point where you can't do another insertion at all!
Thus, any practical implementation of a hash table needs a strategy for resizing the table once it exceeds a given load factor, with the threshold chosen either from theory or from experiment. Experiments are particularly valuable here because the performance of linear probing is very sensitive to how evenly the hash function spreads items across the table without forming clusters, which in turn makes performance very dependent on the characteristics of the items being inserted.
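To make this concrete, here's a minimal sketch in Python of a linear-probing table that doubles its capacity once the load factor crosses a threshold. The 0.7 threshold, the doubling policy, and all the names here are illustrative assumptions, not values from any particular implementation, and deletion (which needs tombstones under linear probing) is omitted for brevity:

```python
class LinearProbingTable:
    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=8, max_load=0.7):
        # max_load = 0.7 is an arbitrary illustrative threshold; real
        # implementations tune this from theory or experiment.
        self._slots = [self._EMPTY] * capacity
        self._size = 0
        self._max_load = max_load

    def insert(self, key, value):
        # Resize *before* the table gets too full, so probe sequences
        # stay short and an insert can never run out of empty slots.
        if (self._size + 1) / len(self._slots) > self._max_load:
            self._resize(2 * len(self._slots))
        self._insert_into(self._slots, key, value)

    def _insert_into(self, slots, key, value):
        i = hash(key) % len(slots)
        # Linear probe: step to the next slot until we find an empty
        # slot or an existing entry with the same key.
        while slots[i] is not self._EMPTY and slots[i][0] != key:
            i = (i + 1) % len(slots)
        if slots[i] is self._EMPTY:
            self._size += 1
        slots[i] = (key, value)

    def _resize(self, new_capacity):
        old = self._slots
        self._slots = [self._EMPTY] * new_capacity
        self._size = 0
        for entry in old:
            if entry is not self._EMPTY:  # re-hash every existing entry
                self._insert_into(self._slots, entry[0], entry[1])

    def get(self, key):
        i = hash(key) % len(self._slots)
        while self._slots[i] is not self._EMPTY:
            if self._slots[i][0] == key:
                return self._slots[i][1]
            i = (i + 1) % len(self._slots)
        raise KeyError(key)
```

Note that the resize has to re-hash every entry into the new array rather than copy slots across, since each key's home position depends on the table size; that re-hashing cost is amortized across the inserts that preceded it.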