Difficulty in understanding few steps in the proof: “The class $\mathscr{H}_{p,m}$ of hash functions is universal”

https://cs.stackexchange.com/questions/127691

29-09-2020
|

Question

I was going through the text Introduction to Algorithms by Cormen et. al. where I came across the following excerpt regarding the said proof and the steps where I felt difficulty are marked with $\dagger$ and $\dagger\dagger$ respectively.

Designing a universal class of hash functions

$p$ is a prime number large enough so that every possible key $k$ is in the range $0$ to $p — 1$, inclusive. Let $Z_p$ denote the set $\{0, 1,..., p — 1\}$, and let $Z_p^*$ denote the set $\{1, 2,..., p — 1\}$.Because of the assumption that the size of the universe of keys is greater than the number of slots $m$ in the hash table, we have $p > m$.

We now define the hash function $h_{a,b}$ for any $a \in Z_p^*$ and any $b \in Z_p$ :

$h_{a,b} = ((a.k + b) \mod p) \mod m $.

The family of all such hash functions:

$$\mathscr{H}_{p,m}=\{h_{a,b}: a \in Z_p^* , b \in Z_p\}$$

Theorem: The class $\mathscr{H}_{p,m}$ of hash functions is universal.

Proof:

Consider two distinct keys $k$ and $l$ from $Z_p$, so that $k \neq l$. For a given hash function $h_{a,b}$ we let

$$r = (ak + b) \mod p$$ ,

$$s = (al + b) \mod p $$.

We first note that $r\neq s$. Why? Observe that

$$r — s = a(k — l) (\mod p)$$ .

It follows that $г \neq s$ because $p$ is prime and both $a$ and $(k — l)$ are nonzero modulo $p$, and so their product must also be nonzero modulo $p$

Therefore, during the computation of any $h_{a,b}$ in $\mathscr{H}_{p,m}$, distinct inputs $k$ and $l$ map to distinct values $r$ and $s$ modulo $p$; there are no collisions yet at the "mod p level." Moreover, each of the possible $p(p — 1)$ choices for the pair $(a, b)$ with $а \neq 0$ yields a different resulting pair $(r, s)$ with $r \neq s$, since we can solve for $a$ and $b$ given $r$ and $s$$^\dagger$:

$$a = ((r — s)((k — l)^{-1}\mod p)) \mod p $$,

$$b = (r — ak) \mod p$$ ,

where $((k — l)^{-1} \mod p)$ denotes the unique multiplicative inverse, modulo p, of $k — l$. Since there are only $p(p — 1)$ possible pairs $(r, s)$ with $г \neq s$, there is a one-to-one correspondence between pairs $(a, b)$ with $a \neq 0$ and pairs $(r, s)$ with $r \neq s$. Thus, for any given pair of inputs $k$ and $l$, if we pick $(a, b)$ uniformly at random from $Z_p^* \times Z_p$, the resulting pair $(r, s)$ is equally likely to be any pair of distinct values modulo p.

It then follows that the probability that distinct keys $k$ and $l$ collide is equal to the probability that $r \equiv s (\mod m)$ when $r$ and $s$ are randomly chosen as distinct values modulo $p$. For a given value of $r$, of the $p — 1$ possible remaining values for $s$, the number of values $s$ such that $s \neq r$ and $s \equiv r (\mod m)$ is at most$^{\dagger\dagger}$

$$\lceil p/m \rceil - 1 < ((p + m - 1)/m) - 1$$ $$ =(p-1)/m$$.

The probability that $s$ collides with $r$ when reduced modulo $m$ is at most $((p - l)/m)/(p - 1) = 1/m$.

Therefore, for any pair of distinct values $k,l \in Z_p$,

$$Pr\{h_{a,b}(k)=h_{a,b}(l)\}\leq 1/m$$

so that $\mathscr{H}_{p,m}$ is indeed universal.

Doubts:

I could not understand the following statements in the proof:

$\dagger$ :Each of the possible $p(p — 1)$ choices for the pair $(a, b)$ with $а \neq 0$ yields a different resulting pair $(r, s)$ with $r \neq s$, since we can solve for $a$ and $b$ given $r$ and $s$

why , "we can solve for $a$ and $b$ given $r$ and $s$" $\implies$ "Each of the possible $p(p — 1)$ choices for the pair $(a, b)$ with $а \neq 0$ yields a different resulting pair $(r, s)$ with $г \neq s$"

$\dagger\dagger$ : For a given value of $r$, of the $p — 1$ possible remaining values for $s$, the number of values $s$ such that $s \neq r$ and $s \equiv r (\mod m)$ is at most $\lceil p/m \rceil - 1 $ .

How do we get the term $\lceil p/m \rceil - 1 $ ?

Solution

We want to show that if $k_1\neq k_2\in\mathbb{Z}_p$ then

$\Pr\limits_{(a,b)\in\mathbb{Z}_p^*\times\mathbb{Z}_p}[ak_1+b\equiv ak_2+b \pmod m]\le\frac{1}{m}$.

Where both addition and multiplication are preformed in $\mathbb{Z}_p$.

We start by showing that if $a\sim U(Z_p^*)$ and $b\sim U(Z_p)$ then for all $k_1\neq k_2\in \mathbb{Z}_p$, $(ak_1+b,ak_2+b)$ is uniformly distributed over $\{(x,y)\in\mathbb{Z}_p^2| x\neq y\}$ (i.e. $h(k_1)$ and $h(k_2)$ are jointly uniform over pairs with different entries, where the randomness is over the choice of $h$). This is immediate from the fact that for all $(c_1,c_2)\in\mathbb{Z}_p^2$ with $c_1\neq c_2$, the following system of linear equations:

$ \begin{align*} & ak_1+b=c_1 \\ & ak_2+b=c_2 \end{align*} $

has a unique solution over the variables $(a,b)\in Z_p^*\times\mathbb{Z}_p$. Subtracting the second equation from the first yields $a(k_1-k_2)=c_1-c_2$, since $k_1-k_2$ is nonzero we can multiply both sides by its inverse and obtain $a=(k_1-k_2)^{-1}(c_1-c_2)$. If $c_1\neq c_2$, then this is a nonzero solution for $a$, and we can extract $b$ from any of the two equations. Thus, for each pair $(c_1,c_2)$ with $c_1\neq c_2$ there are unique $(a,b)\in Z_p^*\times\mathbb{Z}_p$ such that $\big(h_{a,b}(k_1),h_{a,b}(k_2)\big)=(c_1,c_2)$. This settles your first question.

Now, divide $\mathbb{Z}_p$ into $\lceil p/m\rceil$ buckets, $b_1,...,b_{l=\lceil p/m\rceil}$ as follows: $b_1=\{0,1,...,m-1\}, b_2=\{m,m+1,...,2m-1\}$,...,$b_l=\{m\lfloor p/m\rfloor, m\lceil p/m\rceil+1,...,p-1\}$. Note that each bucket except the last is of size $m$, and no two elements in the same bucket are equivalent modulo $m$. We conclude that the number of different pairs in $\{(x,y)\in\mathbb{Z}_p^2| x\neq y\}$ that are equivalent modulo $m$ is at most $p(\lceil p/m\rceil-1)$, since after choosing the first element, you are left with $\lceil p/m\rceil-1$ elements to choose from (you must pick a different bucket and each bucket provides at most one candidate). Recall that $\big(h_{a,b}(k_1),h_{a,b}(k_2)\big)\sim U(\{(x,y)\in\mathbb{Z}_p^2| x\neq y\})$, so we can finally conclude:

$\Pr\limits_{(a,b)\in\mathbb{Z}_p^*\times\mathbb{Z}_p}\left[h_{a,b}(k_1)=h_{a,b}(k_2)\pmod m\right]=\frac{p(\lceil p/m\rceil-1)}{p(p-1)}\le \frac{1}{m}$

Note that allowing $a$ to take the value $0$ only makes our analysis easier, since now $\big(h(k_1),h(k_2)\big)$ is jointly uniform over $\mathbb{Z}_p^2$, but there is an additionaly probability of $\frac{1}{p}$ that $a=0$ and our hashes will be equivalent modulo $m$, so in this case we will have to settle for an $O(\frac{1}{m})$ bound on the collision probability.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange