Question

For the set disjointness problem in the 2-party model of communication complexity, Alice is given an input $X$ and Bob is given an input $Y$, where $X$ and $Y$ are $n$-bit strings (sampled from some distribution) interpreted as subsets of $[n]$: $X_i = 1$ means that $i$ belongs to the subset represented by $X$. The goal is for Alice and Bob to decide whether the subsets represented by $X$ and $Y$ are disjoint using as little communication as possible. It is known that $\Omega(n)$ bits are necessary in the worst case for randomized protocols.

I came across this draft of the textbook *Communication Complexity* by Anup Rao and Amir Yehudayoff, where Exercise 6.8 mentions that 2-party set disjointness can be solved with an expected number of $O(n^{2/3}\log^2 n)$ bits if the inputs of Alice and Bob are sampled independently.

Consider the following protocol. If there is a coordinate $j \in [n]$ such that $H(X_j)$ and $H(Y_j)$ are both at least $\epsilon$, then Alice and Bob communicate $X_j, Y_j$. They condition on the values that they see and repeat this step until no such coordinate can be found. At this point, Alice and Bob use Shannon's coding theorem to encode $X$, $Y$. Show how to set $\epsilon$ so that the expected communication can be bounded by $O(n^{2/3} \log^2 n)$. Hint: use the fact that $H(X_j) \ge \epsilon$ implies that $\Pr[X_j = 1] \ge \Omega(\epsilon/\log(1/\epsilon))$.

I suppose the idea is to first communicate all indices where $X$ and $Y$ have large entropy, and then use the fact that the remaining indices must have small entropy. However, the details of the protocol, and where the independence of $X$ and $Y$ comes in, aren't clear to me.

Solution

During the first phase of the protocol, the players zoom in on a high-entropy coordinate $j$ and exchange $X_j, Y_j$, so each round costs $O(1)$ bits. A round where $X_j = Y_j = 1$ certifies that the sets intersect, so the protocol can stop there. By the hint, $H(X_j) \ge \epsilon$ implies $\Pr[X_j = 1] \ge \Omega(\epsilon/\log(1/\epsilon))$, and likewise for $Y_j$; since $X$ and $Y$ are independent, each round stops with probability $\Pr[X_j = 1] \cdot \Pr[Y_j = 1] \ge \Omega(\epsilon^2/\log^2(1/\epsilon))$ (this is exactly where independence is used). Hence this phase uses at most $O(\log^2(1/\epsilon)/\epsilon^2)$ bits in expectation. If the first phase ends without finding an intersection, every remaining coordinate satisfies $H(X_j) < \epsilon$ or $H(Y_j) < \epsilon$, so the second phase, which encodes the low-entropy parts of $X$ and $Y$ via Shannon's coding theorem, uses $O(n\epsilon)$ bits in expectation. In total, we used this many bits in expectation: $$ O\big(\log^2(1/\epsilon)/\epsilon^2 + \epsilon n\big). $$ Choose $\epsilon = n^{-1/3}$ to obtain the stated bound.
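Explicitly, with $\epsilon = n^{-1/3}$ the two terms balance: $$ \frac{\log^2(1/\epsilon)}{\epsilon^2} + \epsilon n \;=\; \frac{n^{2/3}}{9}\log^2 n + n^{2/3} \;=\; O\big(n^{2/3}\log^2 n\big); $$ a smaller $\epsilon$ inflates the first term and a larger one the second.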

Without independence, we don't have any control over the stopping probability at each step of the first phase. For example, it could be that $X$ is sampled uniformly at random and $Y$ is the bitwise negation of $X$. Then every coordinate has $H(X_j) = H(Y_j) = 1$, yet $X_j = Y_j = 1$ never occurs, so the first phase never stops early and the protocol always uses $2n$ bits.
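To make the protocol concrete, here is a minimal simulation sketch. It is not the textbook's protocol verbatim: it assumes each input is a product distribution with known marginals `p[j]`, `q[j]` (so conditioning on revealed coordinates does not affect the others, whereas the general protocol would recompute conditional entropies after each reveal), and it charges phase 2 the idealized Shannon cost, namely the sum of the remaining marginal entropies, instead of running an actual encoder. All names here are mine, not from the book.

```python
import math
import random

def h2(p):
    """Binary entropy in bits; h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def simulate(p, q, eps, rng):
    """One run of the two-phase protocol on independent inputs.

    p[j] = Pr[X_j = 1] and q[j] = Pr[Y_j = 1]; X and Y are sampled
    independently. Returns (answer, bits communicated).
    """
    n = len(p)
    X = [int(rng.random() < p[j]) for j in range(n)]
    Y = [int(rng.random() < q[j]) for j in range(n)]
    bits = 0.0

    # Phase 1: reveal coordinates where both marginal entropies are >= eps.
    # Each round costs 2 bits, and a revealed pair X_j = Y_j = 1 certifies
    # an intersection, so the protocol stops there.
    high = [j for j in range(n) if h2(p[j]) >= eps and h2(q[j]) >= eps]
    for j in high:
        bits += 2
        if X[j] == 1 and Y[j] == 1:
            return "not disjoint", bits

    # Phase 2: every remaining coordinate has H(X_j) < eps or H(Y_j) < eps.
    # The low-entropy owner of each coordinate Shannon-encodes those bits;
    # we charge the idealized expected cost sum_j H(.) + 1 per message.
    high_set = set(high)
    rest = [j for j in range(n) if j not in high_set]
    alice_low = [j for j in rest if h2(p[j]) < eps]
    bob_low = [j for j in rest if h2(p[j]) >= eps]  # then h2(q[j]) < eps
    bits += sum(h2(p[j]) for j in alice_low) + 1  # Alice -> Bob
    bits += sum(h2(q[j]) for j in bob_low) + 1    # Bob -> Alice
    disjoint = all(X[j] * Y[j] == 0 for j in rest)
    return ("disjoint" if disjoint else "not disjoint"), bits

if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    eps = n ** (-1.0 / 3.0)
    p = [rng.random() for _ in range(n)]
    q = [rng.random() for _ in range(n)]
    runs = [simulate(p, q, eps, rng)[1] for _ in range(200)]
    print(f"n = {n}, average bits: {sum(runs) / len(runs):.1f}")  # far below 2n
```

On independent inputs the first phase typically stops after a handful of rounds, so the printed average is far below the trivial $2n$. Note that the $Y = \neg X$ counterexample above cannot even be expressed in this simulation, which samples $X$ and $Y$ independently; that is another way of seeing where independence enters.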

As an aside, the upper bound for independent inputs can be improved to $O(\sqrt{n} \log n)$, which almost matches the $\Omega(\sqrt{n})$ lower bound (the gap might have been closed in the literature); see the lecture notes of Prahladh Harsha.

Licensed under: CC-BY-SA with attribution