Question

I am looking for an approximation algorithm for the Maximum Coverage problem and a proof of its approximation ratio. As the approximation algorithm I use the greedy algorithm, which at each step chooses the set covering the largest number of new elements (the same as the one proposed in the Wikipedia article). However, it is not trivial to reach a result about the approximation ratio of this greedy approach.
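For concreteness, this is a minimal sketch of the greedy rule I have in mind (the function and variable names are my own, not from any reference):

```python
# A minimal sketch of the greedy rule described above (all names are mine).
from typing import List, Set


def greedy_max_coverage(sets: List[Set[int]], k: int) -> Set[int]:
    """Pick up to k sets, each time taking the one covering the most new elements."""
    covered: Set[int] = set()
    remaining = list(sets)
    for _ in range(min(k, len(remaining))):
        # The set contributing the most not-yet-covered elements.
        best = max(remaining, key=lambda s: len(s - covered))
        if not (best - covered):  # no set adds anything new; stop early
            break
        covered |= best
        remaining.remove(best)
    return covered


if __name__ == "__main__":
    sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
    print(greedy_max_coverage(sets, k=2))  # {1, 2, 3, 4, 5, 6}
```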

So I did some research and found these lecture notes, which describe some useful concepts for this problem and also prove a lemma. Let me restate what the lecture notes say before we get to the lemma.


Maximum Coverage problem: Given a collection of sets $U=\{S_{1},\dots,S_{m}\}$ and a number $k$, choose (at most) $k$ of the sets so that the size of their union is maximized, i.e. $\max_{I\subseteq \{1,\dots,m\},\,|I|\leq k}\vert \bigcup_{j\in I}S_{j}\vert$ for an index set $I$. The sets may have elements in common. Let $\mathbf{OPT}$ denote the value of the optimal solution, i.e. the maximum number of elements that can be covered by $k$ sets; let $a_{i}$ denote the number of newly covered elements in the $i$:th iteration of the greedy algorithm; let $b_{i}$ be the total number of covered elements after the $i$:th iteration, i.e. $b_{i}=\sum_{j=1}^{i}a_{j}$; and let $c_{i}$ be the number of elements still missing from $\mathbf{OPT}$ after the $i$:th iteration, i.e. $c_{i}=\mathbf{OPT}-b_{i}$. Moreover, $a_{0}=0$, $b_{0}=0$ and $c_{0}=\mathbf{OPT}$.
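To make the notation concrete, here is a tiny trace I worked out myself (the example and names are mine, not from the notes):

```python
# Tiny trace of a_i, b_i, c_i on a toy example of my own (not from the notes).
# U = {S1, S2, S3}, k = 2; picking S1 and S3 is optimal, so OPT = 6.
S1, S2, S3 = {1, 2, 3}, {3, 4}, {4, 5, 6}
OPT = len(S1 | S3)                  # 6

covered = set()
b, c = [0], [OPT]                   # b_0 = 0, c_0 = OPT
for S in (S1, S3):                  # the sets a greedy run would pick here
    a = len(S - covered)            # a_i: newly covered elements
    covered |= S
    b.append(b[-1] + a)             # b_i = b_{i-1} + a_i
    c.append(OPT - b[-1])           # c_i = OPT - b_i
print(b, c)                         # [0, 3, 6] [6, 3, 0]
```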

Lemma: The number of elements newly covered at the $(i+1)$:th iteration (i.e. $a_{i+1}$) is always at least $(1/k)$ times the number of still-uncovered elements after the $i$:th iteration, i.e. $a_{i+1}\geq c_{i}/k$.

Proof: The optimal solution covers $\mathbf{OPT}$ elements using $k$ sets. Hence, at each iteration there must be some set in $U$ that covers at least $(1/k)$ of the remaining uncovered elements, i.e., at least $c_{i}/k$ of them; otherwise it would be impossible for the optimal solution to cover $\mathbf{OPT}$ elements with $k$ sets. Since the approximation algorithm is greedy, i.e., it always chooses the set covering the maximum number of uncovered elements, the set chosen at each iteration covers at least $(1/k)$ of the remaining uncovered elements. That is, $a_{i+1}\geq c_{i}/k$.


I do not understand this proof: why must such sets exist? And why is the size of the set important? Is it not the number of new elements the set contributes that matters? It seems arbitrary to me to choose $(1/k)\cdot c_{i}$ as a lower bound for $a_{i+1}$. I am also confused about how the optimal solution is related to the greedy solution: what is the rationale behind the statement that if this were not true, it would have been impossible for the optimal solution to cover $\mathbf{OPT}$ elements in $k$ iterations? Since I am having trouble understanding this at all, is there another way to explain this lemma?


Solution

Let $S_{O_1},\ldots,S_{O_k}$ be an optimal solution, and let $O = S_{O_1} \cup \cdots \cup S_{O_k}$.

At the $i$'th iteration, suppose that the elements covered by the sets chosen so far by the algorithm form the set $T$. Let $S'_{O_j} = S_{O_j} \setminus T$ and $O' = O \setminus T$; that is, we have removed from the optimal solution all elements already covered. Since $$ |O'| = |S'_{O_1} \cup \cdots \cup S'_{O_k}| \leq |S'_{O_1}| + \cdots + |S'_{O_k}| \leq k \max_{j=1}^k |S'_{O_j}|, $$ we see that among the sets $S_{O_1},\ldots,S_{O_k}$, at least one (the one attaining $\max_j |S'_{O_j}|$) covers at least $|O'|/k$ uncovered elements.
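If it helps, here is a small sanity check of this averaging step on random instances; the code and all names in it are my own, and the optimal solution is found by brute force:

```python
# Sanity check of the averaging step on random instances (illustrative code):
# at every greedy iteration, |O'| <= k * max_j |S'_{O_j}|, so some optimal set
# covers at least |O'|/k new elements, and the greedy choice covers at least
# as many.
import random
from itertools import combinations


def check_instance(m=8, n=15, k=3, seed=0):
    rng = random.Random(seed)
    sets = [frozenset(rng.sample(range(n), rng.randint(1, 6))) for _ in range(m)]

    # Brute-force an optimal choice of k sets and its union O.
    opt_sets = max(combinations(sets, k), key=lambda c: len(frozenset().union(*c)))
    O = frozenset().union(*opt_sets)

    # Run greedy and check the inequality at every step.  (Re-picking an
    # already-chosen set contributes nothing and does not affect the check.)
    covered = set()
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))
        o_prime = O - covered                        # uncovered optimal elements
        best_opt_gain = max(len(s - covered) for s in opt_sets)
        assert best_opt_gain * k >= len(o_prime)     # |O'| <= k * max_j |S'_{O_j}|
        assert len(best - covered) >= best_opt_gain  # greedy is at least as good
        covered |= best
    return True


if __name__ == "__main__":
    print(all(check_instance(seed=s) for s in range(100)))  # True
```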

The greedy algorithm chooses a set covering at least as many uncovered elements. This allows us to obtain a lower bound on its approximation ratio, as follows. Let $T_i$ be the union of the sets chosen in the first $i$ steps of the greedy algorithm; the starting point is $T_0 = \emptyset$. At step $i$, the argument above shows that we cover at least $|O'|/k$ new elements, where now $O' = O \setminus T_{i-1}$. Since $|O'| = |O \setminus T_{i-1}| \geq |O| - |T_{i-1}|$, this shows that $$ |T_i| \geq |T_{i-1}| + (|O| - |T_{i-1}|)/k = \left(1 - \frac{1}{k}\right) |T_{i-1}| + \frac{|O|}{k}. $$ In particular, $$ \begin{align} &|T_1| \geq \frac{|O|}{k} \\ &|T_2| \geq \frac{|O|}{k} + \left(1 - \frac{1}{k}\right) \frac{|O|}{k} \\ &|T_3| \geq \frac{|O|}{k} + \left(1 - \frac{1}{k}\right) \frac{|O|}{k} + \left(1 - \frac{1}{k}\right)^2 \frac{|O|}{k} \end{align} $$ and so on. At the end of the algorithm we get a set $T_k$ satisfying $$ \begin{align} |T_k| &\geq \frac{|O|}{k} \left[1 + \left(1 - \frac{1}{k}\right) + \cdots + \left(1 - \frac{1}{k}\right)^{k-1}\right] \\ &= \frac{|O|}{k} \cdot \frac{1 - \left(1 - \frac{1}{k}\right)^k}{1 - \left(1 - \frac{1}{k}\right)} \\ &= \left(1 - \left(1 - \frac{1}{k}\right)^k\right) |O|. \end{align} $$

Now it's time for some calculus: $$ \ln \left(1 - \frac{1}{k}\right) = - \frac{1}{k} - \frac{1}{2k^2} - \cdots \leq -\frac{1}{k}, $$ and so $$ \left(1 - \frac{1}{k}\right)^k = \exp\left[k\ln \left(1 - \frac{1}{k}\right)\right] \leq \exp (-1) = 1/e, $$ implying that $$ |T_k| \geq \left(1 - \frac{1}{e}\right) |O|. $$

In other words, the greedy algorithm gives a $(1-1/e)$-approximation. It turns out that $1-1/e$ is tight for the greedy algorithm, and moreover, $1-1/e$ cannot be improved upon by any polynomial-time algorithm unless $\mathsf{P} = \mathsf{NP}$ (and unconditionally in the value oracle model).
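For completeness, here is a short numeric illustration (again my own sketch, with the optimum brute-forced on tiny instances) of the inequality $(1-1/k)^k \le 1/e$ and of the resulting guarantee $|T_k| \ge \left(1-\left(1-\frac{1}{k}\right)^k\right)|O|$:

```python
# Numeric illustration (illustrative, self-contained):
# (1 - 1/k)^k <= 1/e for every k >= 1, and on small random instances the greedy
# coverage is never below (1 - (1 - 1/k)^k) * |O|, where O is brute-forced.
import math
import random
from itertools import combinations


def greedy_cover_size(sets, k):
    covered = set()
    for _ in range(k):
        covered |= max(sets, key=lambda s: len(s - covered))
    return len(covered)


def opt_cover_size(sets, k):
    return max(len(frozenset().union(*c)) for c in combinations(sets, k))


if __name__ == "__main__":
    # The analytic inequality (1 - 1/k)^k <= 1/e.
    assert all((1 - 1/k) ** k <= 1 / math.e + 1e-12 for k in range(1, 1000))

    rng = random.Random(1)
    for _ in range(200):
        k = rng.randint(1, 4)
        sets = [frozenset(rng.sample(range(12), rng.randint(1, 5))) for _ in range(7)]
        opt = opt_cover_size(sets, k)
        bound = (1 - (1 - 1/k) ** k) * opt
        assert greedy_cover_size(sets, k) >= bound - 1e-9
    print("all checks passed")
```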

Licensed under: CC-BY-SA with attribution