Probability of selecting a particular set, by sampling without replacement from a categorical distribution
-
29-09-2020 - |
Question
Suppose I have a categorical distribution on items $1,\dots,n$, that assigns probability $p_i$ to item $i$. I now repeatedly sample from this distribution, until I have obtained $k$ unique objects. I'd like to compute the probability that the set of objects obtained is exactly $\{1,\dots,k\}$.
Is there an efficient way to compute this probability, given $p_1,\dots,p_n$ and $k$?
I can see that the probability has the form
$$p = \sum_\sigma \prod_{i=1}^k {p_{\sigma(i)} \over (1-p_{\sigma(1)}) \cdots (1-p_{\sigma(1)}-\dots-p_{\sigma(i-1)})},$$ where the sum is over all permutations $\sigma \in S_k$ on $\{1,\dots,k\}$. (Here $\sigma$ represents the order in which the items $1,\dots,k$ are selected.) However, this formula for the probability involves $k!$ terms, so computing the probability in this way would take time exponential in $k$. Is there a more efficient way to compute it?
Of course, without loss of generality we can assume $n=k+1$.
Solution
For each $\Sigma \subseteq [k+1]$, you can compute the probability $q(\Sigma)$ that the first $|\Sigma|$ elements to appear are $\Sigma$ using the following recurrence: $q(\emptyset) = 1$ and when $\Sigma \neq \emptyset$, $$ q(\Sigma) = \sum_{\sigma \in \Sigma} q(\Sigma-\sigma) \frac{p_\sigma}{p_\sigma + \sum_{\tau \notin \Sigma} p_\tau}. $$ You are interested in $q([k])$. The total computation time is $O(k2^k)$ (ignoring arithmetic), if you compute the sum in the denominator in tandem. Perhaps this could be improved to $O(2^k)$.