Efficiently selecting a random subset of size $m$ from a set of size $n$

https://cs.stackexchange.com/questions/129753

29-09-2020
|

Question

This is a cross post of my question here on math.se.

I have a list of $n$ items and would like to randomly select an $m$ set from it efficiently (in terms of time complexity). Also, I want all possible subsets to be selected with equal probability. The obvious solution is to pick a random integer from $1$ to $n$ and choose the corresponding element, then repeat $m$ times, not counting the event in which one chooses and already chosen element. This becomes increasingly inefficient as $m$ approaches $n$ so for $m>n/2$ it would make sense to instead to pick a $(n-m)$-set and return its compliment.

For values of $m$ close to $n/2$, a better solution I think would be to consider each of the $n$ elements and decide either to pick that element or discard it, each time updating the probability of picking or discarding depending on the number of elements chosen vs discarded prior. Specifically, the algorithm would go as follows (python):

def randomSubset(n,m):
  L = []
  for i in range(n):
    if uniform(0,1)<m/(n-i): L,m = L+[i],m-1
  return L

However I am concerned that this may not result in each subset being chosen with equal probability.

I have two questions. First, does this algorithm pick subsets with equal probability (if so, I'd like a proof that it does and if not I'd also like a proof that it doesn't). Second, more broadly I would like to know what good solutions exist to this problem. Clearly if $m<<n$ then the first method is better than the second however at some point the second method (if it does in fact work) is better than the first. Moreover, an entirely different approach may be best in general.

Solution

The probability that the element $1$ belongs to a random $m$-subset of an $n$-element set is $m/n$. Therefore you should include $1$ in your subset with probability $m/n$.

If you put $1$ in your subset, then you are left with choosing an $(m-1)$-subset of an $(n-1)$-element set.

If you didn't put $1$ in your subset, then you are left with choosing an $m$-subset of an $(n-1)$-element set.

This means that you have to slightly update your algorithm, replacing $m$ with $m-|L|$.

The resulting algorithm is somewhat similar to reservoir sampling.

A third approach, with some similarities, is generating a random permutation of $1,\ldots,n$ and selecting the first $m$ entries.

The downside of all of these approaches are that they run in time $\Theta(n)$, whereas for $m \ll \sqrt{n}$, your first algorithm runs in (expected) time $\tilde\Theta(m)$.

We can improve on the $\Theta(n)$ running time as follows. We will generate a random ordered $m$-subset given $m$ indices $i_1,\ldots,i_m$, where $i_j \in \{1,\ldots,n-(j-1)\}$. The $j$'th element in the subset will be the $i_j$'th smallest number in $\{1,\ldots,n\}$ out of the numbers not already chosen.

In order to complete the description of the algorithm, we need to solve the following problem: given $S \subseteq \{1,\ldots,n\}$ and $i$, find the $i$'th smallest element in $\overline{S}$. We can assume that $S$ is stored in a structure (such as a self-balancing binary tree) which can efficiently answer the following type of query: given $x$, how many elements in $S$ are smaller than $x$. We can then find the $i$'th smallest number in $\overline{S}$ using binary search.

Overall, this algorithm runs in $\tilde\Theta(m)$ for all values of $m$, where the tilde hides factors logarithmic in $n$. (When $m \ll \sqrt{n}$ we can use your first approach, thus getting rid of this dependence on $n$.)

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange