How should I sample from a mixture distribution?

https://datascience.stackexchange.com/questions/81872

14-12-2020
|

Question

Let's say we have a mixture distribution, defined by density $f(x)= w_1 p_1(x) + w_2 p_2(x)$, where $w_i$ is a scalar weight. Furthermore, we have efficient methods to evaluate the pdf and cdf/icdf for the distribution $D_i$ corresponding to density $p_i$. I would like to sample from such a distribution.

The method I currently employ is an implementation of rejection sampling. I construct a proposal function $M*u(x)$, where $u \sim \text{Unif}(lb,ub)$ ($lb,ub$ are constructed such that at least 99% of the cdf of each $D_i$ is contained within, using icdf) and $M$ is constructed by finding the $\max_{x \in [lb,ub],i} p_i(x)$. Because such proposal function envelopes $f$, I am able to sample $f$ by choosing $x \in X \sim \text{Unif}(lb,ub) \times \text{Unif}(0,M)$ and rejecting if $\pi_2 x > f(x)$ [$\pi_2 x$ being the second coordinate of $x$].

However, doing this is quite slow. Not only is the proposal function inopportune (it is scaled uniform, which likely will lead to many rejections), but the construction of $M$ is very slow, as maximizing a function is a non-trivial task. Is there a more efficient way to sample such a distribution? I had considered icdf sampling, but constructing the icdf for $f$ seems non-trivially difficult. Is this impression incorrect? Or perhaps is there some other effective method? If it is helpful, I am implementing this in python and am currently using the scipy and pytorch libraries.

Solution

The mixture distribution can be obtained in the following way.

Let $f(x)=w_1p_1(x) + w_2p_2(x) + ... + w_np_n(x)$, where $p_i$ are density functions and $w_i>0$. Note that $f(x)$ is a density function if the sum of all weights is one.

Then, we use the following two-stage process.

Stage 1. Draw a random variable $X$ (selector, if I remember correctly), such that $P(X=i)=w_i$ for $i=1,2,...,n$.

Stage 2. Return a random variable drawn accordingly to $p_X$, where $X$ is the index obtained in the stage 1.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange