Question

I've been self-studying Expectation Maximization lately, and collected some simple examples along the way:

From here: There are three coins $c_0$, $c_1$ and $c_2$ with $p_0$, $p_1$ and $p_2$ the respective probabilities of landing on heads when tossed. Toss $c_0$. If the result is heads, toss $c_1$ three times, else toss $c_2$ three times. The observed data produced by $c_1$ and $c_2$ looks like this: HHH, TTT, HHH, TTT, HHH. The hidden data is the result of $c_0$. Estimate $p_0$, $p_1$ and $p_2$.

And from here: There are two coins $c_A$ and $c_B$ with $p_A$ and $p_B$ the respective probabilities of landing on heads when tossed. Each round, select one coin at random and toss it ten times; record the results. The observed data is the toss results produced by these two coins, but we don't know which coin was selected in any particular round. Estimate $p_A$ and $p_B$.

While I can follow the calculations, I can't relate the way they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, let alone to the original theory's E-Step.

So how exactly do these examples work?


Solution

(This answer uses the second link you gave.)

$\newcommand{\Like}{\text{L}}\newcommand{\E}{\text{E}}$Recall the definition of likelihood: $$\Like[\theta | X] = \Pr[X| \theta] = \sum_Z \Pr[X, Z | \theta]$$ where in our case $\theta = (\theta_A, \theta_B)$ are the parameters we want to estimate (the probabilities that coins A and B respectively land heads), $X = (X_1, \dotsc, X_5)$ are the outcomes of our experiments, each $X_i$ consisting of 10 flips, and $Z = (Z_1, \dotsc, Z_5)$ are the coins used in the experiments.

We want to find the maximum likelihood estimator $\hat{\theta}$. The Expectation-Maximization (EM) algorithm is one method for finding such a $\hat{\theta}$ (possibly only a local one). It works by computing a conditional expectation (the E-Step), which is then maximized over $\theta$ (the M-Step). The idea is that by finding a more likely (i.e. more probable) $\theta$ in each iteration, we continually increase the likelihood $\Pr[X|\theta]$. There are three things that need to be done to design an EM-based algorithm (a short skeleton of the resulting loop is sketched after the list):

  1. Construct the model
  2. Compute Conditional Expectation under the model (E-Step)
  3. Maximize our likelihood by updating our current estimate of $\theta$ (M-Step)
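To make the loop concrete in code, here is a minimal, generic Python skeleton of these three steps; `e_step` and `m_step` are hypothetical placeholders (passed in as functions) that the sketches further below instantiate for the two-coin example.

```python
# A generic sketch of the EM loop. Both steps are passed in as functions, so this
# is only the skeleton; the two-coin sketches below fill in concrete versions.
def em(e_step, m_step, data, theta_init, num_iters=10):
    theta = theta_init                       # start from an initial guess of the parameters
    for _ in range(num_iters):
        expectations = e_step(data, theta)   # E-Step: conditional expectations under the current theta
        theta = m_step(expectations)         # M-Step: new theta maximizing the expected log-likelihood
    return theta
```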

Construct the Model

Before we go further with EM we need to figure out what exactly it is we are computing. In the E-Step we are computing the expected value of $\log \Pr[X,Z|\theta]$, the complete-data log-likelihood. Why that quantity? Observe that $$\begin{align*} \log \Pr[X|\theta] &= \sum_{i=1}^5 \log\sum_{C\in \{A,B\}}\Pr[X_i, Z_i=C| \theta]\\ &=\sum_{i=1}^5 \log\sum_{C\in \{A,B\}} \Pr[Z_i=C | X_i, \theta] \cdot \frac{\Pr[X_i, Z_i=C| \theta]}{\Pr[Z_i=C | X_i, \theta]}\\ &\geq \sum_{i=1}^5 \sum_{C\in \{A,B\}} \Pr[Z_i=C | X_i, \theta] \cdot \log\frac{\Pr[X_i, Z_i=C| \theta]}{\Pr[Z_i=C | X_i, \theta]}. \end{align*}$$ The sum over $i$ is there because we have 5 experiments to account for, and the sum over $C$ because we don't know which coin was used in each. The inequality holds because $\log$ is concave, by Jensen's inequality. We need that lower bound because we cannot directly compute the arg max of the original expression; however, we can maximize the final lower bound over $\theta$ when the weights $\Pr[Z_i=C | X_i, \theta]$ are held fixed at the current estimate, and that is exactly what the M-Step will do.

Now what is $\Pr[Z_i=C|X_i,\theta]$? It is the probability that coin $C$ was the one used, given the outcome $X_i$ of experiment $i$ and given $\theta$. Using conditional probabilities we have $$\Pr[Z_i=C| X_i, \theta] = \frac{\Pr[X_i, Z_i = C|\theta]}{\Pr[X_i|\theta]}.$$

While we've made some progress, we're not done with the model just yet. What is the probability of choosing a given coin and then having it flip the sequence $X_i$? Letting $h_i = \#\text{heads in } X_i$, $$\Pr[X_i, Z_i = C| \theta] = \frac{1}{2} \cdot \theta_C^{h_i} (1 - \theta_C)^{10 - h_i},\ \text{ for } \ C \in \{A, B\}.$$ As for $\Pr[X_i|\theta]$, it is just the total probability over the two possibilities $Z_i=A$ and $Z_i=B$. Since $\Pr[Z_i = A] = \Pr[Z_i = B] = 1/2$ we have $$\Pr[X_i|\theta] = 1/2 \cdot (\Pr[X_i |Z_i = A, \theta] + \Pr[X_i |Z_i = B, \theta]).$$
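As a concrete illustration of the model, here is a small Python sketch of this posterior for a single experiment; `coin_posterior` is just an illustrative name, and `h` is the number of heads observed among the $n = 10$ flips.

```python
def coin_posterior(h, theta, n=10):
    """Pr[Z_i = A | X_i, theta] for an experiment with h heads out of n flips.

    theta is the pair (theta_A, theta_B). The 1/2 prior on the coin choice
    appears in both numerator and denominator, so it cancels.
    """
    theta_A, theta_B = theta
    joint_A = 0.5 * theta_A**h * (1 - theta_A)**(n - h)   # Pr[X_i, Z_i = A | theta]
    joint_B = 0.5 * theta_B**h * (1 - theta_B)**(n - h)   # Pr[X_i, Z_i = B | theta]
    return joint_A / (joint_A + joint_B)                  # divide by Pr[X_i | theta]
```

For example, `coin_posterior(5, (0.6, 0.5))` comes out to roughly 0.45, which is the first value computed in the E-Step below.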

E-Step

Okay... that wasn't so fun, but we can start doing some EM work now. The EM algorithm begins with some initial guess for $\theta$; in this example we have $\theta^0 = (0.6,0.5)$. We compute $$\Pr[Z_1=A|X_1,\theta] = \frac{1/2 \cdot (0.6^5 \cdot 0.4^5)}{1/2 \cdot ((0.6^5 \cdot 0.4^5) + (0.5^5 \cdot 0.5^5))} \approx 0.45.$$ This value lines up with what is in the paper. Now we can compute the expected number of heads in $X_1 = (H,T,T,T,H,H,T,H,T,H)$ that came from coin $A$, $$\E[\# \text{heads by coin }A | X_1, \theta] = h_1 \cdot \Pr[Z_1=A|X_1,\theta] = 5 \cdot 0.45 \approx 2.2.$$ Doing the same thing for coin $B$ we get $$\E[\# \text{heads by coin }B | X_1, \theta] = h_1 \cdot \Pr[Z_1=B|X_1,\theta] = 5 \cdot 0.55 \approx 2.8.$$ We can compute the same for the number of tails by substituting $10 - h_1$ for $h_1$. This continues for all other values of $X_i$ and $h_i$, $1 \leq i \leq 5$. Thanks to linearity of expectation we can then compute $$\E[\#\text{heads by coin } A|X ,\theta] = \sum_{i=1}^5 \E[\# \text{heads by coin }A | X_i, \theta].$$
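To see the whole E-Step in code, here is a self-contained Python sketch that accumulates the expected head and tail counts (the posterior helper from above is repeated so the block runs on its own). The per-experiment head counts `[5, 9, 8, 4, 7]` are my reading of the five sequences in the second link's figure, only the first of which ($h_1 = 5$) is written out above, so treat them as an assumption and substitute your own data if they differ.

```python
def coin_posterior(h, theta, n=10):
    # Pr[Z_i = A | X_i, theta]; the 1/2 prior on the coin choice cancels.
    theta_A, theta_B = theta
    joint_A = 0.5 * theta_A**h * (1 - theta_A)**(n - h)
    joint_B = 0.5 * theta_B**h * (1 - theta_B)**(n - h)
    return joint_A / (joint_A + joint_B)

theta0 = (0.6, 0.5)            # the initial guess theta^0
heads = [5, 9, 8, 4, 7]        # head counts per experiment, as read from the figure (an assumption)

p1 = coin_posterior(heads[0], theta0)
print(round(p1, 2))                               # 0.45 = Pr[Z_1 = A | X_1, theta]
print(round(5 * p1, 1), round(5 * (1 - p1), 1))   # about 2.2 expected heads for A, 2.8 for B

# Summing over all five experiments (linearity of expectation):
heads_A = sum(h * coin_posterior(h, theta0) for h in heads)
tails_A = sum((10 - h) * coin_posterior(h, theta0) for h in heads)
print(round(heads_A, 1), round(tails_A, 1))       # roughly 21.3 expected heads and 8.6 expected tails
```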

M-Step

With our expected values in hand, now comes the M-Step, where we choose a new $\theta$ that maximizes the lower bound above given these expected counts. This is done by simple normalization! $$\theta_A^1 = \frac{\E[\#\text{heads over } X \text{ by coin } A|X ,\theta]}{\E[\#\text{heads and tails over } X \text{ by coin } A|X ,\theta]} = \frac{21.3}{21.3 + 8.6} \approx 0.71.$$ Likewise for $B$. This process then begins again with the E-Step and $\theta^1$, and continues until the values for $\theta$ converge (or change by less than some allowable threshold). In this example we have 10 iterations and $\hat{\theta} = \theta^{10} = (0.8, 0.52)$. In each iteration the likelihood $\Pr[X|\theta]$ increases, thanks to the better estimate of $\theta$.
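Putting the E-Step and M-Step together, a self-contained sketch of the whole procedure could look like the following. It is only an illustration, not the paper's reference implementation, and it again assumes the head counts read off the figure; with 10 iterations it lands near $(0.80, 0.52)$, matching the values quoted above.

```python
def coin_posterior(h, theta, n=10):
    # Pr[Z_i = A | X_i, theta]; repeated from the sketches above so this block runs on its own.
    theta_A, theta_B = theta
    joint_A = 0.5 * theta_A**h * (1 - theta_A)**(n - h)
    joint_B = 0.5 * theta_B**h * (1 - theta_B)**(n - h)
    return joint_A / (joint_A + joint_B)

def e_step(heads, theta, n=10):
    # Expected (heads, tails) attributed to each coin, summed over the experiments.
    heads_A = tails_A = heads_B = tails_B = 0.0
    for h in heads:
        p_A = coin_posterior(h, theta, n)
        heads_A += h * p_A
        tails_A += (n - h) * p_A
        heads_B += h * (1 - p_A)
        tails_B += (n - h) * (1 - p_A)
    return (heads_A, tails_A), (heads_B, tails_B)

def m_step(expected_counts):
    # Normalize the expected counts to get the new estimates of theta_A and theta_B.
    (heads_A, tails_A), (heads_B, tails_B) = expected_counts
    return heads_A / (heads_A + tails_A), heads_B / (heads_B + tails_B)

def em_two_coins(heads, theta, num_iters=10):
    for _ in range(num_iters):
        theta = m_step(e_step(heads, theta))
    return theta

heads = [5, 9, 8, 4, 7]                  # head counts as read from the figure (an assumption)
print(em_two_coins(heads, (0.6, 0.5)))   # roughly (0.80, 0.52) after 10 iterations
```

Note that `e_step` and `m_step` here have exactly the shape of the generic skeleton sketched earlier.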

Now in this case the model was fairly simplistic. Things can get much more complicated pretty quickly; however, the EM algorithm always converges (the likelihood never decreases from one iteration to the next and is bounded above), and the estimator $\hat{\theta}$ it produces is in general only a local maximizer of the likelihood. To get around this we can simply restart the EM process with a different initialization. We can do this a fixed number of times and retain the best result (i.e., the one with the highest final likelihood).
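If it helps, here is a small self-contained Python sketch of that restart strategy (with a compact copy of the EM loop above and the same assumed head counts). Note that the two coin labels can come out swapped between runs, since swapping $A$ and $B$ gives the same likelihood.

```python
import math
import random

def em_two_coins(heads, theta, num_iters=10, n=10):
    # Compact copy of the EM sketch above, repeated so this block runs on its own.
    for _ in range(num_iters):
        heads_A = tails_A = heads_B = tails_B = 0.0
        for h in heads:
            a = 0.5 * theta[0]**h * (1 - theta[0])**(n - h)
            b = 0.5 * theta[1]**h * (1 - theta[1])**(n - h)
            p_A = a / (a + b)                       # Pr[Z_i = A | X_i, theta]
            heads_A += h * p_A
            tails_A += (n - h) * p_A
            heads_B += h * (1 - p_A)
            tails_B += (n - h) * (1 - p_A)
        theta = (heads_A / (heads_A + tails_A), heads_B / (heads_B + tails_B))
    return theta

def log_likelihood(heads, theta, n=10):
    # log Pr[X | theta], up to binomial coefficients that do not depend on theta.
    return sum(math.log(0.5 * (theta[0]**h * (1 - theta[0])**(n - h)
                               + theta[1]**h * (1 - theta[1])**(n - h)))
               for h in heads)

heads = [5, 9, 8, 4, 7]   # head counts as read from the figure (an assumption)
random.seed(0)
candidates = [em_two_coins(heads, (random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)))
              for _ in range(20)]
print(max(candidates, key=lambda theta: log_likelihood(heads, theta)))
```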

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange