The EM Algorithm

In the world of model-based inference, unobserved parameters give rise to observed data. There are a number of ways of estimating these parameters; among the most widely used is the Expectation-Maximization (EM) algorithm.

The model

Let's consider the scenario in which there is one underlying distribution for the observed data. If $X$, our data, consists of $N$ i.i.d. observations, then estimating the parameters of the underlying distribution is (in general) straightforward. The likelihood of our data $X$, given the parameters $\theta$, is

$$L(\theta,X)=p(X|\theta )= \prod_{n=1}^{N} p(x_n|\theta)$$
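As a quick illustration, the i.i.d. log-likelihood can be computed directly. This is a sketch with simulated data; the distribution, parameter values, and function names are assumptions for illustration, not from the post:

```python
import numpy as np

# log-density of a normal distribution, written out by hand
def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# simulate N i.i.d. observations from N(2, 1)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)

# log L(theta; X) = sum_n log p(x_n | theta), evaluated at the true parameters
loglik = normal_logpdf(x, mu=2.0, sigma=1.0).sum()
```

Working with the log of the likelihood turns the product over $n$ into a sum, which is both numerically safer and easier to differentiate.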

But what if, instead of one distribution, each observation in $X$ is drawn from one of two normal distributions, depending on some unobserved indicator $Z$? We'll (arbitrarily) let $Z_n=1$ if the $n$-th sample came from one of the distributions, and $Z_n=0$ if it came from the other. Let $\theta_1$ and $\theta_0$ be the parameters of the respective distributions, and let $\pi$ be $p(Z_n=1)$.

The "Full-data" likelihood

If we happened to observe $Z$ along with $X$, then it would be straightforward to write down the likelihood of $X$ and $Z$ given $\theta_0$, $\theta_1$, and $\pi$.

$$p(x,z|\theta_1,\theta_0,\pi)=\prod_{n=1}^{N}\left[z_n\, p(x_n|\theta_1)\, \pi +(1-z_n)\, p(x_n|\theta_0)\,(1-\pi)\right]$$

In practice, one of the two terms will always be zero, because a particular $x_n$ comes from exactly one of the two distributions. To denote this, we'll let $\theta_{z_n}$ denote $\theta_1$ if $z_n=1$ and $\theta_0$ if $z_n=0$ (similarly, $\pi_{z_n}$ denotes $\pi$ if $z_n=1$ and $1-\pi$ if $z_n=0$). This means we can rewrite the likelihood as

$$p(x,z|\theta_1,\theta_0,\pi)=\prod_{n=1}^{N}p(x_n|\theta_{z_n}) \pi_{z_n}$$

The log-likelihood is then:
$$\sum_{n=1}^{N} \left[\log p(x_n|\theta_{z_n})+\log \pi_{z_n}\right]$$

To find the maximum likelihood estimate of $\theta_1$, we take the derivative of the log-likelihood, set it to zero, and solve for $\theta_1$ (technically, to be sure you're at a maximum, you also have to ensure that the second derivative is negative, but we'll ignore that step here).
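When $Z$ is observed, both the complete-data log-likelihood and the resulting maximum likelihood estimates are easy to compute. Here is a minimal sketch; the simulated data, component parameters, and variable names are assumptions for illustration:

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# simulate data where z_n is observed: z_n = 1 w.p. pi; component means 4 and 0
rng = np.random.default_rng(1)
N, pi = 1000, 0.3
z = rng.random(N) < pi
x = np.where(z, rng.normal(4.0, 1.0, N), rng.normal(0.0, 1.0, N))

# complete-data log-likelihood: sum_n [ log p(x_n | theta_{z_n}) + log pi_{z_n} ]
logpdf = np.where(z, normal_logpdf(x, 4.0, 1.0), normal_logpdf(x, 0.0, 1.0))
log_pi = np.where(z, np.log(pi), np.log(1 - pi))
complete_ll = (logpdf + log_pi).sum()

# setting the derivative to zero gives closed-form MLEs for the means:
# the average of the samples assigned to each component, and for pi the
# fraction of samples with z_n = 1
mu1_hat = x[z].mean()
mu0_hat = x[~z].mean()
pi_hat = z.mean()
```

This is exactly why the "full-data" case is easy: the indicator $z_n$ sorts each observation into its component, and each component's parameters can be estimated separately.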

Marginal Likelihood

Since we don't observe $Z$, the full-data likelihood isn't directly useful to us. Another option would be to marginalize over possible configurations of $Z$ to obtain $p(X|\theta_0,\theta_1,\pi)$:

$$p(X|\theta_0,\theta_1,\pi)=\sum_{z} p(X,z|\theta_1,\theta_0,\pi)=\sum_{z} \prod_{n=1}^{N} p(x_n|\theta_{z_n})\, \pi_{z_n} $$

Unfortunately, there are $2^N$ possible configurations of $Z$ to sum over, which is like, a lot (intractable for all but tiny $N$). To make things worse, the log doesn't pass through the summation over $Z$, so we can't work with the log of the marginal likelihood directly, and the raw products of probabilities would quickly run into underflow issues.
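To see the blow-up concretely, here's a brute-force evaluation of the marginal likelihood for a toy-sized $N$, enumerating all $2^N$ configurations of $Z$. The data and parameter values are assumptions for illustration:

```python
import itertools
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

rng = np.random.default_rng(2)
N, pi = 10, 0.5
x = rng.normal(0.0, 1.0, N)

# sum over all 2^N configurations of z: feasible only at toy scale
marginal, n_configs = 0.0, 0
for z in itertools.product([0, 1], repeat=N):
    z = np.array(z, dtype=bool)
    # one term of the marginal: prod_n p(x_n | theta_{z_n}) * pi_{z_n}
    per_point = np.where(z, pi * normal_pdf(x, 4.0, 1.0),
                         (1 - pi) * normal_pdf(x, 0.0, 1.0))
    marginal += per_point.prod()
    n_configs += 1
```

Even at $N=10$ this loop runs 1,024 times; at $N=100$ it would need roughly $10^{30}$ iterations.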

The EM Algorithm

The Expected Complete Log-Likelihood

The EM algorithm gets around this by maximizing the "expected complete data likelihood" instead of either the marginal or complete likelihood. To do that, we're going to replace each $z_n$ with its expected value given the data and the current parameters:

$$\mu_n = E[z_n|x_n] = p(z_n=1|x_n,\theta_1,\theta_0,\pi) = \frac{\pi\, p(x_n|\theta_1)}{\pi\, p(x_n|\theta_1) + (1-\pi)\, p(x_n|\theta_0)}$$

These take a value between 0 and 1 based on the probability that $x_n$ came from distribution 1.
Plugging $\mu_n$ into the complete log-likelihood gives us the expected complete log-likelihood:

$$E_{\mu}[\log p(X,Z|\theta_1,\theta_0,\pi)]=\sum_{n=1}^N \left[\mu_n \log( \pi\, p(x_n|\theta_1))+(1-\mu_n)\log((1-\pi)\,p(x_n|\theta_0))\right] $$
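The E-step, computing $\mu_n$ and the expected complete log-likelihood, can be sketched as follows. The simulated data and the "current" parameter values are assumptions for illustration:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# simulate two-component mixture data (assumed setup)
rng = np.random.default_rng(4)
N = 500
x = np.where(rng.random(N) < 0.3, rng.normal(4.0, 1.0, N),
             rng.normal(0.0, 1.0, N))

# current parameter estimates
pi, mu1, mu0 = 0.3, 4.0, 0.0

# E-step: mu_n = p(z_n = 1 | x_n, current parameters)
a = pi * normal_pdf(x, mu1, 1.0)          # pi * p(x_n | theta_1)
b = (1 - pi) * normal_pdf(x, mu0, 1.0)    # (1 - pi) * p(x_n | theta_0)
mu_n = a / (a + b)

# expected complete log-likelihood:
# sum_n [ mu_n log(pi p(x_n|theta_1)) + (1 - mu_n) log((1-pi) p(x_n|theta_0)) ]
ecll = (mu_n * np.log(a) + (1 - mu_n) * np.log(b)).sum()
```

Because the $\mu_n$ are fixed numbers rather than unknown indicators, this objective decomposes into a weighted sum of per-component terms that we can maximize in closed form.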

We can then find the maximum likelihood estimates of $\theta_0$, $\theta_1$, and $\pi$, recompute the expectation of $Z$, rinse and repeat until convergence. The guarantee that the EM algorithm monotonically increases the likelihood with successive iterations comes from an argument based on Jensen's inequality, which is beyond the scope of this post.
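Putting the pieces together, a full EM loop for this two-Gaussian mixture might look like the sketch below. All data, initial values, and the iteration count are assumptions for illustration:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# simulate mixture data: z_n = 1 w.p. 0.4, components N(5, 1) and N(0, 1)
rng = np.random.default_rng(3)
N, true_pi = 2000, 0.4
z_true = rng.random(N) < true_pi
x = np.where(z_true, rng.normal(5.0, 1.0, N), rng.normal(0.0, 1.0, N))

# initial guesses (assumed starting values)
mu1, mu0, s1, s0, pi = 1.0, -1.0, 1.0, 1.0, 0.5

for _ in range(100):
    # E-step: responsibilities mu_n = p(z_n = 1 | x_n, current parameters)
    a = pi * normal_pdf(x, mu1, s1)
    b = (1 - pi) * normal_pdf(x, mu0, s0)
    r = a / (a + b)

    # M-step: weighted MLEs maximizing the expected complete log-likelihood
    pi = r.mean()
    mu1 = (r * x).sum() / r.sum()
    mu0 = ((1 - r) * x).sum() / (1 - r).sum()
    s1 = np.sqrt((r * (x - mu1)**2).sum() / r.sum())
    s0 = np.sqrt(((1 - r) * (x - mu0)**2).sum() / (1 - r).sum())
```

With well-separated components like these, the estimates settle near the true values after a handful of iterations; in practice you would stop when the change in the log-likelihood (or in the parameters) falls below a tolerance rather than running a fixed number of iterations.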