Hidden Markov Model Algorithms

Markov Models

Transition probabilities are written as:
$$a_{st} = P(x_i = t | x_{i-1} =s)$$

We can write the probability of any sequence as :
$$P(x) = P(x_L,x_{L-1},...,x_1) = P(x_L|x_{L-1},...,x_1)P(x_{L-1}|x_{L-2},...,x_1)...P(x_1) $$

We get this from the fact that $P(X,Y)=P(X|Y)P(Y)$, and we apply this property recursively. Because Markov models only depend on the previous state, we can vastly simplify this to:
$$P(x) = P(x_1) \prod_{i=2}^L a_{x_{i-1}x_i} $$
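
To make the factorization concrete, here is a minimal sketch in Python; the two-state chain and all of its probabilities are invented for illustration:

```python
import numpy as np

# Toy two-state Markov chain (states 0 and 1); a[s, t] is the transition
# probability a_st, and p1 holds the initial distribution P(x_1).
a = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p1 = np.array([0.5, 0.5])

def markov_prob(x):
    """P(x) = P(x_1) * prod_{i=2}^L a_{x_{i-1} x_i}."""
    p = p1[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= a[prev, cur]
    return p

print(markov_prob([0, 0, 1, 1]))  # 0.5 * 0.9 * 0.1 * 0.8 = 0.036
```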

Hidden Markov Models

Let us call the state sequence the path, $\pi$. The $i$th state in this path is called $\pi_i$, and is characterised by the parameters:

$$ a_{kl} = P( \pi_i= l |\pi_{i-1} = k) $$

To model emission from states, we define the emission probability:

$$ e_k(b) = P(x_i=b|\pi_i=k) $$

This is the probability that observed symbol $b$ is emitted when in the hidden state $k$. We can now write down the joint probability that a sequence $x$ is observed from a state sequence $\pi$:

$$ P(x,\pi) = a_{0\pi_1} \prod_{i=1}^L e_{\pi_i}(x_i)\,a_{\pi_i \pi_{i+1}} $$
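
As a concrete (and entirely made-up) example, here is a sketch of this joint probability in Python; for simplicity it drops the final transition $a_{\pi_L \pi_{L+1}}$ into the end state, which is equivalent to taking $a_{k0} = 1$ for every $k$:

```python
import numpy as np

# Toy HMM: two hidden states, three observable symbols. All numbers invented.
a0 = np.array([0.6, 0.4])        # a_{0k}: transition from the start state to k
a = np.array([[0.7, 0.3],        # a_{kl}: transition from state k to state l
              [0.4, 0.6]])
e = np.array([[0.5, 0.4, 0.1],   # e_k(b): probability of emitting b in state k
              [0.1, 0.3, 0.6]])

def joint_prob(x, pi):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) a_{pi_i pi_{i+1}}."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i], x[i]]
        if i + 1 < len(x):       # no end-state transition (a_{k0} = 1 assumed)
            p *= a[pi[i], pi[i + 1]]
    return p

print(joint_prob(x=[0, 2, 1], pi=[0, 1, 1]))
```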

The Viterbi Algorithm

How do we find the most probable path through the unobserved states? We can use the dynamic programming algorithm known as the Viterbi algorithm. Let $x = x_1 \ldots x_L$ be our sequence of observed (or emitted) symbols, and $\pi$ the path of unobserved states that emitted it. For the first position in the sequence, the probability that the unobserved state is $k$ and the observed symbol is $x_1$ is simply the probability that the "start state" transitioned to $k$ times the probability that $k$ emitted $x_1$:

$$ v_k(1) = P(\pi_1 = k)\, P(x_1|\pi_1=k) = a_{0k}\, e_k(x_1) $$

We can find the value of $k$ that maximizes this probability to get the most likely $k$ for the first position in the sequence. Put another way, $v_k(1)$ is the probability of the most likely length-1 path ending in state $k$.
The probability of the most likely path ending in state $l$ at the next position, $v_l(i+1)$, is given by the recurrence

$$ v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right) $$

The basis of the Viterbi algorithm is this (a code sketch follows the list):

  1. Calculate the probability that the first state is $k$, for all $k$.
  2. To find the state at position $i+1$, choose the previous state $k$ that maximizes the product of the probability of the best path up to position $i$ ending in $k$, the probability of the transition from $k$ to the current state, and the probability of emitting the current observed symbol from the current state.
  3. Choose the final state (and thus the sequence) to maximize the probability of the entire path, then trace back through the stored choices.
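
Here is a minimal sketch of the algorithm in Python, reusing the toy `a0`, `a`, and `e` matrices defined in the earlier sketch; for clarity it works with raw probabilities, though a real implementation would use log probabilities to avoid underflow:

```python
import numpy as np

def viterbi(x, a0, a, e):
    """Most probable state path for the observed symbol sequence x."""
    L, K = len(x), len(a0)
    v = np.zeros((L, K))               # v[i, k]: best path ending in state k
    ptr = np.zeros((L, K), dtype=int)  # back-pointers for the traceback

    v[0] = a0 * e[:, x[0]]             # v_k(1) = a_{0k} e_k(x_1)
    for i in range(1, L):
        for l in range(K):
            scores = v[i - 1] * a[:, l]               # v_k(i) a_{kl} over k
            ptr[i, l] = np.argmax(scores)
            v[i, l] = e[l, x[i]] * scores[ptr[i, l]]  # the recurrence above

    # Traceback: choose the best final state, then follow the back-pointers.
    path = [int(np.argmax(v[-1]))]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return list(reversed(path)), float(v[-1].max())

print(viterbi([0, 2, 1], a0, a, e))  # toy parameters from the earlier sketch
```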

The Forward Algorithm

Whereas the Viterbi algorithm gives us the most probable path through the system, the forward algorithm computes the probability of the observed sequence summed over all paths, by way of the forward variable $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$. It is structurally similar to the Viterbi algorithm, with the maximization replaced by a sum. The algorithm is defined by the recurrence relation:

$$f_l(i+1)=e_l(x_{i+1}) \sum_k f_k(i)a_{kl} $$
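
A matching sketch in Python, with the same notation and assumptions as the Viterbi sketch above (and again no underflow protection):

```python
import numpy as np

def forward(x, a0, a, e):
    """Forward variables (rows = sequence positions) and total P(x)."""
    L, K = len(x), len(a0)
    f = np.zeros((L, K))
    f[0] = a0 * e[:, x[0]]                  # f_k(1) = a_{0k} e_k(x_1)
    for i in range(1, L):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)  # the recurrence above
    return f, float(f[-1].sum())            # P(x), assuming a_{k0} = 1
```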

The Backward Algorithm and Posterior State Probabilities

Now that we've calculated the most probable path, as well as the total probability of the observed sequence, let's try to figure out the most probable state for a particular observation. Specifically, let's figure out how to calculate the probability that a particular state emitted a given observation, or $P(\pi_i = k|x)$. This is also known as the posterior probability of state $k$ at position $i$ when the emitted sequence is known.

We calculate the probability of producing the entire observed sequence with the $i$th observed symbol being emitted from state $k$:

$$ P(x, \pi_i = k) = P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L | x_1 \ldots x_i, \pi_i = k) = P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L | \pi_i = k) $$

The second equality follows from the Markov property: given $\pi_i$, the sequence after position $i$ is independent of the sequence before it. The first term we recognize as the forward variable $f_k(i)$; the second term we'll call $b_k(i)$.

The algorithm we're going to use is very similar to the forward algorithm, except we're going to start from the end, and work our way back (recursively).

  1. Initialise the vector $b(L)$ (length $K$) with $b_k(L) = a_{k0}$, the transition from each state $k$ to the end state.
  2. The values $b_k(i)$ are computed recursively, for $i = L-1$ down to $1$:
    $$ b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1) $$
  3. $P(x)$ can then be computed either as
    $$ P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1) $$
    or using the forward algorithm.
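
A sketch in Python, mirroring the forward implementation; as in the earlier sketches we take $a_{k0} = 1$ for every $k$, so the initialisation is a vector of ones:

```python
import numpy as np

def backward(x, a0, a, e):
    """Backward variables (rows = sequence positions) and total P(x)."""
    L, K = len(x), len(a0)
    b = np.zeros((L, K))
    b[-1] = 1.0                            # b_k(L) = a_{k0} = 1  (step 1)
    for i in range(L - 2, -1, -1):
        # b_k(i) = sum_l a_{kl} e_l(x_{i+1}) b_l(i+1)  (step 2)
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    px = float(np.sum(a0 * e[:, x[0]] * b[0]))  # P(x) as in step 3
    return b, px
```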

Combining the forward and backward variables, the posterior probability that the $i$th symbol of the observed sequence was produced by state $k$ is:

$$ P(\pi_i = k|x) = \frac{f_k(i)\, b_k(i)}{P(x)} $$
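
Putting the pieces together with the `forward` and `backward` sketches above (and the same toy parameters), posterior decoding is a one-liner:

```python
x = [0, 2, 1]
f, px = forward(x, a0, a, e)
b, _ = backward(x, a0, a, e)
posterior = f * b / px   # posterior[i, k] = P(pi_i = k | x); each row sums to 1
```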