Question

I understand how a Hidden Markov Model is used in genomic sequences, such as finding a gene. But I don't understand how to come up with a particular Markov model. I mean, how many states should the model have? How many possible transitions? Should the model have a loop?

How would they know that their model is optimal?

Do they come up with, say, 10 different models, benchmark those 10 models, and publish the best one?


Solution

I'm familiar with three main approaches:

  1. A priori. You might know that there are four nucleotide bases to pick from, and so allow the HMM to have four states. Or you might know that English has roughly 44 phonemes, and so have 44 states for the hidden phoneme layer in a voice recognition model.

  2. Estimation. The number of states can often be estimated beforehand, perhaps by simple clustering on the observed features of the HMM (a sketch of this follows the list). If the HMM transition matrix is triangular (which is often the case in failure prediction), the number of states determines the shape of the distribution of the total time from the start state to the end state.

  3. Optimization. As you suggest, many models can be created and fit, and the best one selected by a criterion such as held-out likelihood, AIC, or BIC (a second sketch after the list shows this). Alternatively, the learning procedure itself can be adapted to add or discard states as needed.
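Here is a minimal sketch of the second approach, assuming scikit-learn and a synthetic one-dimensional observation sequence (the candidate range 2-8 and the toy data are illustration choices, not part of the original answer): cluster the observed features and let the silhouette score suggest a state count before any HMM is trained.

```python
# Sketch of approach 2: estimate a plausible state count by clustering the
# observed features before any HMM is fit.  Assumes scikit-learn; the data
# below is synthetic and the candidate range 2-8 is a placeholder.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy observations drawn from three well-separated regimes.
X = np.concatenate([
    rng.normal(0.0, 1.0, size=(200, 1)),
    rng.normal(5.0, 1.0, size=(200, 1)),
    rng.normal(10.0, 1.0, size=(200, 1)),
])

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the best silhouette score is a reasonable first guess for the
# number of hidden states; it can still be refined by model selection later.
print("silhouette per cluster count:", scores)
print("suggested number of states:", max(scores, key=scores.get))
```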
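And a sketch of the third approach, assuming the Python hmmlearn package: fit a GaussianHMM for several candidate state counts and keep the one with the lowest BIC. The parameter count inside bic() assumes diagonal covariances, and the data and candidate range are again placeholders rather than anything from the original answer.

```python
# Sketch of approach 3: fit several HMMs and select the state count by BIC.
# Assumes hmmlearn; X is a synthetic 1-D sequence standing in for real data.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Toy data: a sequence whose observations come from 3 hidden regimes.
means_true = [0.0, 5.0, 10.0]
states = rng.integers(0, 3, size=500)
X = rng.normal(loc=np.take(means_true, states), scale=1.0).reshape(-1, 1)

def bic(model, X):
    """BIC for a diagonal-covariance GaussianHMM (smaller is better)."""
    k, d = model.n_components, X.shape[1]
    # Free parameters: start probs, transition rows, means, diagonal variances.
    n_params = (k - 1) + k * (k - 1) + k * d + k * d
    log_likelihood = model.score(X)
    return -2.0 * log_likelihood + n_params * np.log(X.shape[0])

candidates = {}
for k in range(2, 7):
    model = hmm.GaussianHMM(n_components=k, covariance_type="diag",
                            n_iter=200, random_state=0)
    model.fit(X)
    candidates[k] = bic(model, X)

best_k = min(candidates, key=candidates.get)
print("BIC per state count:", candidates)
print("selected number of states:", best_k)
```

In practice the two sketches are often combined: clustering supplies a starting guess for the state count, and the information criterion refines it.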

OTHER TIPS

A further approach is to sample from models with a countably infinite number of states; the answer to 'how many' then emerges by averaging over the posterior samples.
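For intuition, here is a toy sketch (not the linked paper's sampler) of the stick-breaking construction that underlies such nonparametric priors: a draw over a countably infinite set of state weights still concentrates almost all of its mass on finitely many states, which is why a finite answer can emerge from the samples. The concentration parameter and the 99% threshold below are arbitrary illustration choices.

```python
# Toy sketch of the stick-breaking construction behind nonparametric priors
# such as the infinite HMM's: a countably infinite set of state weights is
# drawn, yet almost all probability mass lands on a handful of states.
# alpha and the 0.99 mass threshold are arbitrary illustration choices,
# not values from the linked paper.
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0                                # concentration parameter of the prior
betas = rng.beta(1.0, alpha, size=1000)    # truncation of the infinite sequence
remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
weights = betas * remaining                # stick-breaking weights, summing to ~1

# Number of states needed to cover 99% of the prior mass in this draw:
effective_states = int(np.searchsorted(np.cumsum(weights), 0.99) + 1)
print("states covering 99% of the mass:", effective_states)
```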

http://mlg.eng.cam.ac.uk/zoubin/papers/ihmm.pdf

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange