Question

I want to calculate the probability of the characters occurring in a string. For example, given the string "test", I want to get P(test).

P(test) = p(t) * p(e|t) * p(s|te) * p(t|es)

I have calculated the bi-gram frequencies of more than 100k strings and the probabilities of their occurrence. My question is: will I get an accurate answer by just multiplying the probabilities of the n-grams in a string, or is there a better way of finding the same?

Any help is highly appreciated.


Solution

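Multiplying the conditional probabilities is the right approach: the product is exactly the probability that a Markov chain over characters assigns to the string, so your answer will be as accurate as that model allows. Note that the factorization in your question conditions each character on the two preceding ones, which is a second-order Markov chain and needs tri-gram counts; with bi-gram counts alone you get a first-order chain, P(test) = p(t) * p(e|t) * p(s|e) * p(t|s). The results are surprisingly good for such a simple model, but of course you can do even better with more expressive models. For instance, in language modeling, Hidden Markov Models (HMMs) are very often used.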
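As a rough illustration, here is a minimal sketch of a first-order (bi-gram) character model in Python. The function names, the `^` start marker (assumed not to occur in your data), and the add-one smoothing with `alpha` are my own assumptions, not something from your setup. Because of the start marker, the first factor is the probability of a character beginning a string rather than its overall frequency. The sketch sums log-probabilities instead of multiplying raw probabilities, since a long product of small numbers can underflow to zero in floating point.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-string marker; assumed absent from the data


def train_bigram_model(strings):
    """Count how often each character follows each context character."""
    context_counts = Counter()  # how often a character appears as a context
    bigram_counts = Counter()   # how often the pair (context, next) appears
    vocab = set()
    for s in strings:
        padded = START + s
        vocab.update(s)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def log_prob(s, model, alpha=1.0):
    """Log-probability of s under the bi-gram model with add-alpha smoothing."""
    context_counts, bigram_counts, vocab_size = model
    total = 0.0
    padded = START + s
    for prev, cur in zip(padded, padded[1:]):
        # Smoothing keeps unseen bi-grams from forcing the probability to zero.
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total
```

To compare strings, train once and compare the log-probabilities, for example `log_prob("test", model)` against `log_prob("tset", model)`; `math.exp` of the result recovers the plain probability when it is not too small.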

Licensed under: CC-BY-SA with attribution