Question

I want to calculate the probability of a sequence of characters occurring in a string. For example, given the string "test", I want to get P(test).

P(test) = p(t) * p(e|t) * p(s|te) * p(t|tes)

I have computed the bi-gram frequencies of more than 100k strings and derived their probabilities of occurrence. My question is: will simply multiplying the n-gram probabilities of a string give an accurate answer, or is there a better way to compute it?

Any help is highly appreciated.


Solution

Using bi-grams, your answer will be as accurate as you can get with a first-order Markov chain, in which each character depends only on the character before it. Under that assumption the product becomes P(test) ≈ p(t) * p(e|t) * p(s|e) * p(t|s). The results are surprisingly good for such a simple model, but of course you can do even better with more expressive models. For instance, in language modeling, Hidden Markov Models (HMMs) are very often used.
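As a rough illustration of the bi-gram approach, here is a minimal Python sketch. Several things in it are assumptions of mine rather than part of the question: the start marker `START`, the function names, the toy corpus, and the use of add-alpha smoothing so that unseen bi-grams do not drive the whole product to zero. It also sums log-probabilities instead of multiplying raw probabilities, which avoids floating-point underflow on long strings.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-string marker; assumed not to occur in the data


def train_bigram_counts(strings):
    """Count each character bi-gram and each context (preceding) character."""
    bigrams = Counter()
    contexts = Counter()
    for s in strings:
        padded = START + s
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def log_prob(s, bigrams, contexts, vocab_size, alpha=1.0):
    """Log of the bi-gram approximation to P(s), with add-alpha smoothing (an assumption)."""
    padded = START + s
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = contexts[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total


# Toy data standing in for the 100k strings mentioned in the question.
corpus = ["test", "tent", "text", "best"]
bigrams, contexts = train_bigram_counts(corpus)
vocab_size = len({c for s in corpus for c in s})

lp_test = log_prob("test", bigrams, contexts, vocab_size)
lp_odd = log_prob("ttss", bigrams, contexts, vocab_size)
print(lp_test, lp_odd)
```

On this toy corpus "test" scores higher than "ttss", since every bi-gram in "test" was seen during counting. The sketch has no end-of-string marker, so scores are only comparable between strings of the same length.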

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow