Question

I just watched a video in which the Viterbi algorithm was used to determine whether certain words in a sentence are intended as nouns, verbs, adjectives, etc. It used transition and emission probabilities: for example, the probability of the word 'Time' being used as a verb is known (emission), as is the probability of a noun leading onto a verb (transition).

http://www.youtube.com/watch?v=O_q82UMtjoM&feature=relmfu (The video)

How can I find a good dataset of transition and emission probabilities for this use-case?

Or even just a single example with all the probabilities displayed would do; I want to use realistic numbers in a demonstration.

Solution

Usually, implementations of Hidden Markov Models (HMMs) can not only run the Viterbi algorithm for tagging but also provide an algorithm for training the model (e.g. the Baum-Welch algorithm). The way to obtain the model (i.e. the set of transition and emission probabilities) is then to run the training algorithm on a suitable training corpus (such as the Penn Treebank).
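
To make the idea of "obtaining the model from a corpus" concrete, here is a minimal sketch in plain Python. It assumes the corpus is already annotated with tags, in which case the probabilities can be estimated by simple relative-frequency counting rather than Baum-Welch (which is the usual choice when no tags are available). The three sentences, the tag names, and the helper names `estimate_hmm` and `viterbi` are all made up for illustration; the resulting numbers are not realistic, and a real run would read a tagged corpus such as the Penn Treebank instead. There is no smoothing, so unseen words get probability zero.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus, invented for illustration only (NOT real Treebank data).
TAGGED_SENTENCES = [
    [("time", "NOUN"), ("flies", "VERB"), ("like", "ADP"), ("an", "DET"), ("arrow", "NOUN")],
    [("fruit", "NOUN"), ("flies", "NOUN"), ("like", "VERB"), ("a", "DET"), ("banana", "NOUN")],
    [("they", "PRON"), ("time", "VERB"), ("the", "DET"), ("race", "NOUN")],
]

START = "<s>"  # hypothetical marker for the sentence-initial state


def estimate_hmm(sentences):
    """Relative-frequency estimates of P(tag | previous tag) and P(word | tag)."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in sentences:
        prev = START
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {
            ctx: {key: n / sum(row.values()) for key, n in row.items()}
            for ctx, row in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


def viterbi(words, trans, emit):
    """Most probable tag sequence for `words` under the estimated model."""
    tags = list(emit)
    # Each column maps tag -> (probability of the best path ending in tag, backpointer).
    columns = [{
        t: (trans.get(START, {}).get(t, 0.0) * emit[t].get(words[0], 0.0), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (columns[-1][p][0] * trans.get(p, {}).get(t, 0.0) * emit[t].get(word, 0.0), p)
                for p in tags
            )
        columns.append(column)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: columns[-1][t][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


trans, emit = estimate_hmm(TAGGED_SENTENCES)
print(trans["NOUN"])  # estimated P(next tag | NOUN)
print(emit["VERB"])   # estimated P(word | VERB)
print(viterbi(["time", "flies", "like", "an", "arrow"], trans, emit))
```

On this toy corpus the last line prints `['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']`. For longer sentences a practical implementation would typically work with log probabilities to avoid underflow and would smooth the counts so that unseen words and tag pairs do not zero out every path.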

I am not aware of any freely available, off-the-shelf HMM-based implementation of a POS tagger that comes with a pre-trained model that can be readily inspected. However, an approach that is in many ways similar to an HMM is the Conditional Random Field (CRF). The CRFTagger created at Tohoku University, Japan, appears to come with a pre-trained model for English (see the file model/model.txt after downloading and unpacking). The file is human-readable, but to understand the details of the format you might have to contact the authors.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow