Question

I am working on a project, part of which is to develop an unsupervised HMM trainer for POS tagging, which I now want to test for possible bugs.

I am using the Baum-Welch algorithm to train the model. The inputs are sequences of words (drawn from a corpus) and the outputs are sequences of hidden states from a set of states (s1, s2, ... sn). I am now done with the coding, but I am not sure if it is bug-free.

Can anyone suggest some debugging ideas? What should I check in the outputs? How can I check the accuracy of my algorithm?

Solution

Unsupervised POS tagging is a very interesting emerging research topic. If I understand correctly, you are actually asking how to evaluate your tagging accuracy, not how to debug the code. Evaluation is a known issue in unsupervised POS induction, because the induced states carry no labels and cannot be compared with gold tags directly. The short answer to your question is: get this annotated corpus from NLTK, map each of your states to the corpus tag it co-occurs with most often, and compute the percentage of tokens whose mapped tag matches the gold tag. This evaluation procedure is called many-to-one mapping.
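To make the procedure concrete, here is a minimal sketch of many-to-one evaluation. It assumes you already have two aligned, token-level lists: the state your HMM assigned to each token and the gold tag of the same token. The function name `many_to_one_accuracy` and the toy data are made up for illustration; they are not part of NLTK or any other library.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(predicted_states, gold_tags):
    """Map each induced state to its most frequent gold tag, then score.

    Both arguments are flat, token-aligned sequences of equal length.
    Returns (mapping, accuracy). Hypothetical helper for illustration.
    """
    if len(predicted_states) != len(gold_tags):
        raise ValueError("state and tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot evaluate on an empty sequence")

    # Count how often each state co-occurs with each gold tag.
    cooccurrence = defaultdict(Counter)
    for state, tag in zip(predicted_states, gold_tags):
        cooccurrence[state][tag] += 1

    # Many-to-one: several states may map to the same tag.
    mapping = {
        state: counts.most_common(1)[0][0]
        for state, counts in cooccurrence.items()
    }

    # Fraction of tokens whose mapped tag equals the gold tag.
    correct = sum(
        mapping[state] == tag
        for state, tag in zip(predicted_states, gold_tags)
    )
    return mapping, correct / len(gold_tags)


# Toy data: three induced states, six tokens.
states = [0, 1, 0, 2, 1, 0]
gold = ["DET", "NOUN", "DET", "VERB", "NOUN", "NOUN"]
mapping, accuracy = many_to_one_accuracy(states, gold)
print(mapping)
print(f"many-to-one accuracy: {accuracy:.3f}")
```

Keep in mind that this score can only go up as you add states, since every extra state gets its own best tag, so compare runs that use the same number of states.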

You should make yourself familiar with the literature, as it will answer your questions and more. Here are some places to start:

  • An early paper:

    Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 296–305.

  • A survey paper:

    Christos Christodoulopoulos, Sharon Goldwater and Mark Steedman. 2010. Two Decades of Unsupervised POS induction: How far have we come? In Proceedings of EMNLP 2010.

When you say "unsupervised", you should ask yourself whether you want to use only raw text, or whether you also want to use a tag dictionary, for example. There is published work on that setting, too.

Also, there is code available out there for the task.

Another place to ask about NLP is http://metaoptimize.com/qa.

If you have other questions, don't hesitate to ask.

Licensed under: CC-BY-SA with attribution