How to correctly set up the Hunpos tagger in NLTK for POS tagging in English?

Question 1

I think I found a way to do it. For anyone with the same problem, I recommend downloading the source code, building it, and calling it in a way that differs from what the NLTK docs describe. Since it wasn't trivial for me, here it is step by step:

Under Unix:

1) Install Subversion (SVN) if you don't have it, and check out the project source code:

svn checkout http://hunpos.googlecode.com/svn/trunk/ hunpos-read-only

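This creates a hunpos-read-only directory, holding the contents of trunk, in the directory where you ran the command. The steps below call it the trunk directory; rename it to trunk if you want the paths below to match exactly.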

2) To build it successfully, you may need ocamlbuild, the build tool for OCaml (Objective Caml). sudo apt-get install ocaml-nox should take care of this.

3) cd into the trunk directory (where you checked out the Hunpos source code) and run

./build.sh build

4) At this point, you should have a binary file tagger.native in your trunk directory. Move the whole trunk directory into /usr/local/bin (you may need superuser privileges).

5) Download the en_wsj.model.gz file from the Hunpos project site, unzip it, and put the resulting en_wsj.model binary in /usr/local/bin as well.

6) Finally, in your Python script, create an instance of the HunposTagger class, passing the paths to the two files from the previous steps, something very close to:

>>> from nltk.tag.hunpos import HunposTagger
>>> ht = HunposTagger(path_to_model='/usr/local/bin/en_wsj.model',
...                   path_to_bin='/usr/local/bin/trunk/tagger.native')
>>> ht.tag('I want to go to San Francisco next year'.split())
[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('go', 'VB'), ('to', 'TO'),
 ('San', 'NNP'), ('Francisco', 'NNP'), ('next', 'JJ'), ('year', 'NN')]
>>> ht.close()

(Don't forget to close the tagger. If you'd rather not call close() yourself, you can use the with statement instead.)
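The with-statement variant mentioned above could look roughly like the sketch below. The helper name tag_with_hunpos and its tagger_cls parameter are my own additions for illustration (the latter only lets you swap in a stand-in class where hunpos is not installed); the paths are the ones assumed in the earlier steps.

```python
def tag_with_hunpos(tokens, model_path, bin_path, tagger_cls=None):
    """Tag a list of tokens and release the hunpos process afterwards."""
    if tagger_cls is None:
        # Imported lazily so this file still loads where NLTK is not installed.
        from nltk.tag.hunpos import HunposTagger as tagger_cls
    # Leaving the with block closes the tagger, even if tag() raises.
    with tagger_cls(path_to_model=model_path, path_to_bin=bin_path) as tagger:
        return tagger.tag(tokens)
```

You would call it as tag_with_hunpos('I want to go to San Francisco next year'.split(), '/usr/local/bin/en_wsj.model', '/usr/local/bin/trunk/tagger.native'), with no explicit close() needed.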

7) If you still have trouble, try setting the environment variable HUNPOS to /usr/local/bin/trunk. To do this, add the following line to your ~/.bashrc (or ~/.bash_profile on macOS):

export HUNPOS=/usr/local/bin/trunk

and restart your terminal.
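To check from Python that the variable from step 7 is actually visible and points at the built binary, something like the following sketch may help. The function name hunpos_setup_problems is hypothetical, and it assumes the binary is called tagger.native as in step 4; it only inspects the environment and the filesystem and does not call NLTK.

```python
import os


def hunpos_setup_problems(env=None, binary_name="tagger.native"):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    hunpos_dir = env.get("HUNPOS")
    if not hunpos_dir:
        return ["HUNPOS is not set in this environment"]
    problems = []
    binary = os.path.join(hunpos_dir, binary_name)
    if not os.path.isdir(hunpos_dir):
        problems.append("HUNPOS points to a missing directory: " + hunpos_dir)
    elif not os.path.isfile(binary):
        problems.append("no tagger binary at " + binary)
    elif not os.access(binary, os.X_OK):
        problems.append("tagger binary is not executable: " + binary)
    return problems
```

Running print(hunpos_setup_problems()) in a fresh terminal shows whether the export line was picked up.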

That worked for me, but if someone has a better, shorter, or simpler way to set this up, I'd love to hear it :)

Question 2

There are pre-compiled versions of hunpos that you can use immediately, without compiling from source, if you're working on Linux.

$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ tar xvfz hunpos-1.0-linux.tgz
$ gunzip en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux
$ python
>>> from nltk.tag import HunposTagger
>>> hpt = HunposTagger('hunpos-1.0-linux/en_wsj.model','hunpos-1.0-linux/hunpos-tag')
>>> hpt.tag('this is a foo bar sentence'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN'), ('sentence', 'NN')]
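If gunzip is not available on your machine, the decompression step can also be done from Python with the standard library. This is only a sketch; the helper name gunzip_file is made up for illustration, and the paths in the usage note are the ones from the commands above.

```python
import gzip
import shutil


def gunzip_file(src_path, dest_path):
    """Decompress the .gz file at src_path into dest_path and return dest_path."""
    # copyfileobj streams the data in chunks, so the model is never
    # loaded into memory all at once.
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```

For example, gunzip_file('en_wsj.model.gz', 'hunpos-1.0-linux/en_wsj.model') covers both the gunzip and the mv step. Unlike gunzip, it leaves the original .gz file in place.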