I think I found a way to do it. For anyone having the same problem, I recommend downloading the source code, building it, and calling it in a way different from what the NLTK docs describe. As it wasn't trivial for me, I'm putting it here step by step:
Under Unix:
1) Install Subversion (svn) if you don't have it and check out the project source code:
svn checkout http://hunpos.googlecode.com/svn/trunk/ hunpos-read-only
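This will create a hunpos-read-only directory in the directory where you ran the checkout. It holds the contents of the project's trunk, and the steps below refer to it as the trunk directory (rename it to trunk if you want to follow the paths literally).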
2) Then, to be able to build it successfully, you might need ocamlbuild, which automates compiling Objective Caml. Running sudo apt-get install ocaml-nox should handle this.
3) cd to the trunk directory (where you downloaded the Hunpos source code) and run:
./build.sh build
4) At this point, you should have a binary file tagger.native in your trunk directory. Put the whole trunk directory in /usr/local/bin (you may need to do this as superuser).
5) Download the en_wsj.model.gz file here, unzip it, and put the en_wsj.model binary in /usr/local/bin as well.
6) Finally, in your Python script, create an instance of the HunposTagger class, passing the paths to both files you created previously, something very close to:
>>> from nltk.tag.hunpos import HunposTagger
>>> ht = HunposTagger(path_to_model='/usr/local/bin/en_wsj.model',
...                   path_to_bin='/usr/local/bin/trunk/tagger.native')
>>> ht.tag('I want to go to San Francisco next year'.split())
[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('go', 'VB'), ('to', 'TO'),
('San', 'NNP'), ('Francisco', 'NNP'), ('next', 'JJ'), ('year', 'NN')]
>>> ht.close()
(Don't forget to close the tagger. If you'd rather not call close() yourself, you can use the with statement instead.)
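For the with statement variant, here is a minimal sketch. The names tag_sentence and tagger_cls are made up for illustration (tagger_cls exists only so the function can be tried with a stand-in class), and it assumes HunposTagger can be used as a context manager in your NLTK version and that the paths are the ones from the steps above.

```python
def tag_sentence(sentence, model_path, bin_path, tagger_cls=None):
    """Tag a whitespace-separated sentence and close the tagger afterwards.

    tagger_cls is a hypothetical hook for testing; when it is None the
    real HunposTagger from NLTK is used.
    """
    if tagger_cls is None:
        # Imported lazily so this module still loads where NLTK is missing.
        from nltk.tag.hunpos import HunposTagger
        tagger_cls = HunposTagger

    # Leaving the with block closes the tagger, even if tag() raises.
    with tagger_cls(path_to_model=model_path, path_to_bin=bin_path) as ht:
        return ht.tag(sentence.split())


# Example call, assuming the files were placed as in steps 4 and 5:
# tag_sentence('I want to go to San Francisco next year',
#              '/usr/local/bin/en_wsj.model',
#              '/usr/local/bin/trunk/tagger.native')
```

This way there is no ht.close() to forget, which matters because the tagger keeps a hunpos process running in the background until it is closed.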
7) If you still have trouble, try setting an environment variable HUNPOS to /usr/local/bin/trunk. To do this, add the following line to your ~/.bashrc (or ~/.bash_profile on macOS):
export HUNPOS=/usr/local/bin/trunk
and restart your terminal.
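If it still fails after that, a quick check from Python can tell you whether the files are where the script expects them. This is only a sketch: resolve_hunpos_bin and check_setup are helper names I made up, the default directory is the one used in the steps above, and reading HUNPOS here is my own convention for the check, not a claim about how NLTK itself looks up the binary.

```python
import os


def resolve_hunpos_bin(env=None, default_dir="/usr/local/bin/trunk"):
    """Return the expected path of tagger.native.

    Uses the HUNPOS environment variable as the directory when it is set,
    otherwise falls back to default_dir (an assumption from the steps above).
    """
    env = os.environ if env is None else env
    bin_dir = env.get("HUNPOS") or default_dir
    return os.path.join(bin_dir, "tagger.native")


def check_setup(bin_path, model_path):
    """Return a list of problems found; an empty list means the paths look usable."""
    problems = []
    if not os.path.isfile(bin_path):
        problems.append("binary not found: " + bin_path)
    elif not os.access(bin_path, os.X_OK):
        problems.append("binary is not executable: " + bin_path)
    if not os.path.isfile(model_path):
        problems.append("model not found: " + model_path)
    return problems


# Example, assuming the layout from steps 4 and 5:
# for problem in check_setup(resolve_hunpos_bin(), "/usr/local/bin/en_wsj.model"):
#     print(problem)
```

If check_setup returns nothing, you can pass the same two paths straight to HunposTagger as in step 6.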
That worked for me, but if someone has a better, shorter, or simpler way to set this up, I'd love to hear it :)