Question

I am trying to extract named entities from dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 dutch corpus. However, the parse method from the chunker is not detecting any named entities. Here is my code:

str = 'Christiane heeft een lam.'

tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')

str_tags = tagger.tag(nltk.word_tokenize(str))
print str_tags

str_chunks = chunker.parse(str_tags)
print str_chunks

And the output of this program:

[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be detected as a named entity. Any help?

Was it helpful?

Solution

The conll2002 corpus has both spanish and dutch text, so you should make sure to use the fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train. Training on both spanish and dutch will have poor results.

The default algorithm is a Tagger based Chunker, which does not work well on conll2002. Instead, use a classifier based chunker like NaiveBayes, so the full command might look like this (and I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top