Question

I tried to build a part-of-speech (POS) tagger in NLTK, but I can't get it to work with more than one n-gram tagger at a time using backoff. I read that chaining several taggers gives better scores, but it won't work for me. I want it to try the largest context first, then fall back to smaller ones, down to a single word. I tried it like this:

import nltk
from nltk.corpus import brown

#sentence =  brown.sents(categories = "news")
trains = brown.tagged_sents(categories = "news")


from nltk import NgramTagger

fortest = ["hi", "how","are", "you"]

tagger = (nltk.NgramTagger (n, trains, backoff=n-1) for n in range (3))
print tagger.tag(fortest)

But it gives me this error: AttributeError: 'generator' object has no attribute 'tag'

So I tried it with a plain loop instead of a generator expression:

for n in range(3):
    tagger = nltk.NgramTagger(n, trains, backoff=n-1)

But then I get:

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tag/sequential.py", line 271, in __init__
  ContextTagger.__init__(self, model, backoff)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tag/sequential.py", line 121, in __init__
  SequentialBackoffTagger.__init__(self, backoff)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tag/sequential.py", line 46, in __init__
  self._taggers = [self] + backoff._taggers
AttributeError: 'int' object has no attribute '_taggers'

Any help would be appreciated, as I am new to Python.

Solution

Spaceghost is correct: you need to pass an actual NgramTagger object as the backoff argument, not just an int. A bare number is meaningless as backoff, because a newly created tagger has no way to find the previously created tagger with the smaller context.

This is why you get the AttributeError: 'int' object has no attribute '_taggers'. NLTK is looking for an object of a class inheriting from SequentialBackoffTagger.

Based on your range(3), I'm going to guess you actually wanted a trigram tagger with backoff to a bigram tagger, with backoff to a unigram tagger.

You can try something like this:

from nltk.corpus import brown
from nltk import NgramTagger

trains = brown.tagged_sents(categories="news")
tagger = None         # None here is okay since it's the default argument anyway
for n in range(1,4):  # start at unigrams (1) up to and including trigrams (3)
    tagger = NgramTagger(n, trains, backoff=tagger)

NOTE: No need to import nltk multiple times.

>>> tagger.tag('hi how are you'.split())
[('hi', None), ('how', 'WRB'), ('are', 'BER'), ('you', 'PPSS')]

Notice that we get None as the tag for words like "hi", since they don't occur in the training corpus (Brown's news category). If you want a fallback tag instead, start the chain with a default tagger by setting tagger before the for-loop, like this:

from nltk import DefaultTagger
tagger = DefaultTagger('NN')
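
If it helps to see why backoff has to be a tagger object, here is a rough pure-Python sketch of the idea. It is not NLTK's implementation: ToyNgramTagger, its tag_one method, and the one-sentence training set are all made up for illustration. Each tagger looks up (previous n-1 tags, current word) in its own table and, on a miss, asks its backoff tagger, which is exactly what an int cannot do.

```python
from collections import Counter, defaultdict


class ToyNgramTagger:
    """Hypothetical teaching sketch of an n-gram tagger with backoff (not NLTK code)."""

    def __init__(self, n, train, backoff=None):
        # The backoff must be another tagger (or None); an int has nothing to call.
        if backoff is not None and not hasattr(backoff, "tag_one"):
            raise TypeError("backoff must be a tagger object or None")
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, tags[:i])][tag] += 1
        # Keep only the most frequent tag seen for each context.
        self.model = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        # Context = the previous n-1 tags plus the current word.
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def tag_one(self, word, history):
        tag = self.model.get(self._context(word, history))
        if tag is None and self.backoff is not None:
            # Unknown context: hand the decision to the smaller-context tagger.
            tag = self.backoff.tag_one(word, history)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(word, history))
        return list(zip(words, history))


# Made-up training data: a single tagged sentence.
train = [[("how", "WRB"), ("are", "BER"), ("you", "PPSS")]]

tagger = None
for n in range(1, 4):  # unigram, then bigram, then trigram
    tagger = ToyNgramTagger(n, train, backoff=tagger)

result = tagger.tag("hi how are you".split())
print(result)
```

Here "hi" is unknown at every level and the chain ends in None, so it stays untagged, while "how" only resolves once the lookup has fallen all the way back to the unigram table.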

Other tips

The backoff parameter should point to another tagger, which is used when the current one cannot tag a word. You need to define a second tagger, or use the default one, and then change your code to pass it in. Something like this:

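# _POS_TAGGER is a private name, so it may not exist in every NLTK release
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
# n and trains are as in the question (for example, n = 2)
tagger = nltk.NgramTagger(n, trains, backoff=default_tagger)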
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow