Question

I've been using a custom trained NLTK pos_tagger, and sometimes obvious verbs (ending in ING or ED) come back tagged as NN. How do I get the tagger to run all NN tokens through an additional RegexpTagger just to catch those extra verbs?

I've included some sample code for the secondary regex tagger.

from nltk.tag.sequential import RegexpTagger

rgt = RegexpTagger([
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # past tense verbs
])

Thanks


Solution

Here is a trigram tagger that backs off to a bigram tagger (which in turn backs off to a unigram tagger), with a regex tagger as the final back-off. The tagging is left to the regex tagger only when every other tagger fails to tag a token, at which point the rules defined below apply. Hope this helps you build a regex tagger with your own rules.

from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from nltk import ne_chunk
import nltk
import sys

def tri_gram():
    ## Trigram tagger trained on tagged sentences from the Brown corpus
    b_t_sents = brown.tagged_sents(categories='news')

    ## Regex tagger that serves as the final back-off of the n-gram chain
    default_tagger = nltk.RegexpTagger([
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'(The|the|A|a|An|an)$', 'AT'),   # articles
        (r'.*able$', 'JJ'),                # adjectives
        (r'.*ness$', 'NN'),                # nouns formed from adjectives
        (r'.*ly$', 'RB'),                  # adverbs
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # past tense verbs
        (r'.*', 'NN')                      # nouns (default)
    ])
    u_gram_tag = nltk.UnigramTagger(b_t_sents, backoff=default_tagger)
    b_gram_tag = nltk.BigramTagger(b_t_sents, backoff=u_gram_tag)
    t_gram_tag = nltk.TrigramTagger(b_t_sents, backoff=b_gram_tag)

    ## POS-tag (and NE-chunk) the text from the file given on the command line
    with open(sys.argv[1], 'r') as f_read:
        given_text = f_read.read()
    segmented_lines = nltk.sent_tokenize(given_text)
    for text in segmented_lines:
        words = word_tokenize(text)
        sent = t_gram_tag.tag(words)
        print(ne_chunk(sent))

tri_gram()
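
Alternatively, if you want to keep your already trained tagger untouched and only re-examine the words it tags as NN (as your question asks), a minimal post-processing sketch could look like the following. Here `my_tagger` stands in for your custom trained tagger and is an assumption, not part of the code above:

from nltk.tag.sequential import RegexpTagger
from nltk.tokenize import word_tokenize

# Regex rules for the verb forms that were slipping through as NN
rgt = RegexpTagger([
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # past tense verbs
])

def retag_nn(tagged_sent, regex_tagger=rgt):
    # Re-run only the NN tokens through the regex tagger.
    # RegexpTagger returns None when no pattern matches,
    # in which case the original NN tag is kept.
    result = []
    for word, tag in tagged_sent:
        if tag == 'NN':
            _, regex_tag = regex_tagger.tag([word])[0]
            if regex_tag is not None:
                tag = regex_tag
        result.append((word, tag))
    return result

# Usage (my_tagger is your custom trained tagger -- hypothetical here):
# tagged = my_tagger.tag(word_tokenize("The dog was running"))
# print(retag_nn(tagged))

The back-off chain above achieves a similar effect during tagging itself, but this post-processing approach lets you reuse a tagger that has already been trained without a regex back-off.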
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow