NLTK Spanish tagger results really bad?

https://stackoverflow.com/questions/23275398

python
nltk

09-07-2023

Question

I'm trying to create a tagger performance comparison for Spanish. My current script is a modified version of this one, although I tried another version with very similar results.

I'm using the cess_esp corpus and have created a Unigram, Bigram, Trigram and Brill tagger for this corpus using the tagged sentences for training each of the taggers.

I'm concerned about the performance of the Bigram and Trigram taggers: judging from the results, they seem not to be working AT ALL.

For instance, here is some output from my script:

*************** START TAGGING FOR LINE 6 ****************************************************************************************************************************************

Current line contents before tagging-> mejor ve a la sucursal de Juan Pablo II es la que menos gente tiene y no te tardas nada

Unigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', 'aq0cs0'), ('ve', 'vmip3s0'), ('a', 'sps00'), ('la', 'da0fs0'), ('sucursal', 'ncfs000'), ('de', 'sps00'), ('Juan', 'np0000p'), ('Pablo', None), ('II', None), ('es', 'vsip3s0'), ('la', 'da0fs0'), ('que', 'pr0cn000'), ('menos', 'rg'), ('gente', 'ncfs000'), ('tiene', 'vmip3s0'), ('y', 'cc'), ('no', 'rn'), ('te', 'pp2cs000'), ('tardas', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]

Trigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]
****************************************************************************************************************************************

*************** START TAGGING FOR LINE 7 ****************************************************************************************************************************************

Current line contents before tagging-> He levantado ya varios reporte pero no resuelven nada

Unigram tagger-> [('He', 'vaip1s0'), ('levantado', 'vmp00sm'), ('ya', 'rg'), ('varios', 'di0mp0'), ('reporte', 'vmsp1s0'), ('pero', 'cc'), ('no', 'rn'), ('resuelven', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

Trigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

*************** START TAGGING FOR LINE 8 ****************************************************************************************************************************************

Current line contents before tagging-> Es lamentable el servicio que brindan

Unigram tagger-> [('@ContactoBanamex', None), ('Es', 'vsip3s0'), ('lamentable', 'aq0cs0'), ('el', 'da0ms0'), ('servicio', 'ncms000'), ('que', 'pr0cn000'), ('brindan', None)]

Bigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

Trigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

The bigram and trigram taggers are trained as shown in the linked script, which is, by the way, the most straightforward way, as depicted in the NLTK book:

from nltk.corpus import cess_esp as cess
from nltk import BigramTagger as bt
from nltk import TrigramTagger as tt
cess_sents = cess.tagged_sents()
# Training BigramTagger.
bi_tag = bt(cess_sents)
# Training TrigramTagger.
tri_tag = tt(cess_sents)

Any idea if I'm missing something here? Aren't bigram and trigram taggers supposed to perform better than a unigram tagger? Should I always use a backoff tagger for bigram and trigram?

Thanks! Alejandro

Solution

The spaghetti-tagger (https://code.google.com/p/spaghetti-tagger/) was created as a simple tutorial on how to easily build scalable taggers using the NLTK corpus and tagging modules.

As the site states, it is not meant to be a state-of-the-art system. It is advisable to use a state-of-the-art tagger such as FreeLing (http://nlp.lsi.upc.edu/freeling/). I'll be happy to write a proper wrapper class in Python for FreeLing if you need it.

Back to your question: as Francis hinted (https://groups.google.com/forum/#!topic/nltk-users/FtqksaZLLvY), first go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html; you will then see that the backoff parameter might resolve your problem.
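A minimal sketch of such a backoff chain, assuming NLTK is installed. The two-sentence training set and the "UNK" default tag are made up for illustration; in the question's setup the training data would be cess.tagged_sents(). Without a backoff, an n-gram tagger that meets one unseen context returns None, and because that None becomes part of the context for the following words, the rest of the sentence tends to come out as None as well.

```python
from nltk.tag import BigramTagger, DefaultTagger, TrigramTagger, UnigramTagger

# Toy training data (hypothetical); replace with cess.tagged_sents() in practice.
train_sents = [
    [("el", "da0ms0"), ("servicio", "ncms000"), ("es", "vsip3s0"), ("lamentable", "aq0cs0")],
    [("la", "da0fs0"), ("sucursal", "ncfs000"), ("es", "vsip3s0"), ("mejor", "aq0cs0")],
]

tokens = ["@usuario", "el", "servicio", "es", "mejor"]

# No backoff: the unseen first token yields None, and every later
# context then contains a None tag that never occurred in training.
plain_bigram = BigramTagger(train_sents)
plain_result = plain_bigram.tag(tokens)

# Backoff chain: trigram -> bigram -> unigram -> default.
# "UNK" is an arbitrary placeholder tag, not part of the cess_esp tagset.
default = DefaultTagger("UNK")
unigram = UnigramTagger(train_sents, backoff=default)
bigram = BigramTagger(train_sents, backoff=unigram)
trigram = TrigramTagger(train_sents, backoff=bigram)
chained_result = trigram.tag(tokens)

print("no backoff:  ", plain_result)
print("with backoff:", chained_result)
```

With the chain in place, each tagger only has to handle the contexts its backoff gets wrong, so unknown words fall through to the default tag instead of wiping out the rest of the sentence.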

Disclaimer: I wrote spaghetti.py (https://spaghetti-tagger.googlecode.com/svn/spaghetti.py).

Other tips

Jacob Perkins's tutorial blog posts on POS tagging with NLTK are probably one of the better online resources, in my opinion. He starts by building a simple backoff ngram tagger, then looks at adding in regexes and affix-based tagging, then Brill tagging, and then full-on classifier-based tagging. The posts are clear and easy to follow, and include some useful performance comparisons.

Start here and follow through to Part 4: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
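As a rough idea of the regex step described above, here is a small sketch that puts an NLTK RegexpTagger at the end of a backoff chain, so that tokens such as Twitter handles get a tag instead of None. This is not taken from the blog posts: the patterns, the "HANDLE", "NUM" and "UNK" tags, and the tiny training set are all hypothetical and not part of the cess_esp tagset.

```python
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger

# Toy training data (hypothetical); a real run would use a tagged corpus.
train_sents = [
    [("el", "da0ms0"), ("servicio", "ncms000"), ("es", "vsip3s0"), ("lamentable", "aq0cs0")],
]

# Patterns are tried in order; the tags are illustrative placeholders.
patterns = [
    (r"^@\w+$", "HANDLE"),  # Twitter-style user handles
    (r"^[0-9]+$", "NUM"),   # plain digit sequences
]

default = DefaultTagger("UNK")
regexp = RegexpTagger(patterns, backoff=default)
# Words seen in training are handled by the unigram tagger;
# everything else falls through to the regex rules, then to the default.
unigram = UnigramTagger(train_sents, backoff=regexp)

tokens = ["@ContactoBanamex", "es", "lamentable", "el", "servicio", "2"]
regex_result = unigram.tag(tokens)
print(regex_result)
```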

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow