Question

I'm trying to use Tolkien's Silmarillion as a practice text for learning some NLP with NLTK.

I am having trouble getting started because I'm running into text encoding issues.

I'm using the TextBlob wrapper around NLTK because it's a lot easier. TextBlob is available at: https://github.com/sloria/TextBlob

The sentence that I can't parse is:

"But Húrin did not answer, and they sat beside the stone, and did not speak again".

I believe it's the special character in "Húrin" causing the issue.

My code:

from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

As this is just a for-fun project, I just want to be able to use this text, extract some attributes, and do some basic processing.

How can I convert this text to ASCII when I don't know what the initial encoding is? I tried to decode from UTF8, then re-encode into ASCII:

>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

But even that doesn't work. Any suggestions are appreciated -- I'm fine with losing the special characters, as long as it's done consistently across the document.

I'm using Python 2.6.8, and the required modules are installed correctly.


Solution

First, update TextBlob to the latest version (0.6.0 as of this writing), as there have been some Unicode fixes in recent updates. This can be done with

$ pip install -U textblob

Then, use a unicode literal, like so:

from text.blob import TextBlob
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin

This is verified on Python 2.7.5 with TextBlob 0.6.0, but it should work with Python 2.6.8 as well.
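
If you would still rather strip the accents, as the question asks, here is a minimal sketch using only the standard library. The helper name to_ascii is made up for illustration, and the UTF-8 default is an assumption: the 0xc3 byte in the traceback is consistent with UTF-8 but does not prove it. Note that the attempt in the question calls both decode and encode on the original byte string; in Python 2, encoding a byte string triggers an implicit ASCII decode first, which is where the error comes from.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(raw, encoding='utf-8'):
    """Return an ASCII-only text version of raw (hypothetical helper).

    raw may be a byte string or already-decoded text. The encoding
    default is an assumption about the source file.
    """
    # Decode bytes to text first, so no implicit ASCII decode happens later.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Ignoring non-ASCII code points drops the marks and keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Simulate bytes as they might be read from a UTF-8 file.
raw = u'But H\xfarin did not answer'.encode('utf-8')
print(to_ascii(raw))
# But Hurin did not answer
```

This folds accented letters consistently across the document, but characters with no decomposition (for example ð or þ) are dropped entirely, so it is worth checking the output if the text contains any.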

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow