Question

I'm trying to use Tolkien's Silmarillion as a practice text for learning some NLP with NLTK.

I am having trouble getting started because I'm running into text encoding issues.

I'm using the TextBlob wrapper around NLTK because it's a lot easier to work with. TextBlob is available at: https://github.com/sloria/TextBlob

The sentence that I can't parse is:

"But Húrin did not answer, and they sat beside the stone, and did not speak again".

I believe it's the special character (ú) in Húrin that is causing the issue.

My code:

from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

As this is just a for-fun project, I just want to be able to use this text, extract some attributes, and do some basic processing.

How can I convert this text to ASCII when I don't know what the initial encoding is? I tried to decode from UTF-8, then re-encode into ASCII:

>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

But even that doesn't work. Any suggestions are appreciated. I'm fine with losing the special characters, as long as it's done consistently across the document.

I'm using Python 2.6.8, with the required modules correctly installed.


Solution

First, update TextBlob to the latest version (0.6.0 as of this writing), as there have been some Unicode fixes in recent updates. This can be done with

$ pip install -U textblob

Then, use a unicode literal, like so:

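# -*- coding: utf-8 -*-
# The coding line above is needed if this is saved as a script,
# because the source contains a non-ASCII character.
from text.blob import TextBlob
# The u prefix makes this a unicode object rather than a byte string.
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin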

This is verified on Python 2.7.5 with TextBlob 0.6.0, but it should work with Python 2.6.8 as well.
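
If you do still want plain ASCII, as the question asks, one common approach is to decode the bytes to unicode first and then strip the accents with the standard library's unicodedata module. The following is only a sketch: to_ascii is a hypothetical helper name (it is not part of TextBlob or NLTK), and the utf-8 default is an assumption based on the 0xc3 byte in the traceback, so change it if your file turns out to be saved in another encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import unicodedata


def to_ascii(text, encoding='utf-8'):
    """Return an ASCII-only text string with accents dropped.

    Hypothetical helper, not part of TextBlob or NLTK. `encoding` is an
    assumption about the source bytes; adjust it to match your file.
    """
    # Byte strings must be decoded explicitly. Calling .encode() on a byte
    # string makes Python 2 try an implicit ASCII decode first, which is
    # where the UnicodeDecodeError in the question comes from.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # NFKD splits an accented letter into the base letter plus a
    # combining mark, for example u-acute becomes 'u' + U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # errors='ignore' drops the combining marks and any other
    # non-ASCII character; decode again to get a text string back.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


raw = u'But Húrin did not answer'.encode('utf-8')  # bytes, as read from a file
print(to_ascii(raw))
# But Hurin did not answer
```

Note that characters with no decomposition (such as ø or ð) are dropped entirely rather than transliterated, so check the output if the text contains any. The result can be passed to TextBlob like any other string.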

Licensed under: CC-BY-SA with attribution