Question

I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below.

    stemmer = EnglishStemmer()
    # Stem, lowercase, substitute all punctuations, remove stopwords.
    attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]

When I run this on documents using PyDev in Eclipse, I receive no errors. When I run it in the Terminal (Mac OS X), I receive the error below. Can someone please help?

      File "data_processing.py", line 171, in __filter__
        attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
      File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

Solution

This works in PyDev because PyDev configures Python itself to use the encoding of the console (usually UTF-8) as its default encoding.

You can reproduce the same error in PyDev: go to the run configuration (run > run configurations) and, on the 'common' tab, set the encoding to ascii.

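This happens because your word is a byte string (str) and you're calling replace on it with unicode arguments. Python 2 first tries to decode the word using the default encoding, and with ascii that fails on any non-ascii byte.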

I hope the code below sheds some light on this. All of it assumes ascii is the default encoding:

>>> 'íã'.replace(u"\u2019", u"\x27")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)
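
The decode that Python 2 performs implicitly here can be written out explicitly, which makes the failure easier to see. This is a minimal sketch, assuming the two bytes are 'íã' encoded as cp850 (as in the session above); the variable names are only for illustration:

```python
raw = b"\xa1\xc6"  # 'íã' as cp850 bytes (assumed encoding)

try:
    # Roughly what Python 2 attempts before mixing str with unicode.
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("cp850")

print(error)
print(repr(text))
```

The ascii decode fails on the first byte, while decoding with the real encoding gives the text you expected.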

But if you do it all in unicode, it works (you may need to encode the result back afterwards if the rest of your code expects byte strings in a particular encoding rather than unicode):

>>> u'íã'.replace(u"\u2019", u"\x27")
u'\xed\xe3'

So, you can make your string unicode before the replace:

>>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
u'\xed\xe3'

Or you can encode the replacement chars:

>>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
'\xa1\xc6'

Note, however, that you must know the actual encoding of your data at every point. I'm using cp850 and utf-8 in the examples, but the encodings you have to use may be different.
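
Applied to the pipeline in the question, the first option amounts to decoding the document once, before any tokenizing or stemming. This is a rough sketch, assuming the input is UTF-8. `to_text` is a hypothetical helper, the sample string is made up, and `split()` stands in for `wordpunct_tokenize` so the snippet does not depend on NLTK:

```python
import re
import string


def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes as they might arrive from a UTF-8 file (assumed encoding).
raw_doc = u"The caf\u00e9\u2019s menu, again!".encode("utf-8")

doc = to_text(raw_doc)  # decode once, at the input boundary
no_punct = re.sub(u"[%s]" % re.escape(string.punctuation), u"", doc)
tokens = [token.lower() for token in no_punct.split()]

# The same replace that failed in snowball.py now sees text only.
cleaned = [token.replace(u"\u2019", u"\x27") for token in tokens]
print(cleaned)
```

With the document decoded up front, every token that reaches the stemmer is already unicode, so no implicit ascii decode is triggered.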

OTHER TIPS

As Fabio stated, this happens because Pydev changes Python's default encoding. Once you know that, there are three possible solutions:

Test your code outside Pydev

Pydev will hide encoding issues from you, until you run your code outside of Eclipse. So instead of using Eclipse's "run" button, test your code from a shell.

I wouldn't recommend this, though: it means your development environment will be different from your running environment, which can only lead to mistakes.
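
To check whether the two environments really differ, you can print the encodings Python reports in each of them. This is a small sketch using only the standard `sys` module; the `report` dictionary is just an illustrative name:

```python
import sys

# Collect the encodings Python is using in the current environment.
report = {
    "default": sys.getdefaultencoding(),
    "stdout": getattr(sys.stdout, "encoding", None),
    "filesystem": sys.getfilesystemencoding(),
}

for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

Run it once from Pydev and once from the shell. If the "default" value differs between the two, that explains why the same code behaves differently.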

Change Python's default encoding

You could change Python's environment to match Pydev's. This is discussed in the question "How to set the default encoding to UTF-8 in Python?".

This answer will tell you how to do it, and this one will tell you why you shouldn't.

Long story short, don't.

Stop Pydev from changing Python's default encoding

If you're using Python 2, Python's default encoding should be ascii. So instead of making your environment fit Pydev's through a hack, you'd be better off forcing Pydev to "behave". How to do that is discussed here.

Licensed under: CC-BY-SA with attribution