Python NLTK snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev

Question 1

This works in PyDev because PyDev configures Python's default encoding to match the encoding of the console (which is usually UTF-8).

You can reproduce the same error in PyDev: open the run configuration (Run > Run Configurations) and, on the 'Common' tab, set the encoding to ascii.

This happens because your word is a byte string (str) and you're replacing with unicode chars, so Python 2 first tries to decode the byte string to unicode using the default encoding.

I hope the code below sheds some light for you:

All of the following assumes ascii as the default encoding:

>>> 'íã'.replace(u"\u2019", u"\x27")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)
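The implicit step behind that traceback can be made explicit. The following is a minimal sketch, assuming the two bytes are 'íã' as typed in a cp850 console (an assumption taken from the byte values in the traceback); it uses a bytes literal so that it behaves the same on Python 2 and Python 3:

```python
# 'íã' as raw cp850 bytes (assumed console encoding for this illustration).
raw = b'\xa1\xc6'

try:
    # This is the conversion Python 2 attempts behind the scenes when a
    # byte string is mixed with unicode and the default encoding is ascii.
    raw.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(failed)
print(message)
```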

But if you do it all in unicode, it works (you may need to encode the result back to the encoding you expect if you need to deal with byte strings and not unicode):

>>> u'íã'.replace(u"\u2019", u"\x27")
u'\xed\xe3'

So, you can make your string unicode before the replace:

>>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
u'\xed\xe3'

Or you can encode the replace chars:

>>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
'\xa1\xc6'

Note, however, that you must know the actual encoding of your data at every point where you handle it (so, although I'm using cp850 or utf-8 in the examples, the encodings you have to use may be different).
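One way to keep track of the encoding is to decode once on input, work in unicode, and encode once on output. The function below is only a sketch of that idea: `normalize_apostrophe` and its parameter names are hypothetical, and the encodings passed in are examples, not necessarily the ones your data uses.

```python
def normalize_apostrophe(raw_bytes, input_encoding, output_encoding):
    """Replace U+2019 with a plain apostrophe (hypothetical helper)."""
    # Decode explicitly, so the default encoding never gets involved.
    text = raw_bytes.decode(input_encoding)
    # Both sides of the replace are unicode now.
    text = text.replace(u"\u2019", u"\x27")
    # Encode back only at the boundary where bytes are needed.
    return text.encode(output_encoding)


# b'\xe2\x80\x99' is U+2019 encoded as UTF-8.
result = normalize_apostrophe(b'it\xe2\x80\x99s', 'utf-8', 'utf-8')
print(result == b"it's")
```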

Question 2

As Fabio stated, this happens because PyDev changes Python's default encoding. Once you know that, there are three possible solutions:

Test your code outside Pydev

PyDev will hide encoding issues from you until you run your code outside of Eclipse. So instead of using Eclipse's "run" button, test your code from a shell.

I wouldn't recommend this, though: it means your development environment will be different from your running environment, which can only lead to mistakes being made.
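To see how the two environments actually differ, a small diagnostic can be run once from PyDev and once from a shell. This is a rough sketch; `encoding_report` is a hypothetical helper name, and it only calls functions from the standard `sys` module.

```python
import sys


def encoding_report():
    """Collect the encodings that matter for this problem (hypothetical helper)."""
    return {
        # Used for implicit str <-> unicode conversions in Python 2.
        "default": sys.getdefaultencoding(),
        # Encoding of the console; may be None when output is piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

If the "default" line differs between the two runs, the environments do not agree on how byte strings get converted.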

Change Python's default encoding

You could change Python's environment to fit PyDev's. It is discussed in the question "How to set the default encoding to UTF-8 in Python?".

This answer will tell you how to do it, and this one will tell you why you shouldn't.

Long story short, don't.

Stop PyDev from changing Python's default encoding

If you're using Python 2, Python's default encoding should be ascii. So instead of making your environment fit PyDev's through a hack, you'd be better off forcing PyDev to "behave". How to do that is discussed here.