Losing accents with for loop in Python

Question 1

I've googled a bit on this problem and found something here:

http://eclipsesource.com/blogs/2013/02/21/pro-tip-unicode-characters-in-the-eclipse-console/

In the Launch Configuration dialog, go to the Common tab and set the encoding to UTF-8 or Latin-1.

If this doesn't solve the problem, try decoding the string from UTF-8 into a unicode object before looping over it and printing each character:

line = unicode("áaáaáaá", encoding="utf-8")
for c in line:
    print c

Edit: Here's some explanation :)

When you don't specify the encoding as UTF-8, the string is handled as raw bytes and gets split in the wrong places. For example, á is stored as '\xc3\xa1'. In the loop, Python treats it as two separate characters:

>>> s = "áaáaáaá".encode()
>>> for i, c in enumerate(s):
    print(i,c)


0 195
1 161
2 97
3 195
4 161
5 97
6 195
7 161
8 97
9 195
10 161

It treats \xc3\xa1 as two characters, which (read as Latin-1) are:

Ã
¡

Why does it work when you specify the encoding, then? Well, I'm sure you've got it already. When you set the encoding to UTF-8, the string is interpreted as UTF-8, so \xc3\xa1 is recognized as one character.
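
To make the difference concrete, here is a minimal sketch in Python 3 (an assumption on my part, since the code in the question is Python 2). It takes the two bytes that UTF-8 uses for á and decodes them once as Latin-1 and once as UTF-8. The variable names are only for illustration.

```python
# Python 3 sketch: the same two bytes read with two different encodings.
raw = "\u00e1".encode("utf-8")     # "á" as UTF-8 bytes: b'\xc3\xa1'

as_latin1 = raw.decode("latin-1")  # each byte becomes its own character
as_utf8 = raw.decode("utf-8")      # the byte pair becomes one character

print(list(raw))   # [195, 161]
print(as_latin1)   # Ã¡
print(as_utf8)     # á
```

The bytes never change; only the encoding used to interpret them does.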

The second method works even if you don't set the console encoding to UTF-8. Why? Because this:

line = unicode("áaáaáaá", encoding="utf-8")

decodes the UTF-8 bytes into a unicode object, which the interpreter then handles one character at a time.
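
If you are on Python 3, the unicode() builtin no longer exists. A rough equivalent, assuming you start from UTF-8 bytes (for example, read from a file opened in binary mode), is bytes.decode. This is only a sketch of the same idea, not the code from the question:

```python
# Python 3 sketch; in Python 2 the equivalent call is unicode(raw, encoding="utf-8").
raw = b"\xc3\xa1a\xc3\xa1a\xc3\xa1a\xc3\xa1"  # UTF-8 bytes of the sample string

line = raw.decode("utf-8")  # bytes -> text (a sequence of code points)

for c in line:
    print(c)  # one whole character per iteration
```

Note that a plain string literal in Python 3 is already text, so looping over "áaáaáaá" directly yields whole characters without any decoding step.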

Hope this helps!

Question 2

I tried the following in the Python interpreter to understand this; I hope these findings help you!

>>> line = "áaáaáaá"
>>> line
'\xc3\xa1a\xc3\xa1a\xc3\xa1a\xc3\xa1'

The entire line is stored as a UTF-8 byte string. Note that á is stored as the two bytes \xc3\xa1.

line = "áaáaáaá"
for c in line:
    print c

The loop splits the line byte by byte: '\xc3', '\xa1', 'a', '\xc3', ... and so the output is something like � � a � � a � � a � �

If you instead specify the encoding like this:

>>> line = unicode("áaáaáaá", encoding="utf-8")
>>> line
u'\xe1a\xe1a\xe1a\xe1'

This decodes the bytes into a unicode string in which each character is a single code point (á becomes u'\xe1').

Now the loop splits the line character by character: u'\xe1', u'a', u'\xe1', u'a', u'\xe1', u'a', u'\xe1'

and the output is the characters á, a, á, a, á, a, á, each printed on its own line.
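
The two splits can be compared side by side. The sketch below is Python 3 (an assumption, since the session above is Python 2), and the names byte_units and char_units are made up for illustration:

```python
# Python 3 sketch comparing the two splits of the same sample string.
text = "\u00e1a\u00e1a\u00e1a\u00e1"  # the sample string, written with escapes
raw = text.encode("utf-8")            # the same string as UTF-8 bytes

byte_units = [raw[i:i + 1] for i in range(len(raw))]  # one-byte slices
char_units = list(text)                               # one-character strings

print(len(byte_units), byte_units[:3])         # 11 [b'\xc3', b'\xa1', b'a']
print(len(char_units), ascii(char_units[:2]))  # 7 ['\xe1', 'a']
```

Eleven bytes versus seven characters: each á accounts for two of the bytes, which is why the byte-wise loop prints two broken symbols for every accented letter.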