Question

I am rather new to Python, but since my native language includes some nasty umlauts, I have to dive right into the nightmare that is encoding. I read joelonsoftware's text on encoding and understand the difference between code points and actual renderings of letters (and the connection between Unicode and encodings). To get myself out of trouble I found three ways to deal with umlauts, but I can't decide which of them suits which situation. Could someone shed some light on it? I want to be able to write text to a file, read it back (or from sqlite3), and print it, all with readable umlauts... Thanks a lot!

# -*- coding: utf-8 -*-
import codecs

# using just u + string
with open("testutf8.txt", "w") as f:
    f.write(u"Österreichs Kapitän")

with open("testutf8.txt", "r") as f:
    print f.read()


# using encode/decode
s = u'Österreichs Kapitän'
sutf8 = s.encode('UTF-8')
with open('encode_utf-8.txt', 'w') as f2:
    f2.write(sutf8)
with open('encode_utf-8.txt','r') as f2:
    print f2.read().decode('UTF-8')


# using codec
with codecs.open("testcodec.txt", "w","utf-8") as f3:
    f3.write(u"Österreichs Kapitän")

with codecs.open("testcodec.txt", "r","utf-8") as f3:
    print f3.read() 

EDIT: I tested this (content of file is 'Österreichs Kapitän'):

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()
    print s
    s = s.replace(u"ä", u"ü")
    print s

Do I have to use u'string' (unicode) everywhere in my code? I found out that if I just use a plain string (without the 'u'), the replacement of the umlauts didn't work...

Was it helpful?

Solution

As a general rule of thumb, you typically want to decode an encoded string as early as possible, then manipulate it as a unicode object, and finally encode it as late as possible (e.g. just before writing it to a file).

So e.g.:

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()

# modify s here

with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(s)

As to your question about which way is best: I don't think there is a difference between using the codecs library and using encode/decode manually. It is a matter of preference; either works.
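For example, reading the file from your third snippet either way yields the same unicode object (a small sketch; it assumes testcodec.txt was already written as in your code above):

import codecs

# Round trip via codecs.open: decoding happens inside the file object.
with codecs.open("testcodec.txt", "r", "utf-8") as f:
    s1 = f.read()

# Round trip via plain open plus an explicit decode: same result.
with open("testcodec.txt", "r") as f:
    s2 = f.read().decode("utf-8")

print s1 == s2  # True: both are the same unicode string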

Simply using open, as in your first example, does not work, because Python will then try to encode the unicode string with the default codec (which is ASCII, unless you changed it).
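To illustrate (a minimal sketch; the file name is only illustrative), writing a unicode string through a plain open file object triggers that implicit encode with the default codec:

# -*- coding: utf-8 -*-
try:
    with open("plain_open.txt", "w") as f:  # hypothetical file name
        f.write(u"Österreichs Kapitän")     # implicit ASCII encode
except UnicodeEncodeError as e:
    print "implicit encoding failed:", e

# Encoding explicitly before writing avoids the implicit step:
with open("plain_open.txt", "w") as f:
    f.write(u"Österreichs Kapitän".encode("utf-8"))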

Regarding the question of whether you should use unicode strings everywhere: in principle, yes. If you create a string s = 'asdf', it has type str (you can check this with type(s)), and if you write s2 = u'asdf', it has type unicode. Since it is better to always manipulate unicode objects, the latter is recommended.
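To make this concrete (a small sketch, assuming the same utf-8 source declaration as in your script), this is also why the replace in your edit only worked with the u prefix:

# -*- coding: utf-8 -*-
s = 'asdf'
s2 = u'asdf'
print type(s)   # <type 'str'>     -> a byte string
print type(s2)  # <type 'unicode'> -> a sequence of code points

text = u"Österreichs Kapitän"
print repr(u"ä")  # u'\xe4'     -> one code point
print repr("ä")   # '\xc3\xa4'  -> two utf-8 bytes

# Mixing the two forces an implicit ASCII decode of the byte string:
try:
    text.replace("ä", "ü")
except UnicodeDecodeError as e:
    print "mixing str and unicode fails:", e

print text.replace(u"ä", u"ü")  # works: everything is unicode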

If you don't want to always have to append the 'u' in front of a string, you can use the following import:

from __future__ import unicode_literals

Then you can write s = 'asdf' and s will have the type unicode. In Python 3 this is the default, so the import is only needed in Python 2.
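A small sketch of the effect (Python 2, with the utf-8 source declaration kept):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'Österreichs Kapitän'   # a unicode object now, no u prefix needed
print type(s)               # <type 'unicode'>
print s.replace('ä', 'ü')   # both literals are unicode, so this works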

For potential gotchas you can take a look at Any gotchas using unicode_literals in Python 2.6?. Basically, you don't want to mix utf-8 encoded byte strings and unicode strings.
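One typical symptom of such mixing (a minimal sketch, again under the utf-8 source declaration) is that comparisons do not raise an error but quietly evaluate to False:

# -*- coding: utf-8 -*-
# Comparing utf-8 encoded bytes with a unicode string only emits a
# UnicodeWarning; the comparison itself is simply False.
print u"Ö" == "Ö"                  # False (plus a UnicodeWarning)
print u"Ö" == "Ö".decode("utf-8")  # True once both sides are unicode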

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow