Writing unicode programs in python <= 2.7

https://stackoverflow.com/questions/18474939

26-06-2022

Question

What are some general guidelines for writing unicode programs in python <= 2.7? Is it good practice to prepend every string with u, even if it doesn't contain any characters outside of the ASCII range?

When dealing with sqlite3, will a parameterized query automatically encode unicode as utf-8, or does that need to be done manually?

When dealing with a 'string' of bytes, should this be left as a string object or decoded into a unicode string? (I believe this would throw an exception in most cases.)

If for any reason I need to use a literal unicode character in the code, can I just use that character in a string as long as it is a unicode string and I have my encoding declared at the top of the file?

EDIT: When printing a unicode string, how do I get the locale of the user's system so that I can correctly encode it? Blindly encoding everything as utf-8 seems like a bad idea since not all systems support it.

EDIT: I believe I figured this one out. It can be done using locale:

import locale
encoding = locale.getpreferredencoding()

EDIT: Is this encoding actually done implicitly? Now I am very confused. On Linux, I can do this:

s = u'\u2c60'
print s # prints Ⱡ
print s.encode('utf-8') # prints Ⱡ

But on Windows this happens:

s = u'\u2c60'
print s # prints Ⱡ in IDLE, UnicodeEncodeError in cmd
print s.encode('cp1252') # UnicodeEncodeError
print s.encode('utf-8') # prints â±
print s.encode('cp1252', 'replace') # prints ?

It does seem like print does the conversion implicitly.

EDIT: This question says print will automatically encode to the encoding stored in sys.stdout.encoding: "Why Does Python print unicode characters when the default encoding is ASCII?"

Now I'm wondering, is there a way to make print replace unencodable characters by default? Or do I need to wrap print in my own function, something like:

import sys

def myPrint(msg):
    # Replace characters the console encoding cannot represent with '?'
    print msg.encode(sys.stdout.encoding, 'replace')

I know most of these problems have been addressed in Python 3, but I would like to support Python <= 2.7.

Solution

Is it good practice to prepend every string with u, even if it doesn't contain any characters outside of the ASCII range?

Yes. Also use an editor that handles unicode, and declare the source encoding at the top of each file.

In general, your pattern should be: read bytes, work internally with unicode, output bytes.
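To make that pattern concrete, here is a minimal sketch. The function name `shout` and its parameters are made up for illustration, and the input is assumed to be UTF-8. It decodes once on the way in, does all the work on unicode, and encodes once on the way out, using the 'replace' error handler for target encodings that cannot represent every character.

```python
def shout(raw_bytes, in_encoding='utf-8', out_encoding='utf-8'):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = raw_bytes.decode(in_encoding)         # bytes -> unicode
    text = text.upper()                          # all processing happens on unicode
    # 'replace' substitutes '?' for characters the target encoding lacks
    return text.encode(out_encoding, 'replace')  # unicode -> bytes

raw = b'caf\xc3\xa9'                             # UTF-8 bytes for 'caf' + U+00E9
as_utf8 = shout(raw)                             # b'CAF\xc3\x89'
as_ascii = shout(raw, out_encoding='ascii')      # b'CAF?'
```

The same idea applies to files: open them in binary mode (or with io.open and an explicit encoding) so that the decode and encode steps happen in exactly one place each.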

When dealing with sqlite3, will a parameterized query automatically encode unicode as utf-8, or does that need to be done manually?

Better to be safe than sorry, but in general I recommend that you test this out yourself.
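If you want to test it, a quick round trip through an in-memory database is enough. The sketch below assumes the default database encoding (UTF-8) and uses a throwaway table named `notes`, which is just an illustrative name. The unicode object is passed straight to the parameterized query without a manual .encode() call, then read back and compared.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (body TEXT)')

original = u'\u2c60'                      # the character from the question
# Pass the unicode object directly as a parameter; no manual .encode() here
conn.execute('INSERT INTO notes (body) VALUES (?)', (original,))

# Read the value back as text
(stored,) = conn.execute('SELECT body FROM notes').fetchone()
# Ask SQLite how many bytes the stored text occupies
(byte_len,) = conn.execute(
    'SELECT length(CAST(body AS BLOB)) FROM notes').fetchone()
conn.close()
```

If `stored` equals `original` and `byte_len` is 3 (the length of U+2C60 in UTF-8), the module did the encoding for you.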

When dealing with a 'string' of bytes, should this be left as a string object or decoded into a unicode string? (I believe this would throw an exception in most cases.)

Yes, work internally with unicode. No, this won't throw an exception if you actually know the encoding. You should know the encoding. Make sure you know the encoding.
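Here is a small sketch of why knowing the encoding matters, reusing the bytes for the character from the question. Decoding with the right encoding works, a wrong guess such as ASCII raises, and a wrong single-byte guess such as latin-1 does not raise at all but quietly gives you garbage.

```python
raw = b'\xe2\xb1\xa0'            # UTF-8 bytes for U+2C60

text = raw.decode('utf-8')       # correct encoding: succeeds

try:
    raw.decode('ascii')          # wrong guess: bytes above 0x7F are not ASCII
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# A wrong single-byte encoding does not raise; it silently yields mojibake
mojibake = raw.decode('latin-1')
```

The silent case is the dangerous one, which is why guessing is not a substitute for knowing the encoding.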

If for any reason I need to use a literal unicode character in the code, can I just use that character in a string as long as it is a unicode string and I have my encoding declared at the top of the file?

Yes, as long as your editor is unicode friendly.
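For example, a file along these lines should work, assuming it is really saved as UTF-8 and the coding line is on the first or second line of the file:

```python
# -*- coding: utf-8 -*-
# The declaration above must match the encoding the editor saves the file in.

s = u'Ⱡ'                 # literal character inside a unicode string
same = u'\u2c60'         # the same character written as an escape
```

If the declared encoding and the actual file encoding disagree, the literal will not compare equal to the escape, so the escape form is the safer choice when you cannot control the editor.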

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow