Should my python web app use unicode for all strings?

https://stackoverflow.com/questions/827415

05-07-2019

Question

I see some frameworks like Django using unicode all over the place so it seems like it might be a good idea.

On the other hand, it seems like a big pain to have all these extra 'u's floating around everywhere.

What will be a problem if I don't do this?

Are there any issues that will come up if I do do this?

I'm using Pylons right now as my framework.

Solution

In Python 3, all strings are Unicode. You can prepare for this by using u'' strings wherever you need them; when you eventually upgrade to Python 3 with the 2to3 tool, all the u prefixes will disappear. You'll also be in a better position because you will already have tested your code with Unicode strings.

See Text Vs. Data Instead Of Unicode Vs. 8-bit for more information.

OTHER TIPS

You can avoid the u'' prefix in Python 2.6 and later by putting this at the top of each module:

from __future__ import unicode_literals

That makes plain 'string literals' unicode objects, just as they are in Python 3.

What will be a problem if I don't do this?

I'm a westerner living in Japan, so I've seen first-hand what is needed to work with non-ASCII characters. The problem if you don't use Unicode strings is that your code will frustrate users in the parts of the world that use anything other than A-Z. Our company has had a great deal of frustration getting certain web software to handle Japanese characters without making a total mess of them.

It takes a little effort for English speakers to appreciate how great Unicode is, but it really is a terrific bit of work to make computers accessible to all cultures and languages.

"Gotchas":

Make sure your output web pages state the encoding in use properly (e.g. with the charset parameter of the Content-Type header), and then encode all Unicode strings to that encoding at output. Python 3's Unicode strings make this much easier to get right.
Do everything with Unicode strings, and only convert to a specific encoding at the last moment, when doing output. Other languages, such as PHP, are prone to bugs when manipulating Unicode in an encoded form such as UTF-8. Say you have to truncate a Unicode string. If it's held as UTF-8 bytes internally, there's a risk you could chop a multi-byte character in half, resulting in rubbish output. Python's use of Unicode strings internally makes it harder to make these mistakes.
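
To make the truncation gotcha concrete, here is a minimal sketch. The sample text and the helper name is_valid_utf8 are made up for illustration, and the string is written with escapes so that the snippet should run unchanged on Python 2.6+ and Python 3.

```python
# "Japanese" written in Japanese: three characters, written with escapes
# so the source file itself stays pure ASCII.
text = u"\u65e5\u672c\u8a9e"
utf8_bytes = text.encode("utf-8")  # 9 bytes: each character takes 3 bytes


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Truncating the Unicode string counts characters, so the result
# still encodes to well-formed UTF-8.
safe = text[:2].encode("utf-8")

# Truncating the encoded bytes counts bytes, so a cut at 4 lands
# in the middle of the second character.
broken = utf8_bytes[:4]

print(is_valid_utf8(safe))    # True
print(is_valid_utf8(broken))  # False
```

Slicing a Unicode string can still separate a base character from a combining mark, so this removes the byte-level problem rather than every possible truncation problem.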

Using Unicode internally is a good way to avoid problems with non-ASCII characters. Convert at the boundaries of your application (incoming data to unicode, outgoing data to UTF-8 or whatever). Pylons can do the conversion for you in many cases: e.g. controllers can safely return unicode strings; SQLAlchemy models may declare Unicode columns.
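
As a rough sketch of "convert at the boundaries", the function below decodes incoming bytes once, works with Unicode in the middle, and encodes once on the way out. handle_request and its parameters are hypothetical names for illustration, not part of Pylons or any other framework, and the sketch skips HTML escaping to stay short.

```python
def handle_request(raw_name, input_charset="utf-8"):
    """Hypothetical handler: bytes in, (headers, bytes) out."""
    # Incoming boundary: decode the raw bytes into a Unicode string.
    name = raw_name.decode(input_charset)

    # Inside the application, work only with Unicode strings.
    page = u"<p>Hello, " + name + u"!</p>"

    # Outgoing boundary: encode once, and declare the same charset
    # in the Content-Type header so the browser decodes it correctly.
    body = page.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


# Example call with a Japanese name supplied as UTF-8 bytes.
headers, body = handle_request(u"\u592a\u90ce".encode("utf-8"))
```

Note that Content-Length is computed from the encoded bytes, not from the Unicode string, since the two lengths differ as soon as the text contains non-ASCII characters.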

Regarding string literals in your source code: in Python 2 the u prefix is usually not necessary, because you can safely mix str objects containing only ASCII with unicode objects. Just make sure every string literal is either pure ASCII or written as u"unicode".
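
The reason mixing works for ASCII only is that Python 2 implicitly decodes the str operand with its default codec (ASCII, unless someone has changed it) before combining it with the unicode operand. The sketch below spells that implicit step out explicitly so it behaves the same on Python 2.6+ and Python 3; concat_like_python2 is an illustrative name, not a real API.

```python
def concat_like_python2(byte_string, text):
    """Emulate what Python 2 does for `str + unicode`."""
    # Python 2 decodes the byte string with the default (ASCII) codec
    # behind the scenes, then concatenates two unicode objects.
    return byte_string.decode("ascii") + text


# Pure-ASCII bytes mix with Unicode text without trouble.
ok = concat_like_python2(b"abc", u"\u00e9")

# Non-ASCII bytes (here, UTF-8 encoded) fail at the implicit decode.
try:
    concat_like_python2(u"\u00e9".encode("utf-8"), u"abc")
    failed = False
except UnicodeDecodeError:
    failed = True
```

In Python 3 the implicit step is gone entirely: adding bytes to str raises TypeError, so the decode always has to be written out.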

Licensed under: CC-BY-SA with attribution
