Question

Say I had the following code:

import os

homeDir = os.path.expanduser("~")
fullPath = homeDir + "/.config"
print fullPath

Would this code still function properly for someone in, say, Japan, whose home directory name was composed of Kanji?

My concern is that Python won't know how to add the two languages together, or even know what to do with the foreign characters.


Solution

On Python 2, all strings in the code from your question are bytestrings (sequences of bytes). They can represent anything, including text encoded in some character encoding.

import os

homeDir = os.path.expanduser("~") # input bytestring, returns bytestring
fullPath = homeDir + "/.config" # add 2 bytestrings
print fullPath

The print works, but you may see garbage in the console if the console uses a different character encoding than the path. Otherwise the code will work for any language and any foreign characters.
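As a rough illustration of the point above, here is a minimal sketch written for Python 3, where bytestrings have to be spelled out explicitly. The home directory name and the UTF-8 encoding are assumptions made up for the example; they are not taken from the question.

```python
# Hypothetical home directory containing Kanji, as the raw bytes the OS
# might hand back (UTF-8 is assumed here purely for illustration).
home_dir = "/home/太郎".encode("utf-8")

# Concatenating two bytestrings just joins the bytes; no language or
# encoding knowledge is needed for this step.
full_path = home_dir + b"/.config"

# Decoding with the same encoding gives readable text.
readable = full_path.decode("utf-8")

# Decoding with a different encoding "works" but yields garbage,
# which is what a mismatched console would display.
garbled = full_path.decode("latin-1")
```

The bytes themselves are never damaged: re-encoding `garbled` with latin-1 gives back exactly `full_path`, so the path still refers to the same directory.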


On Python 3, or if from __future__ import unicode_literals is used, string literals are Unicode. In this case it should also work:

from __future__ import unicode_literals
import os

homeDir = os.path.expanduser("~") # input Unicode, returns Unicode
fullPath = homeDir + "/.config" # add 2 Unicode strings
print(fullPath) # print Unicode

The print may fail if the output encoding cannot represent the characters in the path; in that case, try setting PYTHONIOENCODING to an appropriate encoding.
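The failure mode can be sketched without touching the real console by encoding the path by hand. This is only an approximation of what print does internally; the path and the choice of ASCII as the "terminal" encoding are assumptions for the example.

```python
# Hypothetical Unicode path; ASCII stands in for a terminal encoding
# that cannot represent Kanji.
full_path = "/home/太郎/.config"

try:
    full_path.encode("ascii")  # roughly what print has to do for an ASCII stdout
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # the same kind of error print would raise

# One possible workaround: escape unencodable characters instead of failing.
safe = full_path.encode("ascii", errors="backslashreplace").decode("ascii")
```

Setting the environment variable before starting the interpreter, for example `PYTHONIOENCODING=utf-8`, changes the encoding used for stdin, stdout, and stderr so that print can encode such characters itself.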

On POSIX systems, paths may contain arbitrary byte sequences (except the zero byte), including those that can't be decoded using the file system encoding. From the Python 3 docs:

In Python, file names, command line arguments, and environment variables are represented using the string type. On some systems, decoding these strings to and from bytes is necessary before passing them to the operating system. Python uses the file system encoding to perform this conversion (see sys.getfilesystemencoding()).

Changed in version 3.1: On some systems, conversion using the file system encoding may fail. In this case, Python uses the surrogateescape encoding error handler, which means that undecodable bytes are replaced by a Unicode character U+DCxx on decoding, and these are again translated to the original byte on encoding.

This means that fullPath might contain U+DCxx surrogates if the original path contains undecodable bytes, and print(fullPath) may fail even if the terminal uses a compatible character encoding. os.fsencode(fullPath) returns the original bytes if you need them.
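The round trip described above can be sketched with the surrogateescape error handler directly. The byte string below is a made-up path containing bytes that are not valid UTF-8, and UTF-8 is assumed as the file system encoding; on a real POSIX system os.fsdecode() and os.fsencode() apply the actual file system encoding with this same handler.

```python
# Hypothetical path with two bytes (0xff, 0xfe) that are invalid in UTF-8.
raw = b"/home/\xff\xfe/.config"

# Undecodable bytes become lone surrogates U+DCFF and U+DCFE.
decoded = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")

# A strict encode, which is what print to a UTF-8 terminal would attempt,
# rejects the lone surrogates.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

This is why the path remains usable for file operations even when it cannot be printed: the surrogates carry the original bytes through unchanged.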

OTHER TIPS

I would recommend reading this presentation on Unicode and encodings in Python to understand what might happen and how to tackle it.

Licensed under: CC-BY-SA with attribution