Question

In Python, strings may be Unicode (both UTF-16 and UTF-8) or single-byte with different encodings (cp1251, cp1252, etc.). Is it possible to check what encoding a string is in? For example,

time.strftime("%b")

will return a string with the text name of a month. Under Mac OS the returned string will be UTF-16; under Windows with an English locale it will be single-byte with ASCII encoding; and under Windows with a non-English locale it will be encoded via the locale's codepage, for example cp1251. How can I handle such strings?


Solution

Strings don't store any encoding information; you just have to specify one when you convert to/from Unicode or print to an output device:

import locale

# getdefaultlocale() returns e.g. ('en_US', 'UTF-8'); the encoding may be None
lang, encoding = locale.getdefaultlocale()
mystring = u"blabla"
# specify the encoding explicitly when converting unicode to bytes for printing
print mystring.encode(encoding)

UTF-8 is not Unicode; it's an encoding of Unicode text into 8-bit byte strings.
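For instance, a quick Python 2 illustration:

u = u"\u00e9"                  # one Unicode code point ('é')
b = u.encode("utf-8")          # two bytes in UTF-8: '\xc3\xa9'
assert b.decode("utf-8") == u  # decoding reverses the encoding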

The best practice is to work with Unicode everywhere on the Python side, store your strings with a reversible Unicode encoding such as UTF-8, and convert to locale-specific encodings only for user output.
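A minimal Python 2 sketch of that pattern (the file name and sample bytes are illustrative assumptions):

import locale

raw = "caf\xc3\xa9"                      # UTF-8 bytes, e.g. read from a file
text = raw.decode("utf-8")               # decode to unicode at the input boundary

with open("out.txt", "wb") as f:
    f.write(text.encode("utf-8"))        # store with a reversible encoding

_, encoding = locale.getdefaultlocale()
print text.encode(encoding or "ascii", "replace")   # locale encoding only at output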

OTHER TIPS

Charset encoding detection is very complex.

However, what's your real purpose for this? If you just want the value to be in Unicode, simply write

unicode(time.strftime("%b"))

and it should cover the cases you've mentioned above, with one caveat:

  • Mac OS: unicode(unicode) -> unicode
  • Win/Eng: unicode(ascii) -> unicode
  • Win/non-Eng: unicode(some_cp) decodes with Python's default encoding (ASCII unless changed), so non-ASCII month names will raise UnicodeDecodeError; pass the locale's codepage explicitly instead (see the sketch below)
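A minimal Python 2 sketch of that explicit decoding (the isinstance check is an addition for the already-unicode case, not part of the original answer):

import locale
import time

value = time.strftime("%b")
if isinstance(value, str):
    # byte string: decode it with the locale's codepage rather than ASCII
    value = value.decode(locale.getpreferredencoding() or "ascii")
# value is now a unicode object in all three cases above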

If you have a reasonably long string in an unknown encoding, you can try to guess the encoding, e.g. with the Universal Encoding Detector at https://github.com/dcramer/chardet -- not foolproof of course, but sometimes it guesses right ;-). But that won't help much with very short strings.
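A minimal sketch of using it (assuming the chardet package is installed; the sample bytes are made up for illustration):

import chardet

# cp1251 bytes for a short Russian word; real input should be much longer
data = "\xcf\xf0\xe8\xe2\xe5\xf2"
guess = chardet.detect(data)      # e.g. {'encoding': 'windows-1251', 'confidence': 0.9}
if guess["encoding"]:
    text = data.decode(guess["encoding"])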

Licensed under: CC-BY-SA with attribution