Question

I have a list of tuples of unicode objects:

>>> t = [('亀',), ('犬',)]

Printing this out, I get:

>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]

which I guess is a list of the utf-8 byte-code representation of those strings?

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

but I'm having an inordinate amount of trouble getting the bytecode back into a human-readable form.


Solution

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

What do you want to see it printed out on? Because if it's the console, it's not at all guaranteed your console can display those characters. This is why Python's ‘repr()’ representation of objects goes for the safe option of \-escapes, which you will always be able to see on-screen and type in easily.

As a prerequisite you should be using Unicode strings (u''). And, as mentioned by Matthew, if you want to be able to write u'亀' directly in source you need to make sure Python can read the file's encoding. For occasional use of non-ASCII characters it is best to stick with the escaped version u'\u4e80', but when you have a lot of East Asian text you want to be able to read, “# coding=utf-8” is definitely the way to go.

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])

That would print the characters unwrapped by quotes. Really you'd want:

def reprunicode(u):
    return repr(u).decode('raw_unicode_escape')

print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in t])

This works, but if the console doesn't support Unicode (and this is especially troublesome on Windows), you'll get a big old UnicodeError.
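To make that failure concrete, here is a minimal sketch, written for Python 3 rather than the Python 2 used above. It encodes the text to ASCII, which stands in for a console that can't show these characters. The `backslashreplace` fallback is my own suggestion, not something the code above does.

```python
text = '[(亀,), (犬,)]'

# Encoding to a codec that lacks these characters raises UnicodeEncodeError,
# which is the error print runs into on a console that cannot show them.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Assumed fallback: swap every unencodable character for a \uXXXX escape
# instead of raising.
safe = text.encode('ascii', 'backslashreplace')

print(failed)  # True
print(safe)    # b'[(\\u4e80,), (\\u72ac,)]'
```

The fallback loses the readable characters, but the program keeps running.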

In any case, this rarely matters because the repr() of an object, which is what you're seeing here, doesn't usually make it to the public user interface of an application; it's really for the coder only.

However, you'll be pleased to know that Python 3.0 behaves exactly as you want:

  • plain '' strings without the ‘u’ prefix are now Unicode strings
  • repr() shows most Unicode characters verbatim
  • Unicode in the Windows console is better supported (you can still get UnicodeError on Unix if your environment isn't UTF-8)

Python 3.0 is a little bit new and not so well-supported by libraries, but it might well suit your needs better.
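A short Python 3 sketch of the difference; `shown` and `escaped` are just names I picked for illustration:

```python
t = [('亀',), ('犬',)]

shown = repr(t)      # Python 3 leaves printable non-ASCII characters as they are
escaped = ascii(t)   # ascii() still gives the escaped, Python 2 style form

print(shown == str(t))  # True: str() of a list is built from repr() of its items
print(escaped)          # [('\u4e80',), ('\u72ac',)]
```

On a UTF-8 terminal, print(t) then shows the characters directly.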

OTHER TIPS

First, there's a slight misunderstanding in your post. If you define a list like this:

>>> t = [('亀',), ('犬',)]

...those are not unicodes you define, but strs. If you want to have unicode types, you have to add a u before the character:

>>> t = [(u'亀',), (u'犬',)]

But let's assume you actually want strs, not unicodes. The main problem is that the __str__ method of a list (or a tuple) is practically the same as its __repr__ method (which returns a string that, when evaluated, would create exactly the same object). Because __repr__ should be encoding-independent, strings are represented in the safest way possible, i.e. each byte outside the ASCII range is shown as a hex escape (\xe4, for example).
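The same effect can be reproduced in Python 3, where a Python 2 str corresponds roughly to a bytes object. This is only a sketch under that assumption, and the variable names are mine:

```python
text = '亀'                   # text string: one character
data = text.encode('utf-8')   # byte string: three bytes, like a Python 2 str

# str() of a list calls repr() on every element, so the bytes show up escaped.
listing = str([(data,)])

print(len(text), len(data))   # 1 3
print(listing)                # [(b'\xe4\xba\x80',)]
```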

Unfortunately, as far as I know, there's no library method for printing a list that is locale-aware. You could use an almost-general-purpose function like this:

def collection_str(collection):
    # Pick the delimiters for the container type
    if isinstance(collection, list):
        brackets = '[%s]'
        single_add = ''
    elif isinstance(collection, tuple):
        brackets = '(%s)'
        single_add = ','  # a one-element tuple needs a trailing comma
    else:
        # Leaf value: use str(), not repr(), so no escapes (or quotes) appear
        return str(collection)
    items = ', '.join([collection_str(x) for x in collection])
    if len(collection) == 1:
        items += single_add
    return brackets % items

>>> print collection_str(t)
[(亀,), (犬,)]

Note that this won't work for all possible collections (sets and dictionaries, for example), but it's easy to extend it to handle those.
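One possible extension, sketched for Python 3 (where plain strings are already Unicode). The dict and set handling is my own guess at a reasonable format, not part of the function above. Set items are sorted only so the output is predictable.

```python
def collection_str(collection):
    """Render nested containers using str() for leaf values, not repr()."""
    if isinstance(collection, dict):
        pairs = ', '.join('%s: %s' % (collection_str(k), collection_str(v))
                          for k, v in collection.items())
        return '{%s}' % pairs
    if isinstance(collection, set):
        if not collection:
            return 'set()'  # '{}' would read as an empty dict
        # Sort the rendered items so the result is deterministic
        return '{%s}' % ', '.join(sorted(collection_str(x) for x in collection))
    if isinstance(collection, list):
        return '[%s]' % ', '.join(collection_str(x) for x in collection)
    if isinstance(collection, tuple):
        items = ', '.join(collection_str(x) for x in collection)
        if len(collection) == 1:
            items += ','  # keep the one-element tuple marker
        return '(%s)' % items
    return str(collection)


nested = collection_str({'pets': [('亀',), ('犬',)]})
```

Here nested is the string {pets: [(亀,), (犬,)]}. As in the original function, leaf strings come out without quotes.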

In Python 2, source files are assumed to be ASCII unless you declare an encoding, so you must either use \u escape sequences or add an encoding declaration. See PEP 263.

#!/usr/bin/python
# coding=utf-8
t = [u'亀', u'犬']
print t

When you pass a list to print, Python converts the object to a string using its rules for string conversion. For containers, that conversion uses repr() on each element, and repr() output is designed to be fed back to eval(), which is why you see those \u sequences. Here's a hack to get around that, based on bobince's solution. The console must accept Unicode or this will throw an exception.

t = [(u'亀',), (u'犬',)]
print repr(t).decode('raw_unicode_escape')
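To show what raw_unicode_escape does there, here is a Python 3 sketch of the same round trip. Python 3's repr() no longer escapes these characters, so ascii() stands in for the Python 2 repr(); that substitution is my assumption, and the output has no u prefixes.

```python
t = [('亀',), ('犬',)]

escaped = ascii(t)  # "[('\\u4e80',), ('\\u72ac',)]", like Python 2's repr()

# raw_unicode_escape turns each \uXXXX sequence back into its character and
# leaves everything else (brackets, quotes, commas) alone.
restored = escaped.encode('ascii').decode('raw_unicode_escape')

print(restored == repr(t))  # True
```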

Try wrapping sys.stdout in a UTF-8 writer, so that unicode strings are encoded as UTF-8 on output:

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

So this appears to do what I want:

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])


>>> t = [('亀',), ('犬',)]
>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]
>>> print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])
[(亀,), (犬,)]

Surely there's a better way to do it.

(But the other two answers so far don't print the original string as desired.)

It seems the answers are missing what people actually want here. When I print unicode from a tuple, I just want to get rid of the 'u', '[', '(' and the quotes. What we want is a function like the one below. After scouring the Net, it seems to be the cleanest way to get atomic, displayable data. If the data is not in a tuple or list, I don't think this problem exists.

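import re

def Plain(self, U_String):
    # str() of a one-element tuple looks like (u'...',) or ('...',)
    P_String = str(U_String)
    m = re.search(r"^\(u?'(.*)',\)$", P_String)
    if m:  # Typical unicode
        # The captured text still holds \uXXXX escapes, so decode those
        P_String = m.group(1).decode("unicode_escape")
    return P_String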
Licensed under: CC-BY-SA with attribution