Setting the correct encoding when piping stdout in Python

https://stackoverflow.com/questions/492483

20-08-2019
|

Question

When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:

# -*- coding: utf-8 -*-
print u"åäö"

will work fine when run normally, but fail with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

when used in a pipe sequence.

What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?

The suggestions I have seen thus far is to modify your site.py directly, or hardcoding the defaultencoding using this hack:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

Is there a better way to make piping work?

Solution

Your code works when run in an script because Python encodes the output to whatever encoding your terminal application is using. If you are piping you must encode it yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)

Setting the system default encoding is a bad idea, because some modules and libraries you use can rely on the fact it is ASCII. Don't do it.

OTHER TIPS

First, regarding this solution:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

It's not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.

A better solution is to change sys.stdout at the start of your program, to encode with a selected encoding. Here is one solution I found on Python: How is sys.stdout.encoding chosen?, in particular a comment by "toka":

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

You may want to try changing the environment variable "PYTHONIOENCODING" to "utf_8". I have written a page on my ordeal with this problem.

Tl;dr of the blog post:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gives you

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻

export PYTHONIOENCODING=utf-8

do the job, but can't set it on python itself ...

what we can do is verify if isn't setting and tell the user to set it before call script with :

if __name__ == '__main__':
    if (sys.stdout.encoding is None):
        print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
        exit(1)

Update to reply to the comment: the problem just exist when piping to stdout . I tested in Fedora 25 Python 2.7.13

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

running ./b.py

UTF-8

running ./b.py | less

None

I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set: "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8 and she now works like a charm.

Hope this helps!

An arguable sanitized version of Craig McQueen's answer.

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

Usage:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'

I could "automate" it with a call to:

def __fix_io_encoding(last_resort_default='UTF-8'):
  import sys
  if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
      import os
      defEnc = None
      if defEnc is None :
        try:
          import locale
          defEnc = locale.getpreferredencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.getfilesystemencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.stdin.encoding
        except: pass
      if defEnc is None :
        defEnc = last_resort_default
      os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
      os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding

Yes, it's possible to get an infinite loop here if this "setenv" fails.

I just thought I'd mention something here which I had to spent a long time experimenting with before I finally realised what was going on. This may be so obvious to everyone here that they haven't bothered mentioning it. But it would've helped me if they had, so on that principle...!

NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython...

NB2: the first two lines of my .py file here are:

# -*- coding: utf-8 -*-
from __future__ import print_function

The "%" (AKA "interpolation operator") string construction mechanism causes ADDITIONAL problems too... If the default encoding of the "environment" is ASCII and you try to do something like

print( "bonjour, %s" % "fréd" )  # Call this "print A"

You will have no difficulty running in Eclipse... In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it'll work.

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

will also work.

If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints... (dreaded encoding error).

So then you might think of redirecting your stdout by using

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and try running in the CLI piping to a file... Very oddly, print A above will work... But print B above will throw the encoding error! The following will however work OK:

print( u"bonjour, " + "fréd" ) # Call this "print C"

The conclusion I have come to (provisionally) is that if a string which is specified to be a Unicode string using the "u" prefix is submitted to the %-handling mechanism it appears to involve the use of the default environment encoding, regardless of whether you have set stdout to redirect!

How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I've got it wrong in some way, what the preferred solution to this, whether it also applies to CPython, whether it happens in Python 3, etc., etc.

On Ubuntu 12.10 and GNOME Terminal, no error is produced when the program is printing to stdout or hooked to a pipe for other programs. Both file encoding and terminal encoding is UTF-8.

$ cat a.py
# -*- coding: utf-8 -*-
print "åäö"
$ python a.py
åäö
$ python a.py | tee out
åäö

What OS and terminal emulator are you using? I heard some of my colleagues have similar problems when using iTerm 2 and OS X; iTerm 2 may be the culprit.

Update: This answer is wrong - see comments for details

I ran into this problem in a legacy application, and it was difficult to identify where what was printed. I helped myself with this hack:

# encoding_utf8.py
import codecs
import builtins


def print_utf8(text, **kwargs):
    print(str(text).encode('utf-8'), **kwargs)


def print_utf8(fn):
    def print_fn(*args, **kwargs):
        return fn(str(*args).encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

On top of my script, test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

Note that this changes ALL calls to print to use an encoding, so your console will print this:

$ python test.py
b'Axwell \xce\x9b Ingrosso'

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow