Question

My parser function uses lxml and provides me a list of unicode strings (book_list).

The strings are joined together into a file name, cleaned up and then passed via subprocess.call to another binary which continues the work.

My problem is that the unicode objects (e.g. title_name = u'Wunderlicher Traum von einem gro\xdfen Narrennest') are encoded in ISO-8859-2 (at least that's what 'chardet' tells me) and I need to convert them to a format, which gets properly displayed on file system level. The current code results the file name to be u'Wunderlicher Traum von einem gro\xc3\x9fen Narrennest'.

Does anyone have an idea what I'm doing wrong?

Some infos:

  • sys.getdefaultencoding() returns ascii, which confuses me, since that theoretically shouldn't allow any special characters like äöü etc.).
  • OS X 10.9, Python 2.7.5

def convert_books(book_list, output_dir):
    for book in book_list:
        author_name = book[0][0]
        title_name = book[0][1]
        #print chardet.detect(title_name)
        #print type(title_name)
        #print title_name.decode('iso-8859-2')
        year_name = "1337"

        output_file = u"%s - %s (%s).pdf" % (author_name, title_name, year_name)
        keep_characters = (' ', '.', '_')
        output_file.join(c for c in output_file if c.isalnum() or c in keep_characters).rstrip()
        path_to_out = "%s%s" % (output_dir, output_file)

        target_file = WORK_DIR + book[1].replace(".xml", ".html")

        engine_parameter = [
            WKHTMLTOPDF_BIN,

            # GENERAL
            "-l", # lower quality
            "-L", "25mm",
            "-R", "25mm",
            "-T", "25mm",
            "-B", "35mm",
            "--user-style-sheet", "media/style.css",

            target_file,
            path_to_out,
        ]
        print "+ Creating PDF \"%s\"" % (output_file)
        call(engine_parameter)
Was it helpful?

Solution

After writing down the question, the cause of the issued to be clear :)

  • \xdf is UTF-8
  • \xc3\x9fis ISO-8859-1 or latin-1

All I had to do was convert the utf-8 objects to latin-1 objects and then pass the arguments to subprocess.call.

out_enc = 'latin-1'
engine_parameter = [arg.encode(out_enc) if isinstance(arg, unicode) else arg for arg in engine_parameter]
call(engine_parameter)

Hope this will save someone else the headache!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top