Generate character images with a font whose name cannot be correctly decoded

Question 1

Silvia's comment on the OP...

You might want to consider specifying the encoding parameter like ImageFont.truetype(font_path,font_size,encoding="big5")

...gets you halfway there, but it looks like you also have to manually translate the Unicode characters if you're not using a Unicode font.

For the fonts which use "big5hkscs" encoding, I had to do this...

>>> u = u'\u6211'      # Unicode for 我
>>> u.encode('big5hkscs')
'\xa7\xda'

...then use u'\ua7da' to get the right glyph, which is a bit weird, but it looks to be the only way to pass a multi-byte character to PIL.
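As a quick check of the byte packing described above, here's a minimal sketch that rebuilds the 0xa7da code from the two Big5-HKSCS bytes. The variable names are mine, not anything from PIL, and bytearray is used only so the same lines behave identically on Python 2 and 3.

```python
# Pack the two bytes of a Big5-HKSCS character into one integer code.
u = u'\u6211'                             # Unicode code point of the target character
raw = bytearray(u.encode('big5hkscs'))    # bytearray yields ints on both Python 2 and 3
code = (raw[0] << 8) | raw[1]             # lead byte in the high 8 bits, trail byte in the low 8
print(hex(code))
```

The resulting integer is the code point of the "fake" unicode character that gets handed to PIL.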

The following code works for me on both Python 2.7.4 and Python 3.3.1, with PIL 1.1.7...

from PIL import Image, ImageDraw, ImageFont


# Declare font files and encodings
FONT1 = ('Jin_Wen_Da_Zhuan_Ti.ttf',          'unicode')
FONT2 = ('Zhong_Guo_Long_Jin_Shi_Zhuan.ttf', 'big5hkscs')
FONT3 = ('Zhong_Yan_Yuan_Jin_Wen.ttf',       'big5hkscs')


# Declare a mapping from encodings used by str.encode() to encodings used by
# the FreeType library
ENCODING_MAP = {'unicode':   'unic',
                'big5':      'big5',
                'big5hkscs': 'big5',
                'shift-jis': 'sjis'}


# The glyphs we want to draw
GLYPHS = ((FONT1, u'\u6211'),
          (FONT2, u'\u6211'),
          (FONT3, u'\u6211'),
          (FONT3, u'\u66ce'),
          (FONT2, u'\u4e36'))


# Returns PIL Image object
def draw_glyph(font_file, font_encoding, unicode_char, glyph_size=128):

    # Translate unicode string if necessary
    if font_encoding != 'unicode':
        mb_string = unicode_char.encode(font_encoding)
        try:
            # Try using Python 2.x's unichr
            unicode_char = unichr(ord(mb_string[0]) << 8 | ord(mb_string[1]))
        except NameError:
            # Use Python 3.x-compatible code
            unicode_char = chr(mb_string[0] << 8 | mb_string[1])

    # Load font using mapped encoding
    font = ImageFont.truetype(font_file, glyph_size, encoding=ENCODING_MAP[font_encoding])

    # Now draw the glyph
    img = Image.new('L', (glyph_size, glyph_size), 'white')
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text=unicode_char, font=font)
    return img


# Save an image for each glyph we want to draw
for (font_file, font_encoding), unicode_char in GLYPHS:
    img = draw_glyph(font_file, font_encoding, unicode_char)
    filename = '%s-%s.png' % (font_file, hex(ord(unicode_char)))
    img.save(filename)

Note that I renamed the font files to the same names as the 7zip files. I try to avoid using non-ASCII characters in code examples, since they sometimes get screwed up when copy/pasting.

This example should work fine for the encodings declared in ENCODING_MAP, which can be extended if needed (see the FreeType encoding strings for valid FreeType encodings), but you'll need to change some of the code in cases where Python's str.encode() doesn't produce a multi-byte string of exactly length 2.
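One way that change might look is sketched below. The helper to_font_char() is a hypothetical name of mine, and I'm assuming the font's character map is keyed by the encoded bytes read as a big-endian integer, which matches the two-byte case above but is untested for other lengths.

```python
def to_font_char(unicode_char, font_encoding):
    """Return a one-character string whose code point is the legacy character code.

    Hypothetical helper: folds however many bytes str.encode() produces
    into a single integer, instead of assuming exactly two.
    """
    raw = bytearray(unicode_char.encode(font_encoding))
    code = 0
    for byte in raw:
        code = (code << 8) | byte         # big-endian: earlier bytes end up higher
    try:
        return unichr(code)               # Python 2
    except NameError:
        return chr(code)                  # Python 3


print(repr(to_font_char(u'\u6211', 'big5hkscs')))   # two-byte character
print(repr(to_font_char(u'A', 'big5hkscs')))        # single-byte (ASCII) character
```

This only makes sense for one- or two-byte sequences. Longer ones can produce a value that unichr()/chr() will refuse, so those would need a different approach.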

Update

If the problem is in the ttf file, how could you find the answer in the PIL and FreeType source code? Above, you seem to be saying PIL is to blame, but why should one have to pass unicode_char.encode(...).decode(...) when you just want unicode_char?

As I understand it, the TrueType font format was developed before Unicode became widely adopted, so if you wanted to create a Chinese font back then, you'd have had to use one of the encodings in use at the time, and Big5 had been the dominant encoding for Traditional Chinese (mainly in Taiwan and Hong Kong) since the mid 1980s.

It stands to reason, then, that there had to be a way to retrieve glyphs from a Big5-encoded TTF using the Big5 character encodings.

The C code for rendering a string with PIL starts with the font_render() function, and ultimately calls FT_Get_Char_Index() to locate the correct glyph, given the character code as an unsigned long.

However, PIL's font_getchar() function, which produces that unsigned long, only accepts Python string and unicode types, and it doesn't seem to do any translation of character encodings itself. So the only way I could see to get the correct value for the Big5 character set was to coerce a Python unicode character into the correct unsigned long value, exploiting the fact that u'\ua7da' is stored internally as the integer 0xa7da, in either 16 or 32 bits depending on how Python was compiled.

TBH, there was a fair amount of guesswork involved, since I didn't bother to investigate what exactly the effect of ImageFont.truetype()'s encoding parameter is, but by the looks of it, it's not supposed to do any translation of character encodings, but rather to allow a single TTF file to support multiple character encodings of the same glyphs, using the FT_Select_Charmap() function to switch between them.

So, as I understand it, the FreeType library's interaction with the TTF files works something like this...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

class TTF(object):
    glyphs = {}
    encoding_maps = {}

    def __init__(self, encoding='unic'):
        self.set_encoding(encoding)

    def set_encoding(self, encoding):
        self.current_encoding = encoding

    def get_glyph(self, charcode):
        try:
            return self.glyphs[self.encoding_maps[self.current_encoding][charcode]]
        except KeyError:
            return ' '


class MyTTF(TTF):
    glyphs = {1: '我',
              2: '曎'}
    encoding_maps = {'unic': {0x6211: 1, 0x66ce: 2},
                     'big5': {0xa7da: 1, 0x93be: 2}}


font = MyTTF()
print('Get via Unicode map: %s' % font.get_glyph(0x6211))
font.set_encoding('big5')
print('Get via Big5 map: %s' % font.get_glyph(0xa7da))

...but it's up to each TTF to provide the encoding_maps variable, and there's no requirement for a TTF to provide one for Unicode. Indeed, it's unlikely that a font created prior to the adoption of Unicode would have.
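To illustrate that last point under the same toy model, here's a sketch of a font that ships only a Big5 map. LegacyTTF and the glyph name are made up for illustration. This is not how FreeType is actually implemented, just the same simplification as above.

```python
class LegacyTTF(object):
    """Toy font that provides only a Big5 character map (no Unicode map)."""
    glyphs = {1: 'glyph-for-U+6211'}
    encoding_maps = {'big5': {0xa7da: 1}}

    def __init__(self, encoding='unic'):
        self.current_encoding = encoding

    def get_glyph(self, charcode):
        try:
            return self.glyphs[self.encoding_maps[self.current_encoding][charcode]]
        except KeyError:
            return None                   # stand-in for "no such glyph"


font = LegacyTTF()                        # defaults to the Unicode map, which this font lacks
by_unicode = font.get_glyph(0x6211)       # lookup by Unicode code point finds nothing
font.current_encoding = 'big5'
by_big5 = font.get_glyph(0xa7da)          # lookup by Big5 code succeeds
print('Unicode lookup: %s, Big5 lookup: %s' % (by_unicode, by_big5))
```

The Unicode lookup comes back empty even though the glyph is in the font, which is the situation the workaround in draw_glyph() is dealing with.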

Assuming all that is correct, then there's nothing wrong with the TTF - the problem is just with PIL making it a little awkward to access glyphs for fonts which don't have a Unicode mapping, and for which the required glyph's unsigned long character code is greater than 255.

Question 2

The problem is that the fonts don't strictly conform to the TrueType specification. A quick solution is to use FontForge (you are using it already) and let it sanitize the fonts.

1. Open a font file
2. Go to Encoding, then select Reencode
3. Choose ISO 10646-1 (Unicode BMP)
4. Go to File, then Generate Fonts
5. Save as TTF
6. Run your script with the newly generated fonts

Voila! It prints 我 in beautiful font!