Unicode in PDF

https://stackoverflow.com/questions/128162

02-07-2019
|

Question

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:

(something)

There is also the option to escape a character with octal codes:

(\527)

but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.

Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.

The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.

Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.

Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

Solution

The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.

OTHER TIPS

In the PDF reference in chapter 3, this is what they say about Unicode:

Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography). For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase).

Algoman's answer is wrong in many things. You can make a PDF documents with unicode in it' and it's not a rocket science, though it needs some work. Yes he is right, to use more than 255 characters in one font you have to create a composite font (CIDFont) pdf object. Then you just mention the actual TrueType font you want to use as a DescendatFont entry of CIDFont. The trick is that after that you have to use glyph indices of a font instead of character codes. To get this indices map you have to parse cmap section of a font - get contents of the font with GetFontData function and take hands on TTF specification. And that's it! I've just did it and now I have a unicode pdf!

Sample Code for parsing cmap section is here: https://support.microsoft.com/en-us/kb/241020

And yes, don't forget /ToUnicode entry as @user2373071 pointed out or user will not be able to search your PDF or copy text from it.

As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.

See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.

Check out PDFBox or iText.

http://www.adobe.com/devnet/pdf/pdf_reference.html

I have worked several days on this subject now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.

seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics-subroutines, that can be invoked by addressing them (with 16-bit addresses).

So to use unicode in pdf directly

you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
also, 2-byte characters don't even cover the full Unicode space

IMHO, these points make it absolutely unfeasible to use unicode directly.

What I am doing instead now is using the characters indirectly in the following way: For every font, I generate a codepage (and a lookup-table for fast lookups) - in c++ this would be something like

std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;

then, whenever I want to put some unicode-string on a page, I iterate its characters, look them up in the lookup-table and - if they are new, I add them to the code-page like this:

for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{                
    if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
    {
        LookupTable[fontname][*i] = Codepage[fontname].size();
        Codepage[fontname].push_back(*i);
    }
}

then, I generate a new string, where the characters from the original string are replaced by their positions in the codepage like this:

static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{                
    int id = LookupTable[fontname][*i] + 1;
    result += hex[(id & 0x00F0) >> 4];
    result += hex[(id & 0x000F)];
}
result += ">";

for example, "H€llo World!" might become <01020303040506040703080905> and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...

but you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding to the Font object and setting its Differences

For the "H€llo World!" example, this Font-Object would work:

5 0 obj 
<<
    /F1
    <<
        /Type /Font
        /Subtype /Type1
        /BaseFont /Times-Roman
        /Encoding
        <<
          /Type /Encoding
          /Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
        >>
    >> 
>>
endobj

I generate it with this code:

ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
    (*stream) << "  /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;

    (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
    for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
        (*stream) << "    /" << GlyphName(*j) << "\n";
    (*stream) << "  ] >>";

    (*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";

Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...

So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at

http://www.jdawiseman.com/papers/trivia/character-entities.html

and it looks like this

const std::string GlyphName(wchar_t UnicodeCodepoint)
{
    switch(UnicodeCodepoint)
    {
        case 0x00A0: return "nonbreakingspace";
        case 0x00A1: return "exclamdown";
        case 0x00A2: return "cent";
        ...
    }
}

A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.

Inside the pdf, different codepages are represented by different fonts, so to switch between codepages, you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I for one, can live with that...

I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:

Are you sure you are using a font that supports all the characters you need?

In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow