Question

How do I create document info dictionary keys containing Unicode characters (typically Swedish characters, for instance ä, U+00E4, which is C3 A4 in UTF-8)? I would like to use PdfStamper to add my own metadata to the document info dictionary, but I can't get it to accept the Swedish characters.

Entering custom metadata in Acrobat works fine, and when I look at the PDF in a text editor I can see that the characters are encoded as, for instance, #C3#A4 for the character mentioned above. Is there a way to achieve this programmatically using iText's PdfStamper?

regards Mattias

PS. There is no problem having Unicode characters in the info dictionary values, but the keys are a different story.

Solution

Please take a look at the NameObject example, and give it a try. You'll see that iText automatically escapes special characters in names.

iText follows the ISO-32000-1 specification, which states (7.3.5, Name Objects):

Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of the same sequence of characters denote the same object. Atomic means that a name has no internal structure; although it is defined by a sequence of characters, those characters are not considered elements of the name.

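When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules: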

a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.

b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.

NOTE 1: There is not a unique encoding of names into the PDF file because regular characters may be coded in either of two ways.

White space used as part of a name shall always be coded using the 2-digit hexadecimal notation and no white space may intervene between the SOLIDUS and the encoded name.

Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation.

The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters.

NOTE 2 The examples shown in Table 4 and containing # are not valid literal names in PDF 1.0 or 1.1.

I'm not copy/pasting table 4, but I don't see any example that uses characters that consist of two bytes. Can you share a PDF that contains a name with a two-byte character that behaves in the way you desire? The PDF specification explicitly says that characters in the context of names are 8-bit values. You seem to be talking about 16-bit values...

Additional note: in the current implementation of iText, we only look at 8 bits:

c = (char)(chars[k] & 0xff);

We deliberately throw away all the higher bits when characters with more than 8 bits are passed.
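
The escaping rules quoted above, combined with the 8-bit mask, can be sketched in plain Java. This is only an illustration of the rules from the specification, not iText's actual source: the class name NameEscapeSketch and the method escapeName are made up for this example, the delimiter set is the one I believe ISO 32000-1 defines, and the lowercase hex digits are an arbitrary choice.

```java
final class NameEscapeSketch {
    // PDF delimiter characters, plus '#', which rule a) says must always be escaped.
    private static final String MUST_ESCAPE = "()<>[]{}/%#";

    // Hypothetical helper: writes a name the way rules a) to c) describe.
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder("/"); // the SOLIDUS prefix
        for (char ch : name.toCharArray()) {
            int c = ch & 0xff; // keep only the low 8 bits, higher bits are dropped
            boolean outsidePrintableRange = c < 0x21 || c > 0x7e;
            if (outsidePrintableRange || MUST_ESCAPE.indexOf(c) >= 0) {
                // '#' followed by the 2-digit hexadecimal code
                out.append('#').append(String.format("%02x", c));
            } else {
                out.append((char) c); // regular character, written as itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E4 fits in 8 bits and is escaped as a single byte.
        System.out.println(escapeName("Special Character: \u00e4"));
        // U+C3A4 does not fit: only its low byte (a4) survives the mask.
        System.out.println(escapeName("\uc3a4"));
    }
}
```

Under these assumptions the first call prints /Special#20Character:#20#e4 and the second prints /#a4, which shows why passing a 16-bit character does not produce the two-byte sequence #C3#A4.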

Actually, I think I have answered your question. Initially, I thought you were asking to add this character: http://www.fileformat.info/info/unicode/char/c3a4/index.htm

As it turns out, you only need "\u00e4" (ä). I've made a small code sample that demonstrates how to add a custom entry containing this character to the document info dictionary: ChangeInfoDictionary.

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    Map<String, String> info = reader.getInfo();
    info.put("Special Character: \u00e4", "\u00e4");
    stamper.setMoreInfo(info);
    stamper.close();
    reader.close();
}

Granted, when you open the PDF in a PDF viewer, you don't necessarily see "Special Character: ä" as the key, but that's a problem of the PDF viewer. When you open the PDF in a text editor, you clearly see:

/Special#20Character:#20#e4(ä)

Which means that iText has correctly escaped the special character.

However: as you pointed out in your comment, the character doesn't show up in Adobe Reader. Based on a PDF I created using Acrobat, I found a workaround by using the following code:

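// Build the key from the two UTF-8 bytes of ä (C3 A4),
// each stored as a separate 8-bit character.
StringBuilder buf = new StringBuilder();
buf.append((char) 0xc3);
buf.append((char) 0xa4);
info.put(buf.toString(), "\u00e4");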

Now the character is shown correctly. In other words: it's a matter of encoding...
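
To avoid typing the bytes by hand, the same trick can probably be generalized: encode the key as UTF-8 and then reinterpret each byte as one 8-bit character. The sketch below uses only the JDK; the class name Utf8KeySketch and the method toUtf8Key are hypothetical, and it assumes that the viewer decodes name bytes as UTF-8, as the Acrobat-generated file suggests.

```java
import java.nio.charset.StandardCharsets;

final class Utf8KeySketch {
    // Hypothetical helper: returns a string whose chars are the UTF-8 bytes of the key,
    // one char (0x00 to 0xff) per byte, so that an 8-bit mask loses nothing.
    static String toUtf8Key(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 maps every byte value to the char with the same code.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String key = toUtf8Key("\u00e4");
        // Print the code of each resulting char in hex.
        for (char ch : key.toCharArray()) {
            System.out.printf("%02x ", (int) ch);
        }
        System.out.println();
    }
}
```

For ä this yields the two characters 0xc3 and 0xa4, the same key as the hand-built buffer above, so something like info.put(toUtf8Key("My key with åäö"), value) should behave the same way. Plain ASCII keys pass through unchanged.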

OTHER TIPS

I just wanted to share a little experiment in C# illustrating one rather effortless way of getting special characters into the document info dictionary keys.

string inputString = "My key with åäö";
byte[] inputBytes = Encoding.UTF8.GetBytes(inputString);
string convertedString = Encoding.UTF7.GetString(inputBytes);
info.Add(convertedString, "My value with åäö");

Here info is the Dictionary used for adding metadata. Then use the PdfStamper to write the info into the PDF. The metadata is stored correctly in the PDF and can be interpreted by Adobe Reader.

Licensed under: CC-BY-SA with attribution