Flatten FDF / XFDF forms to PDF in PHP with utf-8 characters

https://stackoverflow.com/questions/3973283

09-10-2019
|

Question

My scenario:

A PDF template with formfields: template.pdf
An XFDF file that contains the data to be filled in: fieldData.xfdf

Now I need to have these to files combined & flattened. pdftk does the job easily within php:

exec("pdftk template.pdf fill_form fieldData.xfdf output flatFile.pdf flatten");

Unfortunately this does not work with full utf-8 support. For example: Cyrillic and greek letters get scrambled. I used Arial for this, with an unicode character set.

How can I accomplish to flatten my unicode files?
Is there any other pdf tool that offers unicode support?
Does pdftk have an unicode switch that I am missing?

EDIT 1: As this question has not been solved for more then 9 month, I decided to start a bounty for it. In case there are options to sponsor a feature or a bugfix in pdftk, I'd be glad to donate.

EDIT 2: I am not working on this project anymore, so I cannot verify new answers. If anyone has a similar problem, I am glad if they can respond in my favour.

Solution

Unfortunately, UTF-8 character encoding does not work neither with decimal nor hexadecimal references of non-ASCII characters in source .xfdf file. PDFTK v. 1.44.

OTHER TIPS

I found by using Jon's template but using the DomDocument the numeric encoding was handled for me and worked well. My slight variation is below:

$xml = new DOMDocument( '1.0', 'UTF-8' );

$rootNode = $xml->createElement( 'xfdf' );
$rootNode->setAttribute( 'xmlns', 'http://ns.adobe.com/xfdf/' );
$rootNode->setAttribute( 'xml:space', 'preserve' );
$xml->appendChild( $rootNode );

$fieldsNode = $xml->createElement( 'fields' );
$rootNode->appendChild( $fieldsNode );

foreach ( $fields as $field => $value )
{
    $fieldNode = $xml->createElement( 'field' );
    $fieldNode->setAttribute( 'name', $field );
    $fieldsNode->appendChild( $fieldNode );

    $valueNode = $xml->createElement( 'value' );
    $valueNode->appendChild( $xml->createTextNode( $value ) );
    $fieldNode->appendChild( $valueNode );
}

$xml->save( $file );

You could try the trial version of http://www.adobe.com/products/livecycle/designer/ and see what PDF files it generates.

Another commercial software you could try is http://www.appligent.com/fdfmerge. See page 16 in http://146.145.110.1/docs/userguide/FDFMergeUserGuide.pdf for how it handles xFDF with UTF-8.

I also had a look at the FDF specification http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf On page 12 it states:

Although XFDF is encoded in UTF-8, double byte characters are encoded as character references when 
exported from Acrobat. 
For example, the Japanese double byte characters ,  , and  are exported to XFDF using 
three character references. Here is an example of double byte characters in a form field: 
  ...
<fields>  
  <field name="Text1"> 
     <value>Here are 3 UTF-8 double byte  
        characters: &#x3042;&#x3044;&#x3046;
</value>  
  </field>  
</fields> ...

I looked through pdftk-1.44-dist/java/com/lowagie/text/pdf/XfdfReader.java. It doesn't seem to do anything special with the input.

Maybe pdftk will do what you want, when you encode the weird characters as character references in your xFDF input.

Using the pdftk 1.44 on a Win7 machine I encounter the same problems with xfdf-files whereas fdf works fine. I made a xfdf-file without any special characters (only ANSI) but pdftk crashed again. I mailed the developper. Unfortunately no answer until now.

I made some progress on this. Starting with code from http://koivi.com/fill-pdf-form-fields/, I modified the value encoding to output numeric codes for any characters outside the ascii range.

Now with pitulski's special strings:

Poznań Śródmieście Ćwiartka Ósma outputs Pozna ródmiecie wiartka Ósma with some box shapes superimposed

ęóąśłżźćńĘÓĄŚŁŻŹĆŃ outputs óÓ with more box shapes. I think it may be that the box shapes are characters my server doesn't recognize.

I tried it with some French characters: ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ and they all came out OK, but some of them were overlapping.

--edit-- I just tried entering these manually into the form and got the same result minus the box shapes (using Evince). I then tried with a different form (created by someone else) - after entering ęóąśłżźćńĘÓĄŚŁŻŹĆŃ, ółÓŁ was displayed. It looks like it depends which characters are included in the document's embedded fonts.

/*
KOIVI HTML Form to FDF Parser for PHP (C) 2004 Justin Koivisto
Version 1.2.?
Last Modified: 2013/01/17 - Jon Hulka(jon dot hulka at gmail dot com)
  - changed character encoding, all non-ascii characters get encoded as numeric character references

    This library is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 2.1 of the License, or (at
    your option) any later version.

    This library is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
    or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
    License for more details.

    You should have received a copy of the GNU Lesser General Public License
    along with this library; if not, write to the Free Software Foundation,
    Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 

    Full license agreement notice can be found in the LICENSE file contained
    within this distribution package.

    Justin Koivisto
    justin dot koivisto at gmail dot com
    http://koivi.com
*/

/**
 * createXFDF
 * 
 * Tales values passed via associative array and generates XFDF file format
 * with that data for the pdf address sullpiled.
 * 
 * @param string $file The pdf file - url or file path accepted
 * @param array $info data to use in key/value pairs no more than 2 dimensions
 * @param string $enc default UTF-8, match server output: default_charset in php.ini
 * @return string The XFDF data for acrobat reader to use in the pdf form file
 */
function createXFDF($file,$info,$enc='UTF-8'){
    $data=
'<?xml version="1.0" encoding="'.$enc.'"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
    <fields>';
    foreach($info as $field => $val){
        $data.='
        <field name="'.$field.'">';
        if(is_array($val)){
            foreach($val as $opt)
//2013.01.17 - Jon Hulka - all non-ascii characters get character references
            $data.='
            <value>'.mb_encode_numericentity(htmlspecialchars($opt),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
//                $data.='<value>'.htmlentities($opt,ENT_COMPAT,$enc).'</value>'."\n";
        }else{
            $data.='
            <value>'.mb_encode_numericentity(htmlspecialchars($val),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
//            $data.='<value>'.htmlentities($val,ENT_COMPAT,$enc).'</value>'."\n";
        }
        $data.='
        </field>';
    }
    $data.='
    </fields>
    <ids original="'.md5($file).'" modified="'.time().'" />
    <f href="'.$file.'" />
</xfdf>';
    return $data;
}

What PDFTK's version? I tried the same thing with Polish characters (utf-8).

Does not work for me.

pdftk.exe, libiconv2.dll from: http://www.pdflabs.com/docs/install-pdftk/

Windows 7, cmd, file.pdf + file.fdf -> new.pdf

pdftk file.pdf fill_form file.xfdf output new.pdf flatten

Unhandled Java Exception:
java.lang.NoClassDefFoundError: gnu.gcj.convert.Input_UTF8 not found in [file:.\, core:/]
   at 0x005a3abe (Unknown Source)
   at 0x005a3fb2 (Unknown Source)
   at 0x006119f4 (Unknown Source)
   at 0x00649ee4 (Unknown Source)
   at 0x005b4c44 (Unknown Source)
   at 0x005470a9 (Unknown Source)
   at 0x00549c52 (Unknown Source)
   at 0x0059d348 (Unknown Source)
   at 0x007323c9 (Unknown Source)
   at 0x0054715a (Unknown Source)
   at 0x00562349 (Unknown Source)

But, with FDF file, with the same content, it worked properly. But the characters in new.PDF are bad.

pdftk file.pdf fill_form file.fdf output new.pdf flatten

---FDF---

%FDF-1.2
%âãÏÓ
1 0 obj<</FDF<</F(file.pdf)
/Fields[
<</T(Miejsce)/V(666 Poznań Śródmieście Ćwiartka Ósma)>>
<</T(Nr)/V(ęóąśłżźćńĘÓĄŚŁŻŹĆŃ)>>
]>>>>
endobj
trailer
<</Root 1 0 R>>
%%EOF

---XFDF---

<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="file.pdf"/>
<fields>
<field name="Miejsce">
<value>666 Poznań Śródmieście Ćwiartka Ósma</value>
</field>
<field name="Nr">
<value>ęóąśłżźćńĘÓĄŚŁŻŹĆŃ</value>
</field>
</fields>
</xfdf>

---PDF---

Miejsce: 666 PoznaÅ— ÅırÃ³dmieÅłcie Äƒwiartka Ãfisma
Nr: ÄŽÃ³Ä–ÅłÅ‡Å¼ÅºÄ⁄Å—ÄŸÃfiÄ—ÅıÅ†Å»Å¹ÄƒÅ

You can introduce utf-8 characters by giving their unicode code in octal with \ddd

To solve this, I wrote PdfFormFillerUTF-8: http://sourceforge.net/projects/pdfformfiller2/

There is a drop-in replacement for pdftk tool

Mcpdf: https://github.com/m-click/mcpdf

that solves unicode issues when filling forms. Works for me with CP1250 characters (Central Europe).

From project page:

the following command fills in form data from DATA.xfdf into FORM.pdf and writes the result to RESULT.pdf. It also flattens the document to prevent further editing:

java -jar mcpdf.jar FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf

This corresponds exactly to the usual PDFtk command:

pdftk FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf

Note that you need to have JRE installed.

I have managed to make it work with pdftk by creating a xfdf file with utf-8 encoding.

it took several tried but what make it work as exepcted was to add 'need_appearances'

here is an example:

pdftk source.pdf fill_form data.xfdf output output.pdf need_appearances

pdftk supports encoding in UTF-16BE. It's not that difficult to convert from UTF-8 to UTF-16BE.

See: Weird characters when filling PDF with PDFTk

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow