Xpdf's pdftotext converts a PDF to text and can write the result to standard output. If needed, it inserts page breaks between the pages, as specified in TextOutputDev.cc:

eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop));

This character is independent of the output encoding; -enc ASCII7 doesn't change it. I want to use PHP to convert the PDF and split it into separate text pages for database storage. The following loop works, but it takes twice as long as converting the whole PDF in a single run:

for($i = 1; $i <= $pages[0]; $i++)
    $page[$i] = shell_exec('/usr/bin/pdftotext sample.pdf -f '.$i.' -l '.$i.' -');

How am I supposed to explode(0x0c, $wholePDF) with a Unicode character as the separator? Currently, $page[$i] doesn't seem to receive those odd Unicode page-break characters from shell_exec(). I tried several encoding headers (UTF-8 in particular), but nothing has worked so far.


Solution

0x0c is an ASCII character (i.e. in the range 0-127), and as such in UTF-8 encoding it is represented as itself and not as a multibyte sequence. You should be able to explode(chr(0x0c), $wholePDF).
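As a minimal sketch of that approach: the helper below splits text on the form feed byte. The name splitPages and the sample string are made up for illustration, and the trailing-element cleanup assumes the text may end with a form feed after the last page.

```php
<?php
// Hypothetical helper: split text into pages on the form feed byte (0x0C).
// "\f" in a double-quoted PHP string is the same single byte as chr(0x0c).
function splitPages(string $text): array
{
    $pages = explode("\f", $text);

    // If the text ends with a form feed, explode() leaves an empty
    // trailing element; drop it so it is not stored as an extra page.
    if (end($pages) === '') {
        array_pop($pages);
    }

    return $pages;
}

// Sample input standing in for the output of one pdftotext run.
$sample = "Page one\fPage two\f";
$pages = splitPages($sample);
```

With real input, $wholePDF would come from one call such as shell_exec('/usr/bin/pdftotext sample.pdf -'), so pdftotext runs once instead of once per page. Note that shell_exec() can return null or false on failure, which should be checked before splitting.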

Other tips

I guess you could convert the text to another encoding and then use that symbol as the separator for explode:

http://www.php.net/manual/en/ref.mbstring.php#74722

Licensed under: CC-BY-SA with attribution