Question

I have to convert some PDF files to TXT. I ended up using the "less" command, because pdftotext, for example, has problems with tables in PDFs. The problem is that when I run the command from PHP's exec function (or shell_exec/system), less only tells me that the selected PDF is a binary file, and the result file is just a TXT with raw PDF data in it. When I run the same command in a terminal, everything works. I also tried logging in as the www_data user and running the command as that user, and it worked there too.

Command:

$ less /var/www/original.pdf > /var/www/new.txt

PHP code:

exec("less -f /var/www/original.pdf > /var/www/new.txt 2>&1");

Result from PHP exec:

"/var/www/original.pdf" may be a binary file.  See it anyway?

The "-f" option in the exec command is there so that you don't need to press "y" for "yes, I want to see it anyway."

set | grep less yields:

LESSCLOSE='/usr/bin/lesspipe %s %s'
LESSOPEN='| /usr/bin/lesspipe %s'
            Lossless LZW RLE Zip' -- "$cur" ));
                _apport_parameterless
                _apport_parameterless
                _apport_parameterless
                _apport_parameterless
_apport_parameterless () 

Solution

From what I read, your console can display a PDF file with less because you have an input preprocessor installed, such as lesspipe or lessfile. less finds that preprocessor through an environment variable called LESSOPEN, which points to the lesspipe or lessfile script. Your set output shows LESSOPEN defined in your terminal session; the environment that PHP's exec runs in most likely does not have it.

Your web server may be able to replicate this behavior through environment variables and shell commands, so that its calls to less parse PDFs properly.
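To check whether the variable is actually missing, a small diagnostic like the one below could be run once from the terminal and once through PHP's exec. This is only a sketch: describe_lessopen is a made-up helper name, and it merely reports what the current environment contains.

```shell
#!/bin/sh
# Report whether the less input preprocessor is configured in this environment.
# describe_lessopen is a hypothetical helper name, not part of less or lesspipe.
describe_lessopen() {
    if [ -n "${LESSOPEN:-}" ]; then
        printf 'LESSOPEN is set: %s\n' "$LESSOPEN"
    else
        printf 'LESSOPEN is not set\n'
    fi
}

describe_lessopen
```

If the terminal run prints a value and the PHP run prints "LESSOPEN is not set", the missing preprocessor is the likely cause of the binary-file warning.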

I would suggest calling a bash script to do the conversion instead of calling less directly. That way, the script can set the appropriate environment variables and run the appropriate commands to convert your PDF files to readable output.

Here's an example of how to do this:

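#!/bin/bash

# Export LESSOPEN/LESSCLOSE so that less runs its input preprocessor.
eval "$(lesspipe)"
# $1 is the source PDF, $2 the output text file; the quotes keep paths with spaces intact.
less "$1" > "$2" 2>&1

Then, from PHP, call that script like this:

exec("/path/to/your/script/script.sh /var/www/original.pdf /var/www/new.txt");

If it doesn't work, try changing eval "$(lesspipe)" to eval "$(lessfile)".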

OTHER TIPS

First of all, less is an interactive program for reading text streams. In this context you should use cat instead. This of course won't work either, since PDF is a binary format rather than a text-based one.

Why don't you use a pdf to text converter like pdftotext?
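Since the question mentions trouble with tables, it may be worth trying pdftotext with its -layout option, which tries to preserve the physical layout of the page. The sketch below assumes pdftotext is installed and on the PATH; pdf_to_text is a hypothetical wrapper name, and whether -layout handles your particular tables well is something you would have to test.

```shell
#!/bin/sh
# Convert a PDF to text while trying to keep column alignment.
# pdf_to_text is a hypothetical helper name; -layout is a pdftotext option.
pdf_to_text() {
    src=$1
    dst=$2
    # Fail early with a readable message if the source cannot be read.
    if [ ! -r "$src" ]; then
        printf 'cannot read %s\n' "$src" >&2
        return 1
    fi
    pdftotext -layout "$src" "$dst"
}

# Example call (paths taken from the question):
# pdf_to_text /var/www/original.pdf /var/www/new.txt
```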

How was the PHP code executed? On the command line, via php file.php, or by a web server when you hit it with a browser at http://servername/something/file.php?

One guess is that the less you execute on the command line is not the same less that runs when the PHP code is executed.
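One way to test that guess is to print which less binary is resolved, under which user and PATH, in both contexts and compare the results. The following is a rough sketch; report_less_env is a made-up name, and the output depends entirely on the machine it runs on.

```shell
#!/bin/sh
# Print the parts of the environment that decide which less binary gets run.
# report_less_env is a hypothetical helper name.
report_less_env() {
    printf 'user: %s\n' "$(id -un)"
    printf 'less: %s\n' "$(command -v less || echo 'not found')"
    printf 'PATH: %s\n' "$PATH"
}

report_less_env
```

Saving this as a script and calling it once from the terminal and once through exec (capturing the output in PHP) should show whether the two runs resolve to different binaries or users.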

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow