I am generating text from a PDF file with the help of pdftotext.
My issue is not with pdftotext itself but with formatting the text it produces, which looks like this:

Salman              Madhuri             Mohnish             Renuka                Anupam
Khan                Dixit               Behl                Shahane               Kher
Prem                Nisha Chou...       Rajesh              Pooja Chou...         Prof. Siddh


Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...

I need the output to be:

Salman Khan Prem
Madhuri Dixit Nisha Chou...
Mohnish Behl Rajesh
Renuka Shahane Pooja Chou...
Anupam Kher Prof.

Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...

Solution

I'm not sure what your delimiters are, but you could do something like the following (kind of ugly, but it gets the job done):

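// Split the name table from the rest of the text at the first run of blank lines.
// \R matches any line ending, so this works for both "\n" and "\r\n".
$namesAndContent = preg_split('/\R{2,}/', $theString, 2);
$nameRows = preg_split('/\R/', $namesAndContent[0]);
$names = array();
foreach ($nameRows as $row) {
    // Columns are assumed to be separated by two or more whitespace characters.
    $items = preg_split('/\s{2,}/', trim($row));
    foreach ($items as $index => $namePart) {
        if (!array_key_exists($index, $names)) {
            $names[$index] = array();
        }
        // Collect the parts column by column.
        $names[$index][] = $namePart;
    }
}

// Print one line per column, with the parts joined by single spaces.
foreach ($names as $name) {
    echo implode(' ', $name) . "\r\n";
}
echo "\r\n";
echo $namesAndContent[1];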

Demo: http://codepad.viper-7.com/Nr1Q4t
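
The same idea can be wrapped in a small function that returns the lines instead of echoing them, which makes it easier to check against the sample from the question. This is only a sketch: the function name transposeColumns is made up for illustration, it assumes PHP 7.4 or later (for the arrow function), and it assumes that columns are always separated by at least two whitespace characters while a single cell never contains more than one space in a row.

```php
<?php
// Hypothetical helper (not part of the original answer): turns a block of
// whitespace-aligned columns into one string per column.
function transposeColumns(string $block): array
{
    $columns = [];
    // \R matches any line ending ("\n", "\r\n", ...).
    foreach (preg_split('/\R/', trim($block)) as $row) {
        // Assumption: columns are separated by two or more whitespace characters.
        foreach (preg_split('/\s{2,}/', trim($row)) as $index => $part) {
            $columns[$index][] = $part;
        }
    }
    // Join the collected parts of each column with single spaces.
    return array_map(fn(array $parts) => implode(' ', $parts), $columns);
}

// First three columns of the sample from the question.
$sample = "Salman              Madhuri             Mohnish\n"
        . "Khan                Dixit               Behl\n"
        . "Prem                Nisha Chou...       Rajesh";

print_r(transposeColumns($sample));
```

For this sample the function returns "Salman Khan Prem", "Madhuri Dixit Nisha Chou..." and "Mohnish Behl Rajesh". Note that a cell such as "Nisha Chou..." stays intact only because it contains a single space; if pdftotext ever emits a column gap of just one space, this approach will merge or misalign cells.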

The above formats the data (provided the delimiters are correct), but I am wondering where the data originally comes from (before it ended up in the PDF), because I suspect there is a better way to solve your problem. Perhaps there is an API you can use directly.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow