Question

I use iText to convert a PDF to a text file. It generally works well, but for some words it does the following: the PDF contains a phrase like "present the main ideas", but iText produces output like "presentthemainideas". Is there any way to correct this behaviour?

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.StringTokenizer;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    // The statements below are the body of a method.
    String pdf = "/home/can/Downloads/NLP/textSummarization/A New Approach for  Multi-Document Update Summarization.pdf";
    String txt = "/home/can/myWorkSpace/PDFConverterProject/outputs/bb.txt";
    StringBuilder text = new StringBuilder();
    try {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));

        // Extract the text of every page (page numbers are 1-based).
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            text.append(strategy.getResultantText());
        }
        reader.close();

        // Rejoin words that were hyphenated across a line break.
        String resultText = text.toString().replaceAll("-\n", "");
        out.println("-->" + resultText);

        // Write each extracted line to a second file.
        StringTokenizer stringTokenizer = new StringTokenizer(resultText, "\n");
        PrintWriter lineWriter = new PrintWriter(new FileOutputStream("/home/can/myWorkSpace/PDFConverterProject/outputs/line.txt"));
        while (stringTokenizer.hasMoreTokens()) {
            String curToken = stringTokenizer.nextToken();
            lineWriter.println("line-->" + curToken);
        }
        lineWriter.flush();
        lineWriter.close();
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

Solution

The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word.

Unfortunately the same mechanism also is used to enhance the appearance of adjacent glyphs: In some letter combinations, for a good appearance and reading experience the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.

Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.

You use SimpleTextExtractionStrategy as text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText 5.x github git develop branch):

float spacing = lastEnd.subtract(start).length();
if (spacing > renderInfo.getSingleSpaceWidth()/2f)
{
    result.append(' ');
}

Thus, a gap that is more than half as wide as the current width of a space character is translated into a space character.

This generally sounds sensible. In documents that use only horizontal shifts to separate words, though, the current width of the actual space character may not be a good measure for the heuristics.
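To make the heuristic and its failure mode concrete, here is a minimal sketch that does not use iText at all, so it can be run on its own. It models each text chunk as a one-dimensional span on a line and applies the same "gap greater than half the space width" test. The names GapHeuristicDemo, Chunk, sampleLine and join, as well as all coordinates, are invented for illustration and are not iText types or real measurements.

```java
import java.util.Arrays;
import java.util.List;

public class GapHeuristicDemo {

    /** Hypothetical model of a text chunk: its text and horizontal extent. Not an iText type. */
    public static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        public Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Invented sample line: one small kerning shift inside "present", then two word gaps. */
    public static List<Chunk> sampleLine() {
        return Arrays.asList(
                new Chunk("pre", 0f, 15f),
                new Chunk("sent", 15.2f, 35f), // gap of about 0.2: kerning, not a word break
                new Chunk("the", 37f, 50f),    // gap of 2.0: word break
                new Chunk("main", 52f, 75f));  // gap of 2.0: word break
    }

    /** Joins chunks, inserting a space wherever the gap exceeds half of singleSpaceWidth. */
    public static String join(List<Chunk> chunks, float singleSpaceWidth) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null) {
                float spacing = Math.abs(chunk.startX - last.endX);
                if (spacing > singleSpaceWidth / 2f) {
                    result.append(' ');
                }
            }
            result.append(chunk.text);
            last = chunk;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Space width 3 gives a threshold of 1.5: the word gaps are detected.
        System.out.println(join(sampleLine(), 3f)); // prints "present the main"
        // Space width 6 gives a threshold of 3.0: the same gaps are now too small.
        System.out.println(join(sampleLine(), 6f)); // prints "presentthemain"
    }
}
```

The second call shows the situation from the question: when the font's space glyph is wide compared to the shifts the document actually uses between words, every gap falls below the threshold and the words run together.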

So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.

If you supply a sample PDF for your issue, we might have some ideas to help.

Other suggestions

You can use JasperReports. It works like a charm.

To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out separated by a space.

For example, "text" would read as "t e x t".

I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I tried to apply to the "single space width" value, the text came out the same way as before. So I debugged my code to see the width value itself and it turned out to be 0.

To circumvent that, you can use a fixed value in the code outlined by mkl:

float spacing = lastEnd.subtract(start).length();
if (spacing > someFixValue)
{
    result.append(' ');
}
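A zero space width explains the symptom: the original test becomes "spacing > 0", which is true for any positive gap, so a space is inserted between every pair of chunks. The following is a minimal sketch of a combined approach, again without iText so that it runs on its own: keep the font-based threshold when the reported width is usable, and fall back to a fixed value otherwise. The names SpaceThreshold and FALLBACK_GAP and the value 1.0 are assumptions for illustration; a suitable fixed value depends on the font size and the document.

```java
public class SpaceThreshold {

    /** Hypothetical fixed gap used when the font reports no usable space width. */
    public static final float FALLBACK_GAP = 1.0f;

    /** Half the space width when it is positive, otherwise the fixed fallback. */
    public static float threshold(float singleSpaceWidth) {
        if (singleSpaceWidth > 0f) {
            return singleSpaceWidth / 2f;
        }
        return FALLBACK_GAP; // also reached for NaN, since the comparison above is false
    }

    /** Decides whether a gap of the given size should become a space character. */
    public static boolean isWordGap(float spacing, float singleSpaceWidth) {
        return spacing > threshold(singleSpaceWidth);
    }

    public static void main(String[] args) {
        System.out.println(isWordGap(0.1f, 0f)); // false: small gap, fallback threshold 1.0
        System.out.println(isWordGap(2.5f, 0f)); // true: large gap, fallback threshold 1.0
        System.out.println(isWordGap(2.5f, 6f)); // false: font-based threshold 3.0
    }
}
```

In a real extraction strategy, the same decision would replace the comparison against someFixValue shown above.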

If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is isChunkAtWordBoundary(...) (IsChunkAtWordBoundary(...) in iTextSharp).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow