itext java pdf to text creation

Question 1

The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word.

Unfortunately the same mechanism also is used to enhance the appearance of adjacent glyphs: In some letter combinations, for a good appearance and reading experience the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.

Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.

You use SimpleTextExtractionStrategy as text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText 5.x github git develop branch):

float spacing = lastEnd.subtract(start).length();
if (spacing > renderInfo.getSingleSpaceWidth()/2f)
{
    result.append(' ');
}

Thus, a gap which is at least half as wide as the current width of as space character, is translated into a space character.

This generally sounds sensible. In case of documents, though, which only use horizontal shifts to separate words, the current widths of the actual space character may not be a good measure for the heuristics.

So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.

If you supply a sample PDF for your issue, we might have some ideas to help.

Question 2

you can use jasper reports. It works like a charm

Question 3

To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out seperated by a space.

text would read as "t e x t"

I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I tried to apply to the "single space width" value, the text came out the same way as before. So I debugged my code to see the width value itself and it turned out to be 0.

To circumvent that you can use a fix value in the code outlined by mkl:

float spacing = lastEnd.subtract(start).length();
if (spacing > someFixValue)
{
    result.append(' ');
}

if you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is: IsChunkAtWordBoundary(...)