Question

I load a PDF document like this:

PdfReader pdfReader = new PdfReader(byteArray);
LocationTextExtractionStrategyEx st3 = new LocationTextExtractionStrategyEx();
PdfTextExtractor.GetTextFromPage(pdfReader, 1, st3);

Now I can get the list of page elements from st3.TextLocationInfo. Every element has the properties TopLeft and BottomRight, both of type Vector. How can I determine an element's position if I don't know the maximum value of the scale? I know the vectors start at the bottom-left corner of the page, but I don't know where the coordinate range ends, because I don't know the page size in the same scale as the vectors.

I can run:

var pageSize = pdfReader.GetPageSize(1);

But the values from the vectors are larger than pageSize.Width and pageSize.Height.

On a related note, can I get the position of every character on the page?

Solution 2

I read the page size with:

var pageSize = pdfReader.GetPageSize(1);

Next I created:

TextInfoLocation textLocation = new TextInfoLocation(textLine.TopLeft, textLine.BottomRight, this.PdfFilePageSize);

The properties .TopLeft and .BottomRight are vectors. textLine is a LocationTextExtractionStrategyEx.TextInfo object that the strategy read from pdfReader.

Now I can get the text position in pixels from the vectors with:

double leftMargin = textLocation.LeftMargin;
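
The TextInfoLocation class is not shown in this answer, so the following is only a sketch of how such a conversion might look, not the actual implementation. It assumes the vector values are in default PDF user space units (1/72 inch), that the page is not rotated, and that the pixel origin should be the top-left corner. The class PageCoordinates, its methods, and the dpi parameter are hypothetical names introduced for illustration.

```csharp
// Hypothetical helper, not part of iTextSharp and not the TextInfoLocation class above.
public static class PageCoordinates
{
    // Default PDF user space uses 72 units per inch.
    private const double PointsPerInch = 72.0;

    // Converts a length in PDF points to pixels at the given resolution.
    public static double PointsToPixels(double points, double dpi)
    {
        return points * dpi / PointsPerInch;
    }

    // Distance from the left page edge to x, in pixels.
    public static double LeftMarginPixels(double x, double pageLeft, double dpi)
    {
        return PointsToPixels(x - pageLeft, dpi);
    }

    // Distance from the top page edge down to y, in pixels.
    // PDF y coordinates grow upwards, so the difference is flipped.
    public static double TopMarginPixels(double y, double pageTop, double dpi)
    {
        return PointsToPixels(pageTop - y, dpi);
    }
}
```

With iTextSharp, x and y would typically come from textLine.TopLeft[Vector.I1] and textLine.TopLeft[Vector.I2], and pageLeft and pageTop from pageSize.Left and pageSize.Top.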

OTHER TIPS

LocationTextExtractionStrategyEx is not part of iTextSharp. I assume, therefore, that you are talking about the class provided in this answer. That class does nothing fancy with the positions. So, to address your issue:

I know the vectors start at the bottom-left corner of the page, but I don't know where the coordinate range ends, because I don't know the page size in the same scale as the vectors.

I can run:

var pageSize = pdfReader.GetPageSize(1);

But the values from the vectors are larger than pageSize.Width and pageSize.Height.

First of all, the coordinates you get from LocationTextExtractionStrategyEx.TextLocationInfo are indeed to be interpreted in the context of the rectangle returned by pdfReader.GetPageSize.

There are two main reasons why the vector values can exceed the Width and Height of that rectangle:

  1. The rectangle returned by pdfReader.GetPageSize does not need to have its lower-left corner at (0,0). It could, for example, have x coordinates in 300..400 and y coordinates in 500..600. In that case the height and width would both be 100, but the coordinates of every point in that rectangle would be higher.

    Thus, do not look at Width and Height but instead at Left, Bottom, Right, and Top.

  2. Text may actually be outside the visible page and, therefore, have coordinates outside of pdfReader.GetPageSize.
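
Both points can be checked with a few comparisons. The following is a minimal sketch, assuming an unrotated page; the class PageBounds and its methods are hypothetical names, not part of iTextSharp. It takes plain numbers so that it stays self-contained: with iTextSharp, the rectangle values would come from pageSize.Left, pageSize.Bottom, pageSize.Right, and pageSize.Top, and the point from vector[Vector.I1] and vector[Vector.I2].

```csharp
// Hypothetical helper, not part of iTextSharp.
public static class PageBounds
{
    // True if (x, y) lies within the rectangle [left, right] x [bottom, top].
    // A false result suggests the text sits outside the page rectangle (cause 2).
    public static bool Contains(double left, double bottom, double right, double top,
                                double x, double y)
    {
        return x >= left && x <= right && y >= bottom && y <= top;
    }

    // Horizontal position as a fraction of the page width:
    // 0 is the left edge, 1 is the right edge, regardless of where the rectangle starts.
    public static double RelativeX(double left, double right, double x)
    {
        return (x - left) / (right - left);
    }

    // Vertical position as a fraction of the page height:
    // 0 is the bottom edge, 1 is the top edge.
    public static double RelativeY(double bottom, double top, double y)
    {
        return (y - bottom) / (top - bottom);
    }
}
```

For a page rectangle spanning x from 300 to 400 and y from 500 to 600, the point (350, 550) is inside and sits at the relative position (0.5, 0.5), even though both coordinates exceed the Width and Height of 100.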

For a final verdict please supply the PDF in question.

Licensed under: CC-BY-SA with attribution