Pregunta

¿Alguien puede ayudarme a obtener coordenadas de píxeles reales para las secciones PDF Begintext? Estoy usando PDFBox para recuperar textos de los archivos PDF, pero ahora necesito obtener rectas que salgan a las secciones/párrafos de texto.

$contents = $page->getContents();
$contentsStream = $page->getContents()->getStream();
$resources=$page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
$tokens=$contentsStream->getStreamTokens();
  • PDFoperator {Q}, Cosfloat {690.48}, Cosint {0}, cosint {0}, cosfloat {633.6}, cosint {0}, cosint {0}, pdfoperator {cm}, cosname {im1}, pdfoperator {do}} , PDFoperator {Q},

  • Pdfoperator {bt}, cosint {1}, cosint {0}, cosint {0}, cosint {1}, cosfloat {25.92}, cosfloat {588.48}, pdfoperator {tm}, cosint {99}, pdfoperator {tz},, Cosname {f30}, cosint {56}, pdfoperator {tf}, cosint {3}, pdfoperator {tr}, cosfloat {0.334}, pdfoperator {tc}, costra {pospremanj}, pdfoperator {tj}, cosint {0}, Pdfoperator {tc}, costring {e}, pdfoperator {tj}, cosfloat {9.533}, pdfoperator {tw}, costring {i}, pdfoperator {tj}, cosfloat {6.062}, pdfoperator {tw}, cosfloat { Pdfoperator {tc}, costring {ciscenj}, pdfoperator {tj}, cosint {0}, pdfoperator {tc}, costra {e}, pdfoperator {tj}, cosint {1}, cosint {0}, cosint {0},, COSInt{1}, COSFloat{55.68}, COSFloat{539.76}, PDFOperator{Tm}, COSInt{0}, PDFOperator{Tw}, COSFloat{0.262}, PDFOperator{Tc}, COSString{uoè}, PDFOperator{Tj}, Cosint {0}, pdfoperator {tc}, Cosstring {i}, pdfoperator {tj}, cosfloat {5.443}, pdfoperator {tw}, cosfloat {-2.145}, pdfoperator {tc}, cosstring {zimslco}, pdfoperator}, pdj}}}, pdJ}, pdfoperator. , Cosint {0}, PDFoperator {TC}, Cosstring {G}, PDFoperator {TJ}, Cosfloa t {7.202}, pdfoperator {tw}, cosfloat {-0.148}, pdfoperator {tc}, cubstring {odmor}, pdfoperator {tj}, cosint {0}, pdfoperator {tc}, cosstring {a}, pdfoperator {tJ}}}} , PDFoperator {et},

  • Pdfoperator {bt}, cosint {1}, cosint {0}, cosint {0}, cosint {1}, cosfloat {6.72}, cosfloat {513.12}, pdfoperator {tm}, cosint {0}, pdfoperator {tw},, Cosname {F30}, Cosint {14}, PDFoperator {tf}, Cosstring {}, PDFoperator {TJ}, Cosfloat {2.751}, PDFoperator {TW}, ...

Me gustaría obtener la salida algo como la función PrintTextLocations hace para cada palabra/carácter. Puedo coordinar la parte inferior y la izquierda, pero ¿cómo obtener el ancho y la coordenada superior?

PrintTextLocations:

  • cadena [25.92,45.119995 fs = 56.0 xscale = 55.440002 Height = 40.208004 Space = 15.412322 ancho = 36.978485] P String [63.22914,45.119995 FS = 56.0 xscale = 55.440002 altura = 408004 Espacio = 15. = 56.0 xscale = 55.440002 altura = 40.208004 espacio = 15.412322 ancho = 30.824646] s cadena [128.58894,45.1199995 fs = 56.0 xscale = 55.440002 altura = 42.168 espacio = 15.412322 Width = 33.8733844 = 42.168 Space = 15.412322 Width = 21.566162] R String [184.69026,45.1199995 fs = 56.0 xscale = 55.440002 Height = 42.168 Space = 15.412322 Width = 30.824646] E Cadena [215.84557,45.11999999999 = 49.286148] m ...
¿Fue útil?

Solución

... Como la sección BT le ofrece coordenadas de la izquierda, debe analizar todas las palabras/letras contenidas en el bloque BT actual para obtener todas las demás coordenadas. Altura de la primera palabra + Bot Bottom = TOP, MAX (COORDINACIÓN IZQUIERDA + ANCHO) = DERECHA, LISTA PALABRA Bottom = Bottom Coordinate.

Espero que esto ayude a alguien...

Cadena de ejemplo para una sola letra:

string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p

Línea extraída, analizada y preparada:

32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999

Función:

/**
 * Parse single word / letter element
 *
 * @param string $str_raw  Extracted word string line.
 * @param string $str_elem Element of interest, word, char.
 * @param int    $pdf_w    Pdf page width.
 * @param int    $pdf_h    Pdf page height.
 * @param int    $pdf_d    Pdf page dpi.
 * @param int    $pdf_r    Pdf page relative dpi.
 *
 * @return array
 */
function createRealCoordinates($str_raw, $str_elem, $pdf_w, $pdf_h, $pdf_d = 400, $pdf_r = 72)
{
    $stringstrip = array('fs=', 'xscale=', 'height=', 'space=', 'width=');
    $string_info = str_replace($stringstrip, '', $str_raw);

    $coord_info = explode(' ', $string_info);
    $coord_xy   = explode(',', $coord_info[0]);

    $coord = array(
        'pdfWidth'  => $pdf_w,
        'pdfHeight' => $pdf_h,
        'pdfDpi'    => $pdf_d,
        'pdfRel'    => $pdf_r,
        'word'      => $str_elem,

        'x1' => null,
        'y1' => null,
        'x2' => null,
        'y2' => null,

        'fontSize'     => null,
        'xScale'       => null,
        'HeightDir'    => null,
        'WidthDir'     => null,
        'WidthOfSpace' => null,
    );

    // Left, Bottom coordinate.
    $coord['x1'] = ($coord_xy[0] / $pdf_r) * $pdf_d;
    $coord['y2'] = ($coord_xy[1] / $pdf_r) * $pdf_d;

    $coord['fontSize']     = $coord_info[1]; // font size.
    $coord['xScale']       = $coord_info[2]; // x size scale.
    $coord['HeightDir']    = $coord_info[3]; // height.
    $coord['WidthDir']     = $coord_info[5]; // word width.
    $coord['WidthOfSpace'] = ($coord_info[4] / $pdf_r) * $pdf_d; // width of space.

    // Right, Top coordinate.
    $coord['x2'] = $coord['x1'] + (($coord['WidthDir'] / $pdf_r) * $pdf_d);
    $coord['y1'] = $coord['y2'] - (($coord['HeightDir'] / $pdf_r) * $pdf_d);

    return $coord;
}

-Matija Kancijan

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top