有人可以帮助我获得PDF begintext部分的真正像素坐标吗?我正在使用PDFBox从PDF文件中检索文本,但是现在我需要使该文本段/段落的矩形使矩形保持良好状态。

$contents = $page->getContents();
$contentsStream = $page->getContents()->getStream();
$resources=$page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
$tokens=$contentsStream->getStreamTokens();
  • pdfoperator {q},cosfloat {690.48},cosint {0},cosint {0},cosfloat {633.6},cosint {0},cosint {0},{0},pdfoperator {cm} ,pdfoperator {q},

  • pdfoperator {bt},cosint {1},cosint {0},cosint {0},cosint {1},cosfloat {25.92},cosfloat {588.48},pdfoperator {tm} cosname {f30},cosint {56},pdfoperator {tf},cosint {3},pdfoperator {tr},cosfloat {0.334},pdfoperator {tc} pdfoperator {tc},coSString {e},pdfoperator {tj},cosfloat {9.533},pdfoperator {tw},cosstring {i} pdfoperator {tc},cosstring {ciscenj},pdfoperator {tj},cosint {0},pdfoperator {tc},cosstring {e},pdfoperator {tj} cosint {1},cosfloat {55.68},cosfloat {539.76},pdfoperator {tm},cosInt {0},pdfoperator {tw},cosfloat {0.262},pdfoperator {tc},pdfoperator {tc},pdfop},p.soster},p.sostr Jotrjer {uotrjetrj,fortrjetrj,costrjetrj,fortrjetrj, cosint {0},pdfoperator {tc},cosstring {i},pdfoperator {tj},cosfloat {5.443},pdfoperator {tw},cosfloat {-2.145},pdfoperator {-2.145},pdfoperator {zostra} zytr {zoster {zoser { ,cosint {0},pdfoperator {tc},cosstring {g},pdfoperator {tj},cosfloa t {7.202},pdfoperator {tw},cosfloat {-0.148},pdfoperator {tc},cosstring {odmor},pdfoperator {tj},cosint {0},pdfoperator {0},pdfoperator { ,pdfoperator {et},

  • pdfoperator {bt},cosint {1},cosint {0},cosint {0},cosint {1},cosfloat {6.72},cosfloat {513.12},pdfoperator {tm} cosname {f30},cosint {14},pdfoperator {tf},cosstring {},pdfoperator {tj},cosfloat {2.751},pdfoperator {tw},...

我想获得诸如printTextLocations函数之类的输出对每个单词/字符都可以。我可以得到底部和左协调,但是如何获得宽度和顶部坐标?

printTextLocations:

  • string[25.92,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=36.978485]p string[63.22914,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=33.87384]o string[97.43364,45.119995 fs = 56.0 xscale = 55.440002高度= 40.208004空间= 15.412322宽度= 30.824646] = 42.168空间= 15.412322宽度= 21.566162] r string [184.69026,45.119995 fs = 56.0 xscale xscale = 55.440002高度= 42.168 Space = 42.168 Space = 15.412322 width = 15.412322 width = 30.824646] e String [255.84557,44557,4457,4457,44557,4457,4457,457,457,44,457,44,457,44,457,44,44,457,44,457,444.44557,44,457,44,457,44; = 49.286148] M ...
有帮助吗?

解决方案

... BT部分为您提供左下的坐标,您需要将当前BT块中包含的所有单词/字母解析以获取所有其他坐标。第一个单词高度 + bt底部= top,max(左坐标 +宽度)=右,最后一个字底部=底部坐标。

我希望这可以帮助别人...

单个字母的示例字符串:

string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p

提取,解析和准备好线:

32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999

功能:

/**
 * Parse single word / letter element
 *
 * @param string $str_raw  Extracted word string line.
 * @param string $str_elem Element of interest, word, char.
 * @param int    $pdf_w    Pdf page width.
 * @param int    $pdf_h    Pdf page height.
 * @param int    $pdf_d    Pdf page dpi.
 * @param int    $pdf_r    Pdf page relative dpi.
 *
 * @return array
 */
function createRealCoordinates($str_raw, $str_elem, $pdf_w, $pdf_h, $pdf_d = 400, $pdf_r = 72)
{
    $stringstrip = array('fs=', 'xscale=', 'height=', 'space=', 'width=');
    $string_info = str_replace($stringstrip, '', $str_raw);

    $coord_info = explode(' ', $string_info);
    $coord_xy   = explode(',', $coord_info[0]);

    $coord = array(
        'pdfWidth'  => $pdf_w,
        'pdfHeight' => $pdf_h,
        'pdfDpi'    => $pdf_d,
        'pdfRel'    => $pdf_r,
        'word'      => $str_elem,

        'x1' => null,
        'y1' => null,
        'x2' => null,
        'y2' => null,

        'fontSize'     => null,
        'xScale'       => null,
        'HeightDir'    => null,
        'WidthDir'     => null,
        'WidthOfSpace' => null,
    );

    // Left, Bottom coordinate.
    $coord['x1'] = ($coord_xy[0] / $pdf_r) * $pdf_d;
    $coord['y2'] = ($coord_xy[1] / $pdf_r) * $pdf_d;

    $coord['fontSize']     = $coord_info[1]; // font size.
    $coord['xScale']       = $coord_info[2]; // x size scale.
    $coord['HeightDir']    = $coord_info[3]; // height.
    $coord['WidthDir']     = $coord_info[5]; // word width.
    $coord['WidthOfSpace'] = ($coord_info[4] / $pdf_r) * $pdf_d; // width of space.

    // Right, Top coordinate.
    $coord['x2'] = $coord['x1'] + (($coord['WidthDir'] / $pdf_r) * $pdf_d);
    $coord['y1'] = $coord['y2'] - (($coord['HeightDir'] / $pdf_r) * $pdf_d);

    return $coord;
}

-Matija Kancijan

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top