Extracting an IndirectObject with pyPdf

https://stackoverflow.com/questions/436474

22-07-2019

Question

Following this example, I can list all the elements of a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))  # PDFs are binary, so open with "rb"
list(pdf.pages)  # Process all the objects.
print(pdf.resolvedObjects)

Now I need to extract a non-standard object from the PDF file.

My object is the one named MYOBJECT, and it is a string.

The piece printed by the Python script that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The PDF file contains this:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the 584 value in order to get at my string (using pyPdf, of course)?

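Solution

Each element of pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing it on its own, or poke at it with help and dir at a Python prompt, to learn more about how to get the string you want.

Edit:

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and its value can be retrieved with getData().

The following function gives a more generic way to solve this, by recursively looking for the key in question: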

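import pyPdf

pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))  # open in binary mode
pages = list(pdf.pages)  # process all the objects

def findInDict(needle, haystack):
    # Depth-first search for the first value stored under the key `needle`.
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be read.
            continue
        if key == needle:
            return value
        # Descend into plain dicts and pyPdf dictionary objects alike.
        if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None  # key not found anywhere below this level

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()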
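
To try the recursive search without a PDF at hand, here is a minimal sketch that runs on plain nested dictionaries shaped like the question's resolvedObjects output. FakeStream and find_in_dict are hypothetical stand-ins written for this illustration; they are not part of pyPdf, and a real stream object may behave differently.

```python
class FakeStream(dict):
    """Hypothetical stand-in for a PDF stream object; not a pyPdf class."""

    def __init__(self, data):
        super(FakeStream, self).__init__({'/Length': len(data)})
        self._data = data

    def getData(self):
        # Return the payload, mimicking the getData() call used above.
        return self._data


def find_in_dict(needle, haystack):
    """Depth-first search for the first value stored under key `needle`."""
    for key in haystack:
        value = haystack[key]
        if key == needle:
            return value
        if isinstance(value, dict):
            found = find_in_dict(needle, value)
            if found is not None:
                return found
    return None


# Same nesting as in the question: generation -> object number -> object.
resolved = {
    0: {
        558: {
            '/Type': '/Page',
            '/Resources': {
                '/Properties': {
                    '/MC0': {'/MYOBJECT': FakeStream('1_22_4_1')},
                },
            },
        },
    },
}

answer = find_in_dict('/MYOBJECT', resolved).getData()
missing = find_in_dict('/NOSUCHKEY', resolved)
print(answer)
```

The search stops at the first match, so if the same key appeared under several objects only one of them would be returned.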

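Other tips

An IndirectObject refers to an actual object (it is like a link or alias, so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.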
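
As a rough illustration of the link-or-alias idea, the sketch below models a reference that is resolved through a lookup table. ToyDocument and ToyReference are hypothetical names invented for this example; this is not how pyPdf is implemented, only the general pattern of storing an object once and pointing to it from several places.

```python
class ToyDocument(object):
    """Hypothetical container holding objects by (object number, generation)."""

    def __init__(self):
        self.objects = {}

    def add(self, idnum, generation, obj):
        # Store the object once and hand back a lightweight reference to it.
        self.objects[(idnum, generation)] = obj
        return ToyReference(idnum, generation, self)


class ToyReference(object):
    """Hypothetical link to an object stored elsewhere in the document."""

    def __init__(self, idnum, generation, document):
        self.idnum = idnum
        self.generation = generation
        self.document = document

    def __repr__(self):
        return "ToyReference(%r, %r)" % (self.idnum, self.generation)

    def getObject(self):
        # Follow the link to the stored object.
        return self.document.objects[(self.idnum, self.generation)]


doc = ToyDocument()
ref = doc.add(584, 0, "1_22_4_1")

# Two dictionaries share one reference; the payload itself is stored only once.
mc0 = {'/MYOBJECT': ref}
mc1 = {'/Copy': ref}

print(repr(mc0['/MYOBJECT']))
print(mc0['/MYOBJECT'].getObject())
```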

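If the object is a text object, then just calling str() or unicode() on the object should get you the data inside it.

Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object: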

13 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj

Can be read with this:

>>> import pyPdf
>>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
>>> pages = list(pdf.pages)
>>> pdf.resolvedObjects
{0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
>>> pdf.resolvedObjects[0][13]
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}

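Jehiah's method is good if you need to look everywhere for the object. My guess (from looking at the PDF) is that it is always in the same place (on the first page, under 'MC0', and so on), so a much simpler way to find the string would be: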

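import pyPdf
pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))  # open in binary mode
# Walk the fixed path from the first page down to the stream and read its data.
pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()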
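
A fixed path like this raises an error as soon as one key is absent, for example on a page without an '/MC0' entry. The sketch below shows one possible way to walk such a path defensively. The helper name dig is hypothetical, and it is demonstrated on a plain dictionary that only imitates the page layout from the question.

```python
def dig(obj, path, default=None):
    """Follow `path` through nested containers; return `default` if a step fails."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, bad index, or a value that cannot be indexed.
            return default
    return obj


# Plain-dict imitation of the page dictionary from the question.
page = {
    '/Type': '/Page',
    '/Resources': {
        '/Properties': {
            '/MC0': {'/MYOBJECT': '1_22_4_1'},
        },
    },
}

found = dig(page, ['/Resources', '/Properties', '/MC0', '/MYOBJECT'])
missing = dig(page, ['/Resources', '/Properties', '/MC9', '/MYOBJECT'])
print(found)
```

With a real page object, the value at the end of the path would still need getData() called on it, as in the snippet above.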

License: CC-BY-SA with attribution

Not affiliated with StackOverflow