Extracting a stream from a PDF in Python

https://stackoverflow.com/questions/429437

06-07-2019

Question

How can I extract part of this stream (the part named BLABLABLA) from the PDF file that contains it?

<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources<</ColorSpace<</CS0 563 0 R>>/ExtGState<</GS0 568 0 R>>/Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>/ProcSet[/PDF/Text/ImageC]/Properties<</MC0<</BLABLABLA 584 0 R>>/MC1<</SubKey 582 0 R>>>>/XObject<</Im0 578 0 R>>>>/Rotate 0/StructParents 0/Type/Page>>

Or, in other words, how can I extract a subkey from a PDF stream?

I would like to use a Python library (such as pyPdf or ReportLab), but a C/C++ library would also work for me.

Can anyone help me?

Solution

If I understand correctly, a stream in a PDF is just a sequence of binary data. I think you want to extract part of an object. Is it a standard object, such as an image or text? It would be much easier to give example code if there were a real example.
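
In the dictionary from the question, the value of /BLABLABLA is not the data itself but an indirect reference (584 0 R) to another object. Below is a minimal Python 3 sketch, standard library only, that pulls that reference out of the raw dictionary text. The helper name find_reference is hypothetical, the dictionary is a shortened copy of the one in the question, and the regex only handles the plain "/Key N G R" form shown there; it is not a full PDF parser.

```python
import re

# Shortened copy of the page dictionary from the question.
PAGE_DICT = (
    b"<</Contents 583 0 R/Resources<</Properties<</MC0<</BLABLABLA 584 0 R>>"
    b"/MC1<</SubKey 582 0 R>>>>/XObject<</Im0 578 0 R>>>>/Type/Page>>"
)


def find_reference(dictionary, key):
    """Return (object_number, generation) for '/key N G R', or None.

    Hypothetical helper: it only handles a name that is followed
    directly by an indirect reference, as in the question's example.
    """
    pattern = rb"/" + re.escape(key) + rb"\s+(\d+)\s+(\d+)\s+R"
    match = re.search(pattern, dictionary)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(find_reference(PAGE_DICT, b"BLABLABLA"))  # (584, 0)
```

The returned pair identifies the object ("584 0 obj") that holds the actual stream.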

This might help get you started:

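import pyPdf

# Open in binary mode: a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages)  # Walk every page so pyPdf resolves the page objects.
print(pdf.resolvedObjects)  # Objects pyPdf has resolved so far.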
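
If pyPdf is not available, the stream behind an object number (584 in the question) can also be read straight from the file bytes. The following Python 3 sketch rests on several assumptions: the object sits uncompressed in the file body (not inside an object stream), /Length is a direct integer, and the only filter is /FlateDecode. The function name read_stream and the sample bytes are made up for illustration.

```python
import re
import zlib


def read_stream(pdf_bytes, obj_num, gen=0):
    """Return the decoded stream data of object 'obj_num gen obj'.

    Hypothetical helper for simple files only; see the assumptions above.
    """
    header = re.search(
        rb"(?<!\d)%d\s+%d\s+obj\s*<<(.*?)>>\s*stream\r?\n" % (obj_num, gen),
        pdf_bytes,
        re.DOTALL,
    )
    if header is None:
        raise KeyError("object %d %d not found" % (obj_num, gen))
    dictionary = header.group(1)
    if b"endobj" in dictionary:
        # The match ran into a later object, so this one has no stream.
        raise ValueError("object %d %d is not a stream" % (obj_num, gen))
    # Accept only a direct integer, not an indirect '/Length N G R'.
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", dictionary)
    if length is None:
        raise ValueError("no direct /Length entry")
    start = header.end()
    data = pdf_bytes[start:start + int(length.group(1))]
    if b"/FlateDecode" in dictionary:
        data = zlib.decompress(data)
    return data


# Tiny in-memory stand-in for a PDF file containing object 584.
payload = b"BT /TT0 12 Tf (hello) Tj ET"
compressed = zlib.compress(payload)
sample = (
    b"%PDF-1.4\n584 0 obj\n<</Filter/FlateDecode/Length "
    + str(len(compressed)).encode("ascii")
    + b">>\nstream\n" + compressed + b"\nendstream\nendobj\n"
)
print(read_stream(sample, 584))
```

For a real file, pdf_bytes would come from open("pdffile.pdf", "rb").read(). Files that use object streams or other filters need a proper PDF library.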

Licensed under: CC-BY-SA with attribution
