Question

I'm trying to do some file carving on a disk with C++. I can't find any resources on the web about the on-disk structure of a PDF file. I can find the %PDF-1.x token at the start of a cluster, but I can't find the size of the PDF file anywhere.

Let's say, hypothetically, that the file system entry for this particular document is lost. I find the start of the document and keep reading until I run into "startxref number %%EOF". The problem is that I don't know when to stop, since there can be multiple "%%EOF" markers in the content of a document.
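For reference, this is roughly what my scan looks like (simplified; the image name "disk.img" and the 4 KB cluster size are placeholders, and markers split across a cluster boundary are missed):

    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        constexpr std::size_t kCluster = 4096;  // placeholder cluster size
        std::ifstream disk("disk.img", std::ios::binary);
        if (!disk) return 1;

        std::vector<char> buf(kCluster);
        for (std::uint64_t cluster = 0; disk.read(buf.data(), buf.size()); ++cluster) {
            // A "%PDF-" header at the start of a cluster marks a candidate file.
            if (std::memcmp(buf.data(), "%PDF-", 5) == 0)
                std::cout << "candidate PDF at cluster " << cluster << '\n';

            // "%%EOF" can occur several times inside one document, so any hit
            // here is only a *possible* end of the file.
            std::string view(buf.data(), buf.size());
            for (std::size_t pos = view.find("%%EOF"); pos != std::string::npos;
                 pos = view.find("%%EOF", pos + 1))
                std::cout << "  %%EOF in cluster " << cluster << " at offset " << pos << '\n';
        }
    }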

I've tried stopping after reading, say, 10 clusters without finding any PDF-specific keyword like "obj", "stream", "trailer", or "xref" anywhere (a sketch of that heuristic is below). But that's quite arbitrary and not a deterministic way of finding the end of the document so that I can determine its size.
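The heuristic is essentially this (the keyword list and the 10-cluster give-up window are the arbitrary parts):

    #include <array>
    #include <string>

    // A cluster "looks like PDF" if it contains any of these keywords.
    // Both the keyword list and the window size are arbitrary choices,
    // which is exactly the problem.
    bool looksLikePdf(const std::string& cluster) {
        static const std::array<const char*, 4> kKeywords = {
            "obj", "stream", "trailer", "xref"};
        for (const char* kw : kKeywords)
            if (cluster.find(kw) != std::string::npos) return true;
        return false;
    }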

I've also seen some "/Length number" entries at the start of some "obj"s, but most of the time the number doesn't seem to fit the data that follows.

Any ideas on what I can try next? Is there a way to determine the exact size of the entire document? I'm interested in recovering documents programmatically.


Solution

Since PDFs are "free format" (much like text files, but less obvious to a human trying to "read" the content), it's probably hard to piece them together if the pieces aren't in order.

A stream does have a length, which is the key to where the endstream goes (the stream keyword is followed by an end-of-line, then the data, then an end-of-line and endstream). Streams are used to introduce bitmaps and similar things [fonts, line-art data in compressed form, etc.] into the document. But if you have several 4KB segments that could all sit as the same block in the middle of a stream, there's no way to tell which order they go in, other than pasting them together and seeing which combinations look sane and which don't. Similarly, if there are several segments of streams and objects, you can't really tell which goes where.
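As a sketch of what using /Length looks like: if the stream dictionary carries a direct /Length value, you can jump straight to endstream. Note that /Length may also be an indirect reference (e.g. "/Length 12 0 R"), in which case the real number lives in another object - which is likely why the numbers you found often didn't seem to fit. A simplified parse, assuming the candidate bytes are already in memory:

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Given a buffer and the offset of a "stream" keyword, use the /Length
    // entry of the preceding dictionary to locate the matching "endstream".
    // Returns std::string::npos on failure.
    std::size_t findEndstream(const std::string& buf, std::size_t streamKw) {
        // Find the dictionary that precedes the "stream" keyword.
        std::size_t dictStart = buf.rfind("<<", streamKw);
        if (dictStart == std::string::npos) return std::string::npos;
        std::size_t lenKey = buf.find("/Length", dictStart);
        if (lenKey == std::string::npos || lenKey > streamKw) return std::string::npos;

        // Parse the integer after /Length.
        std::size_t p = lenKey + 7;
        while (p < buf.size() && std::isspace(static_cast<unsigned char>(buf[p]))) ++p;
        std::size_t length = 0;
        bool gotDigit = false;
        while (p < buf.size() && std::isdigit(static_cast<unsigned char>(buf[p]))) {
            length = length * 10 + static_cast<std::size_t>(buf[p++] - '0');
            gotDigit = true;
        }
        if (!gotDigit) return std::string::npos;

        // "/Length 12 0 R" is an indirect reference: the real value lives in
        // object 12, which a carver without an xref table cannot resolve.
        std::size_t q = p;
        while (q < buf.size() && std::isspace(static_cast<unsigned char>(buf[q]))) ++q;
        if (q < buf.size() && std::isdigit(static_cast<unsigned char>(buf[q])))
            return std::string::npos;

        // Stream data begins after the end-of-line that follows "stream".
        std::size_t dataStart = buf.find('\n', streamKw);
        if (dataStart == std::string::npos) return std::string::npos;
        ++dataStart;

        // "endstream" should appear right after /Length bytes of data.
        return buf.find("endstream", dataStart + length);
    }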

Of course, this applies to almost all types of files with "variable content" - you can find the first few kilobytes of a JPG, but knowing what the REST of the file is won't be easy - only by visually inspecting the content can you determine which blocks of bytes belong where - and if you get it wrong, you'll probably just get some random garbage.

Other tips

The open-source tool bulk_extractor has a module called scan_pdf that does pretty much what you are describing here. It can recognize the individual parts of a PDF file on a drive, automatically decompress the compressed regions, and extract text using two strategies. It will recover data from fragments of PDFs even if the xref table cannot be found.
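The basic invocation is just an output directory plus the image (both names here are placeholders; results are written under the output directory - see bulk_extractor -h for scanner-selection options in your version):

    bulk_extractor -o be_output disk.img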

Licensed under: CC-BY-SA with attribution