Question

I have been able to copy the raw data from an otherwise inaccessible USB drive into a monolithic file of about 250MB. Somewhere in that blob of bytes are about 40 Word documents.

  1. Where do I find documentation about the internal structure of Word documents such that I can parse the byte-stream, recognise where a Word doc starts and finishes and extract a copy?

  2. Are there any libraries in any programming language specific to this task?

  3. Can anyone suggest an already existing software solution to this issue?

Solution

Two approaches:

You can mount a raw image file as a volume in Linux (for example through a loop device). Provided your binary blob isn't too corrupted, you'll probably be able to walk the filesystem to find out where your files are located. Is (was) it a FAT partition or NTFS?
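If you are not sure which filesystem the blob holds, a rough check of the boot sector can help. The sketch below is only a guess-level test: it assumes the blob starts with a volume boot sector (if it starts with a partition table instead, pass the partition's byte offset), and the function name `guess_filesystem` is made up for illustration.

```python
def guess_filesystem(blob: bytes, offset: int = 0) -> str:
    """Guess the filesystem from the 512-byte boot sector at `offset`."""
    sector = blob[offset:offset + 512]
    # A boot sector ends with the signature bytes 55 AA.
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        return "unknown"
    # NTFS stores the OEM ID "NTFS    " at byte 3.
    if sector[3:11] == b"NTFS    ":
        return "NTFS"
    # FAT32 conventionally stores its type label at byte 82.
    if sector[82:90] == b"FAT32   ":
        return "FAT32"
    # FAT12/FAT16 conventionally store their type label at byte 54.
    if sector[54:59] in (b"FAT12", b"FAT16"):
        return sector[54:59].decode("ascii")
    return "unknown"
```

The FAT labels are informational fields that formatting tools normally fill in, so "unknown" does not prove the blob is unreadable.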

If that doesn't work, I'd look for this string of bytes:

D0 CF 11 E0 A1 B1 1A E1

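These are the "magic bytes" that begin legacy binary Office files (.doc, .xls, .ppt), which use the OLE2 compound file format. They might occur randomly in other data, but it's a start. You're going to run into MAJOR issues if the files are fragmented, because the pieces of one document will not be contiguous in the blob.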
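A minimal sketch of that search, in Python. It assumes the blob fits in memory (250MB should) and, by default, that files begin on 512-byte sector boundaries, which filters out most random matches; pass `align=0` if that assumption does not hold for your dump. The name `find_signature` is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def find_signature(blob: bytes, signature: bytes = OLE2_MAGIC, align: int = 512):
    """Return the offsets at which `signature` occurs in `blob`.

    If `align` is non-zero, keep only offsets that are multiples of it.
    """
    offsets = []
    pos = blob.find(signature)
    while pos != -1:
        if not align or pos % align == 0:
            offsets.append(pos)
        # Continue searching one byte past the current match.
        pos = blob.find(signature, pos + 1)
    return offsets
```

Typical use would be `find_signature(open("usb.img", "rb").read())`, where `usb.img` stands in for whatever you named the dump. Each offset is only a candidate start; the end of the file is not marked by a matching trailer, so you would carve a generous chunk from each offset and see whether Word can open it.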

Also, try recreating pieces of the document(s) in Word as they were, save the result to a file, and extract chunks to search for in the blob (using grep in binary mode or a hex editor). Provided you remember content from all parts of a file, you should be able to work out WHERE in the blob its pieces are. Piecing them back into a working DOC binary seems far-fetched, but recovering the rest of the text shouldn't be impossible.
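If you remember an exact phrase from a document, you can also search for it directly. This sketch assumes the text sits in the blob uncompressed, either as 8-bit Windows-1252 or as 16-bit UTF-16LE, so it tries both; `find_text` is a made-up helper name.

```python
def find_text(blob: bytes, phrase: str):
    """Return {encoding: [offsets]} for `phrase` in each candidate encoding."""
    hits = {}
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = phrase.encode(encoding)
        except UnicodeEncodeError:
            # The phrase has characters this encoding cannot represent.
            continue
        found = []
        pos = blob.find(needle)
        while pos != -1:
            found.append(pos)
            pos = blob.find(needle, pos + 1)
        hits[encoding] = found
    return hits
```

Hits that fall shortly after one of the magic-byte offsets are a good sign that the candidate really is the document you are after.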

OTHER TIPS

The Apache POI project has a Java library for reading and writing all kinds of MS Office docs. If the files are in the newer XML-based OOXML format (.docx), you'll be looking for the start of a zip file, as the XML is compressed inside a ZIP container; ZIP entries begin with the bytes 50 4B 03 04 ("PK").
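For the OOXML case, a rough sketch along the same lines: scan for ZIP local file headers and read the entry name stored in each one. The helper name `find_zip_entries` is hypothetical, and the idea that names such as `[Content_Types].xml` or `word/document.xml` point to a .docx is a heuristic, not a guarantee.

```python
import struct

ZIP_LOCAL_HEADER = b"PK\x03\x04"

def find_zip_entries(blob: bytes):
    """Yield (offset, entry_name) for each plausible ZIP local file header."""
    pos = blob.find(ZIP_LOCAL_HEADER)
    while pos != -1:
        header = blob[pos:pos + 30]  # the fixed part of the header is 30 bytes
        if len(header) == 30:
            # The file name length is a little-endian uint16 at byte 26.
            name_len = struct.unpack_from("<H", header, 26)[0]
            name = blob[pos + 30:pos + 30 + name_len]
            yield pos, name.decode("utf-8", errors="replace")
        pos = blob.find(ZIP_LOCAL_HEADER, pos + 1)
```

Filtering the results for names that start with `word/` narrows the candidates to Word documents rather than other ZIP archives on the drive.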

Licensed under: CC-BY-SA with attribution