Question

Data is often stored in program-specific binary files for which there is little or no documentation. A typical example in our field is data that comes from an instrument, but I suspect the problem is general. What methods are there for trying to understand and interpret the data?

To set some boundaries: the files are not encrypted and there is no DRM. The type and format of the file is specific to the writer of the program (i.e. it is not a "standard" file - such as *.tar - whose identity has been lost). There is (probably) no deliberate obfuscation, but there may be some amateur efforts to save space. We can assume that we have a general knowledge of what the data is, and we may recognize some, but probably not all, of the fields and arrays.

Assume that the majority of the data is numeric, with scalars and arrays (probably 1- and 2-dimensional and sometimes irregular or triangular). There will also be some character strings, probably names of people, sites, dates and maybe some keywords. There will be code in the program that reads the binary file, but we do not have access to the source or the assembler. As an example, it may have been written by a VAX Fortran program, by an early Unix program, or by Windows as OLE objects. The numbers may be big- or little-endian (which is not known at the start) but it is probably consistent. We may have different versions on different machines (e.g. Cray).

We can assume we have a reasonably large corpus of files - some hundreds, say.

We can assume two scenarios:

  1. We can rerun the program with different inputs so we can do experiments.
  2. We cannot rerun the program - we have a fixed set of documents. This has a gentle similarity to decoding historical documents in an unknown language (e.g. Linear B).

A partial solution may be acceptable - i.e. there may be some fields that no living person now understands, but most of the others are interpretable.

I am only interested in Open Source approaches.

UPDATE: There is a related SO question (How to reverse engineer binary file formats for compatibility purposes), but the emphasis is somewhat different.

UPDATE: Clever suggestion from @brianegge to address scenario (1): use truss (or possibly strace on Linux) to dump all write() and similar calls in the program. This should allow at least the collection of records written to disk.


Solution

This is an interesting question. I think the answer is that reverse-engineering binary formats is an acquired skill, but there are tools out there that can help.

One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It supports various endian formats (though not PDP, I think), viewing data at various widths and offsets, defining array areas (maps), and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.

It's a commercial tool, but the free demo will let you do everything except save changes to the binary and use the engine-management-specific features, which you don't need anyway. You said you're only interested in open-source solutions, but this is Stack Overflow and someone else might not be so picky.

OTHER TIPS

Almost all such files have a header. Start from there: see what similarities you have between two files, eliminate the common "signatures", and work with the differences. The differences should mark things like the number of records, the export date and similar per-file values.

Common parts between the two headers can just be considered general signatures, and I guess you can ignore them.
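To make that comparison concrete, here is a minimal Python 3 sketch (the command-line file names and the cutoff of 50 reported differences are arbitrary placeholders) that prints the offsets at which two files differ. A long identical prefix is usually the shared header; the first cluster of differences often corresponds to per-file fields such as record counts and dates.

#!/usr/bin/env python3
"""Byte-level diff of two binary files: print the offsets where they differ."""
import sys
from itertools import zip_longest

def diff_files(path_a, path_b, limit=50):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    print(f"{path_a}: {len(a)} bytes, {path_b}: {len(b)} bytes")
    shown = 0
    for offset, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            left = "--" if x is None else f"{x:02x}"
            right = "--" if y is None else f"{y:02x}"
            print(f"0x{offset:08x}: {left} != {right}")
            shown += 1
            if shown >= limit:
                print("... (truncated)")
                break

if __name__ == "__main__":
    diff_files(sys.argv[1], sys.argv[2])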

If you are on a system which offers truss, simply watch your system calls to write and you'll probably have a good idea. It's also possible that the program is going to mmap a file and copy directly from memory, but that's less common.

$ truss -t write echo foo
foowrite(1, " f o o", 3)                                = 3
write(1, "\n", 1)                               = 1

It also may make sense to take a look at the binary. On Unix systems, you can use objdump to view the layout of the binary. This will point to the code and data sections. You can then open the binary in a hex editor and go to the specific offsets. You may be interested in my tips for Solaris binary files.
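If you prefer a script to an interactive hex editor, dumping an arbitrary region of a file (the program binary or a data file) takes only a few lines of Python; the offset and length below are just whatever region you want to look at, e.g. an offset reported by objdump. This is only a sketch, not a substitute for a proper hex editor.

#!/usr/bin/env python3
"""Hex/ASCII dump of a region of a file, for inspecting offsets found elsewhere."""
import sys

def hexdump(path, offset=0, length=256, width=16):
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"0x{offset + i:08x}  {hexpart:<{width * 3}} {text}")

if __name__ == "__main__":
    # usage: hexdump.py FILE OFFSET LENGTH (offset and length accept 0x... or decimal)
    hexdump(sys.argv[1], int(sys.argv[2], 0), int(sys.argv[3], 0))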

  • Diff 2 or more files to look for similarities. This often helps you identify header blocks and different sections of the file.

  • Endianness is usually pretty easy to work out - more-significant bytes tend to be zero a lot more often than less-significant ones, so if you see a pattern like "00 78" or "78 00" you can make a good guess at which byte is the MSB (see the endianness-guessing sketch after this list). However, this is only of any help once you have worked out (roughly) what the preceding data is, so that you know how the data is aligned.

  • Look for easily identified data - strings are the first place to start because you can spot them easily. These often give you clues, as they are usually embedded near related data, used as standard items in headers, etc. If the strings are Unicode then you will usually see the letters of the text separated by zero bytes, which will help you identify the endianness and the data alignment at that point in the file (see the string-scanning sketch after this list).

  • A common format approach (like IFF) is to store chunks of data, each with a small header (e.g. a 2- or 4-byte ID, then a 2- or 4-byte size for the block, then the data of the block). In general people use meaningful (to them) chunk IDs, so they can be easy to spot. If you find what looks like a tag, check the following data to see if it looks like a length (look that many bytes further on in the data to see if there is another header). If you can identify such a format, you break the "one large file" problem down into a "many small files" problem, which makes it much easier (see the chunk-walking sketch after this list). However, a lot of device data tends to be "optimised" to make it compact, in which case programmers often throw away convenient extensible formats and cram everything together, packing bits and generally making things much more difficult for you.

  • Look for known values. If your device is displaying "temperature: 40" then it's possible that you will find that value directly stored in the file. It's also common to use scaling factors or fixed-point values, so 40 may be represented as, e.g., 40*10 = 400 or 40*256 = 10240 (the value-search sketch after this list tries a few such encodings).

  • If you can control the device enough: create some simple files. What you're trying to achieve is the smallest files you can get out of the device to minimise the data you have to examine. Then make a change on the device that causes the file to change - try to minimise the number of changes - and grab the file again. If the file format is "open" (not compressed or encrypted) then you should be able to identify the bytes that have changed.

  • If you can "load" files back onto the device you may also be able to create your own files, just changing one value to see if you can notice any change of behaviour on the device. If you manage to hit simple values this can work well, but often you may find you just break the file format and the device won't be able to read the data at all.
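To put some of the tips above into practice, a few rough Python sketches follow. None of them are definitive; all of the word sizes, formats, offsets and thresholds are assumptions you will need to adjust for your own files. First, the endianness guess: over a region you already suspect holds small integers, simply count which byte of each word is zero more often.

#!/usr/bin/env python3
"""Crude endianness guess for a region assumed to hold small 16-bit integers:
the more-significant byte should be zero far more often than the less-significant one."""
import sys

def guess_endianness(path, offset=0, length=4096, word_size=2):
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    zero_first = zero_last = words = 0
    for i in range(0, len(data) - word_size + 1, word_size):
        w = data[i:i + word_size]
        words += 1
        if w[0] == 0:    # leading byte zero -> consistent with big-endian (MSB first)
            zero_first += 1
        if w[-1] == 0:   # trailing byte zero -> consistent with little-endian (MSB last)
            zero_last += 1
    print(f"{words} words: leading byte zero {zero_first} times, trailing byte zero {zero_last} times")
    print("guess:", "big-endian" if zero_first > zero_last else "little-endian")

if __name__ == "__main__":
    guess_endianness(sys.argv[1])

As the bullet above says, this is only meaningful once you know roughly where an integer region starts and how it is aligned; it tells you nothing about text or floating-point regions.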
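Second, string scanning: this looks for runs of printable ASCII and for ASCII interleaved with zero bytes (a crude little-endian UTF-16 detector). The minimum run length of four characters is an arbitrary choice.

#!/usr/bin/env python3
"""Scan a binary file for plain-ASCII runs and for ASCII-with-zero-byte runs
(the latter usually indicate UTF-16 text; which side the zeros fall on hints at endianness)."""
import re
import sys

def find_strings(path, min_len=4):
    with open(path, "rb") as f:
        data = f.read()
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16le_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    for m in ascii_run.finditer(data):
        print(f"0x{m.start():08x} ascii   {m.group().decode('ascii')!r}")
    for m in utf16le_run.finditer(data):
        print(f"0x{m.start():08x} utf16le {m.group().decode('utf-16-le')!r}")

if __name__ == "__main__":
    find_strings(sys.argv[1])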
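Third, chunk walking: if you suspect an IFF-like layout, a walker along these lines can test the hypothesis. The 4-byte ASCII tag, the 4-byte little-endian length and the absence of padding between chunks are all guesses to tweak until the walk stops breaking down.

#!/usr/bin/env python3
"""Walk a hypothetical tag/length chunk structure: a 4-byte ASCII tag, a 4-byte
length, then that many bytes of payload."""
import struct
import sys

def walk_chunks(path, length_fmt="<I"):   # "<I" = little-endian 32-bit length; try ">I" too
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + 8 <= len(data):
        tag = data[pos:pos + 4]
        (size,) = struct.unpack_from(length_fmt, data, pos + 4)
        looks_like_tag = all(0x20 <= b < 0x7f for b in tag)
        if not looks_like_tag or pos + 8 + size > len(data):
            print(f"0x{pos:08x}: structure breaks down here (tag={tag!r}, size={size})")
            break
        print(f"0x{pos:08x}: chunk {tag.decode('ascii')!r}, {size} bytes of payload")
        pos += 8 + size

if __name__ == "__main__":
    walk_chunks(sys.argv[1])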
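Finally, the value search: a brute-force hunt for a known reading (say, the temperature shown on the instrument) encoded as 16- or 32-bit integers with a few common scale factors, or as 32- or 64-bit floats, in both byte orders. Short integer needles will produce false positives, so cross-check the reported offsets against other evidence, and note that exact float matches only work when the displayed value round-trips exactly (e.g. 40.0).

#!/usr/bin/env python3
"""Search a file for a known value in several plausible encodings."""
import struct
import sys

def search_value(path, value):
    with open(path, "rb") as f:
        data = f.read()
    candidates = []
    for scale in (1, 10, 100, 256):                 # common fixed-point scale factors
        for fmt in ("<h", ">h", "<i", ">i"):        # 16/32-bit integers, both endians
            try:
                candidates.append((f"int {fmt} x{scale}", struct.pack(fmt, int(value * scale))))
            except struct.error:
                pass                                 # scaled value does not fit this width
    for fmt in ("<f", ">f", "<d", ">d"):             # 32/64-bit floats, both endians
        candidates.append((f"float {fmt}", struct.pack(fmt, float(value))))
    for label, needle in candidates:
        start = 0
        while (pos := data.find(needle, start)) != -1:
            print(f"0x{pos:08x}: {label} ({needle.hex()})")
            start = pos + 1

if __name__ == "__main__":
    search_value(sys.argv[1], float(sys.argv[2]))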

I was hoping there was a magic utility that could work out patterns, try different endianness etc. But there doesn't seem to be!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow