Question

I basically have a file in which I wish to search for a specific hex value (header). Once it is found, I want to read everything from that location (header) until a specific hex value (footer) is found.

I have some starting code:

import binascii

holdhd = ""
holdft = ""
header = "03AABBCC"
footer = "FF00FFAA"

with open ('hexfile', 'rb') as file:    
    bytes = file.read()
    a = binascii.hexlify(bytes)     
    while header in a:      
        holdhd = header     
        print holdhd

This successfully prints the header I wish to find (there are multiple headers in the file). However, I am unsure how to proceed with reading the file from this point and printing everything until the footer is found.

Thanks in advance


Solution 2

If the files are small enough to load into memory, you can treat them as regular strings and use the str.find method to navigate them.

Let's take the worst-case scenario: you have no guarantee that your header will be the first thing in the file, and you might have more than one body (more than one <header><body><footer> block). I have created a file called bindata.txt with the following content:

ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD

Ok, there are two bodies, the first being AAAAAA and the second BBBBBB, with some junk at the beginning, middle and end (ABCD before the first header, ABCDABCD before the second header and ABCD after the second footer).

Playing with the find method of the str object and the indexes, here's what I came up with:

header = "000100a0"
footer = "00000000000"

with open('bindata.txt', 'r') as f:
    data = f.read()

print("Data: %s" % data)
position = 0
while True:
    # Look for the next header at or after the current position
    header_index = data.find(header, position)
    if header_index < 0:
        break
    # Look for the footer only after the end of that header
    body_start = header_index + len(header)
    footer_index = data.find(footer, body_start)
    if footer_index < 0:
        break
    print("Found header at %s and footer at %s"
          % (header_index, footer_index))
    print("body: %s" % data[body_start:footer_index])
    # Resume the search just past this footer
    position = footer_index + len(footer)

That outputs:

Data: ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD
Found header at 4 and footer at 18
body: AAAAAA
Found header at 38 and footer at 52
body: BBBBBB

If your files are too big to keep in memory, I think the best option is to read the file byte by byte and create a couple of functions that find where the header ends and the footer starts, using the file's seek and tell methods.
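A rough sketch of that idea, written for Python 3. It makes a few assumptions that are not in the original answer: it reads fixed-size chunks instead of single bytes (for speed), it keeps a small buffer instead of calling seek and tell, and each individual body is assumed to fit in memory. The names iter_bodies and chunk_size are made up for illustration.

```python
import binascii
import io


def iter_bodies(stream, header, footer, chunk_size=64 * 1024):
    """Yield the bytes found between each header and the next footer.

    The stream is read chunk by chunk, so the whole file never has to be
    loaded at once (only the body currently being collected is buffered).
    """
    if not header or not footer:
        raise ValueError("header and footer must be non-empty")
    buf = b""
    in_body = False
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        while True:
            if not in_body:
                i = buf.find(header)
                if i < 0:
                    # Keep only a tail that could be the start of a header
                    # split across two chunks.
                    keep = len(header) - 1
                    buf = buf[-keep:] if keep else b""
                    break
                buf = buf[i + len(header):]
                in_body = True
            else:
                j = buf.find(footer)
                if j < 0:
                    break  # the body continues in the next chunk
                yield buf[:j]
                buf = buf[j + len(footer):]
                in_body = False
        if at_eof:
            return


# Hypothetical demo data: two blocks with junk before, between and after.
header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD"
    "000100a0AAAAAA000000000000"
    "ABCDABCD"
    "000100a0BBBBBB000000000000"
    "ABCD"
)

# A tiny chunk size forces headers and footers to straddle chunk borders.
bodies = list(iter_bodies(io.BytesIO(sample), header, footer, chunk_size=3))
```

With a real file you would pass the object returned by open(path, 'rb') instead of the io.BytesIO wrapper.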

EDIT:

As per OP's request, here is a method that works on the raw binary data (no need to hexlify) and uses seek and tell:

import binascii
import mmap

header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify("ABCD"
                "000100a0AAAAAA000000000000"
                "ABCDABCD"
                "000100a0BBBBBB000000000000"
                "ABCD")

# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)

# sample done. Now we have REAL binary data in sample.data

with open('sample.data', 'rb') as f:
    print("Data: %s" % binascii.hexlify(f.read()).decode("ascii"))
    # access=mmap.ACCESS_READ works on both Unix and Windows
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    current_offset = 0
    while True:
        header_index = mm.find(header, current_offset)
        if header_index < 0:
            break
        # Search for the footer only after the end of this header
        footer_index = mm.find(footer, header_index + len(header))
        if footer_index < 0:
            break
        print("Found header at %s and footer at %s"
              % (header_index, footer_index))
        mm.seek(header_index + len(header))
        body = mm.read(footer_index - mm.tell())
        print("body: %s" % binascii.hexlify(body).decode("ascii"))
        # mm.tell() now points at the start of the footer; skip past it
        current_offset = mm.tell() + len(footer)
    mm.close()

This method produces the following output:

Data: abcd000100a0aaaaaa000000000000abcdabcd000100a0bbbbbb000000000000abcd
Found header at 2 and footer at 9
body: aaaaaa
Found header at 19 and footer at 26
body: bbbbbb
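For what it's worth, the same search can be written with slicing on the mmap object instead of seek and tell. The following is only a sketch for Python 3: the function name mmap_bodies and the temporary file are assumptions for illustration. Note that mmap cannot map an empty file, so the function would raise ValueError for one.

```python
import binascii
import mmap
import os
import tempfile


def mmap_bodies(path, header, footer):
    """Return (header_index, footer_index, body) for every block in the file."""
    blocks = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            position = 0
            while True:
                header_index = mm.find(header, position)
                if header_index < 0:
                    break
                body_start = header_index + len(header)
                # Only accept a footer that comes after this header.
                footer_index = mm.find(footer, body_start)
                if footer_index < 0:
                    break
                # Slicing an mmap returns a bytes copy of that range.
                blocks.append((header_index, footer_index,
                               mm[body_start:footer_index]))
                position = footer_index + len(footer)
    return blocks


header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD"
    "000100a0AAAAAA000000000000"
    "ABCDABCD"
    "000100a0BBBBBB000000000000"
    "ABCD"
)

# Write the sample to a throwaway file, search it, then clean up.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(sample)
    blocks = mmap_bodies(path, header, footer)
finally:
    os.remove(path)
# blocks == [(2, 9, b'\xaa\xaa\xaa'), (19, 26, b'\xbb\xbb\xbb')]
```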

Note that I used Python's mmap module to help move through the file. Please take a look at its documentation. Also, the first part of this example contains some data to create an actual binary file in sample.data. The execution of the chunk:

# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)

Produces the following (really human-readable) file:

borrajax@borrajax:~/Documents/Tests$ cat ./sample.data 
�������ͫ�������

OTHER TIPS

Given the file size, you might want to load everything into memory (keeping the data as bytes), then use a regex to extract the part between the header and the footer, e.g.:

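import binascii
import re

header = binascii.unhexlify('000100a0')
# unhexlify needs an even number of hex digits; this is five zero bytes
footer = binascii.unhexlify('0000000000')

with open('hexfile', 'rb') as fin:
    raw_data = fin.read()

# Build the pattern by concatenation so it stays a byte string, and use
# re.DOTALL so '.' also matches newline bytes inside the binary body
pattern = re.escape(header) + b'(.*?)' + re.escape(footer)
match = re.search(pattern, raw_data, re.DOTALL)
data = match.group(1) if match else None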
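The search above returns only the first block. Since the question mentions several headers in the file, here is a sketch that collects every body with re.finditer. It is written for Python 3; the function name find_bodies, the sample bytes and the five-byte footer are assumptions for illustration.

```python
import binascii
import re


def find_bodies(raw_data, header, footer):
    """Return the bytes between every header/footer pair in raw_data."""
    pattern = re.compile(
        re.escape(header) + b"(.*?)" + re.escape(footer),
        re.DOTALL,  # let '.' match newline bytes too
    )
    return [m.group(1) for m in pattern.finditer(raw_data)]


header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD"
    "000100a0AAAAAA000000000000"
    "ABCDABCD"
    "000100a0BBBBBB000000000000"
    "ABCD"
)

bodies = find_bodies(sample, header, footer)
# bodies == [b'\xaa\xaa\xaa', b'\xbb\xbb\xbb']
```

The non-greedy (.*?) stops at the first footer after each header, so two adjacent blocks are not merged into one.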
Licensed under: CC-BY-SA with attribution