If the files are small enough to load into memory, you can treat them as regular strings and use the str.find method to navigate them.
Let's take the worst-case scenario: you have no guarantee that the header will be the first thing in the file, and you might have more than one body (more than one <header><body><footer> block). I created a file called bindata.txt with the following content:
ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD
OK, there are two bodies, the first being AAAAAA and the second BBBBBB, plus some junk at the beginning, middle and end (ABCD before the first header, ABCDABCD before the second header and ABCD after the second footer).
Playing with the find method of the str object and the indexes, here's what I came up with:
header = "000100a0"
footer = "00000000000"

with open('bindata.txt', 'r') as f:
    data = f.read()

print("Data: %s" % data)

# Look for the first header/footer pair.
header_index = data.find(header, 0)
footer_index = data.find(footer, 0)

if header_index >= 0 and footer_index >= header_index:
    print("Found header at %s and footer at %s"
          % (header_index, footer_index))
    body = data[header_index + len(header):footer_index]
else:
    # Nothing found; without this the loop below would raise a NameError.
    body = None

while body is not None:
    print("body: %s" % body)
    # Resume the search right after the footer we just consumed.
    header_index = data.find(header,
                             footer_index + len(footer))
    footer_index = data.find(footer,
                             footer_index + len(footer) + len(header))
    if header_index >= 0 and footer_index >= header_index:
        print("Found header at %s and footer at %s"
              % (header_index, footer_index))
        body = data[header_index + len(header):footer_index]
    else:
        body = None
That outputs:
Data: ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD
Found header at 4 and footer at 18
body: AAAAAA
Found header at 38 and footer at 52
body: BBBBBB
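If you need this in more than one place, the same find-based walk can be packaged as a generator. This is only a sketch of the idea above: the name iter_bodies is mine (hypothetical), and it assumes the footer search should start right after the header it belongs to.

```python
def iter_bodies(data, header, footer):
    """Yield (header_index, footer_index, body) for each header...footer block.

    Works on str as well as bytes, as long as data, header and footer
    are all of the same type.
    """
    pos = 0
    while True:
        header_index = data.find(header, pos)
        if header_index < 0:
            return  # no more headers
        body_start = header_index + len(header)
        footer_index = data.find(footer, body_start)
        if footer_index < 0:
            return  # header without a matching footer: stop
        yield header_index, footer_index, data[body_start:footer_index]
        # Continue right after the footer we just consumed.
        pos = footer_index + len(footer)


data = "ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD"
for header_index, footer_index, body in iter_bodies(data, "000100a0", "00000000000"):
    print("Found header at %s and footer at %s, body: %s"
          % (header_index, footer_index, body))
```

With the sample content of bindata.txt this reports the same positions as the loop above (4/18 and 38/52).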
If your files are too big to keep in memory, I think the best approach is to read the file byte by byte and write a couple of functions that find where the header ends and where the footer starts, using the file's seek and tell methods.
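As a rough sketch of that idea, here is a variation that reads fixed-size chunks instead of single bytes and keeps a small carry-over buffer, so it needs neither seek nor tell. The names iter_bodies_stream and chunk_size are hypothetical. It assumes that each individual body fits in memory, and that a header with no footer after it can simply be dropped.

```python
import binascii
import io


def iter_bodies_stream(f, header, footer, chunk_size=4096):
    """Yield each body found between header and footer in the binary file f."""
    keep = len(header) - 1  # bytes that could be the start of a split header
    buf = b""
    while True:
        chunk = f.read(chunk_size)
        buf += chunk
        while True:
            header_index = buf.find(header)
            if header_index < 0:
                # No header yet: keep only the tail that may hold part of one.
                buf = buf[-keep:] if keep else b""
                break
            body_start = header_index + len(header)
            footer_index = buf.find(footer, body_start)
            if footer_index < 0:
                # Header seen but footer not read yet: keep from the header on.
                buf = buf[header_index:]
                break
            yield buf[body_start:footer_index]
            buf = buf[footer_index + len(footer):]
        if not chunk:
            return  # end of file


header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD" "000100a0AAAAAA000000000000" "ABCDABCD"
    "000100a0BBBBBB000000000000" "ABCD")

# io.BytesIO stands in for a real file opened with open(path, "rb").
# The tiny chunk size forces headers and footers to span chunk boundaries.
for body in iter_bodies_stream(io.BytesIO(sample), header, footer, chunk_size=3):
    print("body: %s" % binascii.hexlify(body).decode("ascii"))
```

The buffer only grows while a header has been seen and its footer has not, so memory use stays around one body plus one chunk.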
EDIT:
As per the OP's request, here is a method that works on the raw binary data (no need to hexlify it before searching) and uses seek and tell:
import binascii
import mmap

header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify("ABCD"
                            "000100a0AAAAAA000000000000"
                            "ABCDABCD"
                            "000100a0BBBBBB000000000000"
                            "ABCD")

# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)
# sample done. Now we have REAL binary data in sample.data

with open('sample.data', 'rb') as f:
    print("Data: %s" % binascii.hexlify(f.read()).decode("ascii"))
    # access=ACCESS_READ is portable; prot=PROT_READ only exists on Unix.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    current_offset = 0
    header_index = mm.find(header, current_offset)
    footer_index = mm.find(footer, current_offset + len(header))
    if header_index >= 0 and footer_index > header_index:
        print("Found header at %s and footer at %s"
              % (header_index, footer_index))
        # Jump past the header and read up to the footer.
        mm.seek(header_index + len(header))
        body = mm.read(footer_index - mm.tell())
    else:
        # Nothing found; without this the loop below would raise a NameError.
        body = None

    while body is not None:
        print("body: %s" % binascii.hexlify(body).decode("ascii"))
        # tell() now points at the start of the footer we just reached.
        current_offset = mm.tell()
        header_index = mm.find(header, current_offset + len(footer))
        footer_index = mm.find(footer, current_offset + len(footer) + len(header))
        if header_index >= 0 and footer_index > header_index:
            print("Found header at %s and footer at %s"
                  % (header_index, footer_index))
            mm.seek(header_index + len(header))
            body = mm.read(footer_index - mm.tell())
        else:
            body = None
    mm.close()
This method produces the following output:
Data: abcd000100a0aaaaaa000000000000abcdabcd000100a0bbbbbb000000000000abcd
Found header at 2 and footer at 9
body: aaaaaa
Found header at 19 and footer at 26
body: bbbbbb
Note that I used Python's mmap module to help move through the file. Please take a look at its documentation. Also, the first part of this example contains some code that creates an actual binary file, sample.data. Executing this chunk:
# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)
Produces the following (really human-readable) file:
borrajax@borrajax:~/Documents/Tests$ cat ./sample.data
�������ͫ�������
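As a side note, the mmap documentation mentions that the re module can search through a memory-mapped file, so the header/body/footer walk could also be written as one regular expression. This is a different technique from the find/seek/tell loop above, not a replacement for it. It is written for Python 3, the temporary file path is only there for the demo, and the variable names are mine.

```python
import binascii
import mmap
import os
import re
import tempfile

header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD" "000100a0AAAAAA000000000000" "ABCDABCD"
    "000100a0BBBBBB000000000000" "ABCD")

# Write the sample to a temporary file (location chosen only for this demo).
path = os.path.join(tempfile.mkdtemp(), "sample.data")
with open(path, "wb") as f:
    f.write(sample)

# Header, then the shortest possible body, then footer.
# re.DOTALL lets "." match newline bytes too, which matters for binary data.
pattern = re.compile(re.escape(header) + b"(.*?)" + re.escape(footer), re.DOTALL)

results = []
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for match in pattern.finditer(mm):
            # match.start() is the header offset, match.end(1) the footer offset.
            results.append((match.start(), match.end(1), bytes(match.group(1))))

for header_index, footer_index, body in results:
    print("Found header at %s and footer at %s" % (header_index, footer_index))
    print("body: %s" % binascii.hexlify(body).decode("ascii"))
```

The non-greedy group stops at the first footer after each header, which is the same pairing the loop above produces.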