Question

Yesterday I had to parse a very simple binary data file - the rule is, look for two bytes in a row that are both 0xAA, then the next byte will be a length byte, then skip 9 bytes and output the given amount of data from there. Repeat to the end of the file.

My solution did work, and was very quick to put together (even though I am a C programmer at heart, I still think it was quicker for me to write this in Python than it would have been in C) - BUT, it is clearly not at all Pythonic and it reads like a C program (and not a very good one at that!)

What would be a better / more Pythonic approach to this? Is a simple FSM like this even still the right choice in Python?

My solution:

#! /usr/bin/python

import sys

f = open(sys.argv[1], "rb")

state = 0

if f:
    for byte in f.read():
        a = ord(byte)       
        if state == 0:
            if a == 0xAA:
                state = 1
        elif state == 1:
            if a  == 0xAA:
                state = 2
            else: 
                state = 0
        elif state == 2:
            count = a;
            skip = 9
            state = 3
        elif state == 3:
            skip = skip -1
            if skip == 0:
                state = 4
        elif state == 4:
             print "%02x" %a
             count = count -1 
             if count == 0:
                 state = 0
                 print "\r\n"
Was it helpful?

Solution

You could give your states constant names instead of using 0, 1, 2, etc. for improved readability.

You could use a dictionary to map (current_state, input) -> (next_state), but that doesn't really let you do any additional processing during the transitions. Unless you include some "transition function" too to do extra processing.

Or you could do a non-FSM approach. I think this will work as long as 0xAA 0xAA only appears when it indicates a "start" (doesn't appear in data).

with open(sys.argv[1], 'rb') as f:
    contents = f.read()
    for chunk in contents.split('\xaa\xaa')[1:]:
        length = ord(chunk[0])
        data = chunk[10:10+length]
        print data

If it does appear in data, you can instead use string.find('\xaa\xaa', start) to scan through the string, setting the start argument to begin looking where the last data block ended. Repeat until it returns -1.

OTHER TIPS

The coolest way I've seen to implement FSMs in Python has to be via generators and coroutines. See this Charming Python post for an example. Eli Bendersky also has an excellent treatment of the subject.

If coroutines aren't familiar territory, David Beazley's A Curious Course on Coroutines and Concurrency is a stellar introduction.

I am a little apprehensive about telling anyone what's Pythonic, but here goes. First, keep in mind that in python functions are just objects. Transitions can be defined with a dictionary that has the (input, current_state) as the key and the tuple (next_state, action) as the value. Action is just a function that does whatever is necessary to transition from the current state to the next state.

There's a nice looking example of doing this at http://code.activestate.com/recipes/146262-finite-state-machine-fsm. I haven't used it, but from a quick read it seems like it covers everything.

A similar question was asked/answered here a couple of months ago: Python state-machine design. You might find looking at those responses useful as well.

I think your solution looks fine, except you should replace count = count - 1 with count -= 1.

This is one of those times where fancy code-show-offs will come up ways of have dicts mapping states to callables, with a small driver function, but it isn't better, just fancier, and using more obscure language features.

I suggest checking out chapter 4 of Text Processing in Python by David Mertz. He implements a state machine class in Python that is very elegant.

I think the most pythonic way would by like what FogleBird suggested, but mapping from (current state, input) to a function which would handle the processing and transition.

You can use regexps. Something like this code will find the first block of data. Then it's just a case of starting the next search from after the previous match.

find_header = re.compile('\xaa\xaa(.).{9}', re.DOTALL)
m = find_header.search(input_text)
if m:
    length = chr(find_header.group(1))
    data = input_text[m.end():m.end() + length]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top