Question

I have a config file that is in the following form:

protocol sample_thread {
    { AUTOSTART 0 }
    { BITMAP thread.gif }
    { COORDS {0 0} }
    { DATAFORMAT {
        { TYPE hl7 }
        { PREPROCS {
            { ARGS {{}} }
            { PROCS sample_proc }
        } }
    } } 
}

The real file may not have these exact fields, and I'd rather not have to describe the the structure of the data is to the parser before it parses.

I've looked for other configuration file parsers, but none that I've found seem to be able to accept a file of this syntax.

I'm looking for a module that can parse a file like this, any suggestions?

If anyone is curious, the file in question was generated by Quovadx Cloverleaf.

Was it helpful?

Solution

pyparsing is pretty handy for quick and simple parsing like this. A bare minimum would be something like:

import pyparsing
string = pyparsing.CharsNotIn("{} \t\r\n")
group = pyparsing.Forward()
group << pyparsing.Group(pyparsing.Literal("{").suppress() + 
                         pyparsing.ZeroOrMore(group) + 
                         pyparsing.Literal("}").suppress()) 
        | string

toplevel = pyparsing.OneOrMore(group)

The use it as:

>>> toplevel.parseString(text)
['protocol', 'sample_thread', [['AUTOSTART', '0'], ['BITMAP', 'thread.gif'], 
['COORDS', ['0', '0']], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', 
[['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]

From there you can get more sophisticated as you want (parse numbers seperately from strings, look for specific field names etc). The above is pretty general, just looking for strings (defined as any non-whitespace character except "{" and "}") and {} delimited lists of strings.

OTHER TIPS

Taking Brian's pyparsing solution another step, you can create a quasi-deserializer for this format by using the Dict class:

import pyparsing

string = pyparsing.CharsNotIn("{} \t\r\n")
# use Word instead of CharsNotIn, to do whitespace skipping
stringchars = pyparsing.printables.replace("{","").replace("}","")
string = pyparsing.Word( stringchars )
# define a simple integer, plus auto-converting parse action
integer = pyparsing.Word("0123456789").setParseAction(lambda t : int(t[0]))
group = pyparsing.Forward()
group << ( pyparsing.Group(pyparsing.Literal("{").suppress() +
    pyparsing.ZeroOrMore(group) +
    pyparsing.Literal("}").suppress())
    | integer | string )

toplevel = pyparsing.OneOrMore(group)

sample = """
protocol sample_thread {
    { AUTOSTART 0 }
    { BITMAP thread.gif }
    { COORDS {0 0} }
    { DATAFORMAT {
        { TYPE hl7 }
        { PREPROCS {
            { ARGS {{}} }
            { PROCS sample_proc }
        } }
    } } 
    }
"""

print toplevel.parseString(sample).asList()

# Now define something a little more meaningful for a protocol structure, 
# and use Dict to auto-assign results names
LBRACE,RBRACE = map(pyparsing.Suppress,"{}")
protocol = ( pyparsing.Keyword("protocol") + 
             string("name") + 
             LBRACE + 
             pyparsing.Dict(pyparsing.OneOrMore(
                pyparsing.Group(LBRACE + string + group + RBRACE)
                ) )("parameters") + 
             RBRACE )

results = protocol.parseString(sample)
print results.name
print results.parameters.BITMAP
print results.parameters.keys()
print results.dump()

Prints

['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', 

[0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
sample_thread
thread.gif
['DATAFORMAT', 'COORDS', 'AUTOSTART', 'BITMAP']
['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
- name: sample_thread
- parameters: [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]
  - AUTOSTART: 0
  - BITMAP: thread.gif
  - COORDS: [0, 0]
  - DATAFORMAT: [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]

I think you will get further faster with pyparsing.

-- Paul

I'll try and answer what I think is the missing question(s)...

Configuration files come in many formats. There are well known formats such as *.ini or apache config - these tend to have many parsers available.

Then there are custom formats. That is what yours appears to be (it could be some well-defined format you and I have never seen before - but until you know what that is it doesn't really matter).

I would start with the software this came from and see if they have a programming API that can load/produce these files. If nothing is obvious give Quovadx a call. Chances are someone has already solved this problem.

Otherwise you're probably on your own to create your own parser.

Writing a parser for this format would not be terribly difficult assuming that your sample is representative of a complete example. It's a hierarchy of values where each node can contain either a value or a child hierarchy of values. Once you've defined the basic types that the values can contain the parser is a very simple structure.

You could write this reasonably quickly using something like Lex/Flex or just a straight-forward parser in the language of your choosing.

You can easily write a script in python which will convert it to python dict, format looks almost like hierarchical name value pairs, only problem seems to be Coards {0 0}, where {0 0} isn't a name value pair, but a list so who know what other such cases are in the format I think your best bet is to have spec for that format and write a simple python script to read it.

Your config file is very similar to JSON (pretty much, replace all your "{" and "}" with "[" and "]"). Most languages have a built in JSON parser (PHP, Ruby, Python, etc), and if not, there are libraries available to handle it for you.

If you can not change the format of the configuration file, you can read all file contents as a string, and replace all the "{" and "}" characters via whatever means you prefer. Then you can parse the string as JSON, and you're set.

I searched a little on the Cheese Shop, but I didn't find anything helpful for your example. Check the Examples page, and this specific parser ( it's syntax resembles yours a bit ). I think this should help you write your own.

Look into LEX and YACC. A bit of a learning curve, but they can generate parsers for any language.

Maybe you could write a simple script that will convert your config into xml file and then read it just using lxml, Beatuful Soup or anything else? And your converter could use PyParsing or regular expressions for example.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top