How should I organise my functions with pyparsing?

https://stackoverflow.com/questions/1866329

18-09-2019
|

Question

I am parsing a file with python and pyparsing (it's the report file for PSAT in Matlab but that isn't important). here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?

Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up. At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place. Neither really feels right to me.

Should I place all my input and output code in a separate file to the other class functions? It would make the purpose of the class much clearer.

I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.

(I will accept answers of it's good enough or use X rather than pyparsing if people agree)

Solution

I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.

I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.

I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace

busname.setResultsName("bus1")

with

busname("bus1")

This can declutter your code quite a bit.

I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:

self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]

I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):

self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate

Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?

busdef = And([busname.setResultsName("bus1"),
            busname.setResultsName("bus2"),
            integer.setResultsName("linenum"),
            decimaltable("pf qf pl ql".split())])

This is just as easily written:

busdef = (busname("bus1") + busname("bus2") + 
            integer("linenum") + 
            decimaltable("pf qf pl ql".split()))

Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:

def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")

The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow