What's the best way(error proof / foolproof) to parse a file using python with following format?

https://stackoverflow.com/questions/493484

20-08-2019
|

Question

########################################
# some comment
# other comment
########################################

block1 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

block2 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

Solution

The best way would be to use an existing format such as JSON.

Here's an example parser for your format:

from lepl import (AnyBut, Digit, Drop, Eos, Integer, Letter,
                  NON_GREEDY, Regexp, Space, Separator, Word)

# EBNF
# name = ( letter | "_" ) , { letter | "_" | digit } ;
name = Word(Letter() | '_',
            Letter() | '_' | Digit())
# words = word , space+ , word , { space+ , word } ;
# two or more space-separated words (non-greedy to allow comment at the end)
words = Word()[2::NON_GREEDY, ~Space()[1:]] > list
# value = integer | word | words  ;
value = (Integer() >> int) | Word() | words
# comment = "#" , { all characters - "\n" } , ( "\n" | EOF ) ;
comment = '#' & AnyBut('\n')[:] & ('\n' | Eos())

with Separator(~Regexp(r'\s*')):
    # statement = name , "=" , value ;
    statement = name & Drop('=') & value > tuple
    # suite     = "{" , { comment | statement } , "}" ;
    suite     = Drop('{') & (~comment | statement)[:] & Drop('}') > dict
    # block     = name , suite ;
    block     = name & suite > tuple
    # config    = { comment | block } ;
    config    = (~comment | block)[:] & Eos() > dict

from pprint import pprint

pprint(config.parse(open('input.cfg').read()))

Output:

[{'block1': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'},
  'block2': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'}}]

OTHER TIPS

Well, the data looks pretty regular. So you could do something like this (untested):

class Block(object):
    def __init__(self, name):
        self.name = name

infile = open(...)  # insert filename here
current = None
blocks = []

for line in infile:
    if line.lstrip().startswith('#'):
        continue
    elif line.rstrip().endswith('{'):
        current = Block(line.split()[0])
    elif '=' in line:
        attr, value = line.strip().split('=')
        try:
            value = int(value)
        except ValueError:
            pass
        setattr(current, attr, value)
    elif line.rstrip().endswith('}'):
        blocks.append(current)

The result will be a list of Block instances, where block.name will be the name ('block1', 'block2', etc.) and other attributes correspond to the keys in your data. So, blocks[0].value will be 'data', etc. Note that this only handles strings and integers as values.

(there is an obvious bug here if your keys can ever include 'name'. You might like to change self.name to self._name or something if this can happen)

HTH!

If you do not really mean parsing, but rather text processing and the input data is really that regular, then go with John's solution. If you really need some parsing (like there are some a little more complex rules to the data that you are getting), then depending on the amount of data that you need to parse, I'd go either with pyparsing or simpleparse. I've tried both of them, but actually pyparsing was too slow for me.

You might look into something like pyparsing.

Grako (for grammar compiler) allows to separate the input format specification (grammar) from its interpretation (semantics). Here's grammar for your input format in Grako's variety of EBNF:

(* a file contains zero or more blocks *)
file = {block} $;
(* a named block has at least one assignment statement *)
block = name '{' {assignment}+ '}';
assignment = name '=' value NEWLINE;
name = /[a-z][a-z0-9_]*/;
value = integer | string;
NEWLINE = /\n/;
integer = /[0-9]+/;
(* string value is everything until the next newline *)
string = /[^\n]+/;

To install grako, run pip install grako. To generate the PEG parser from the grammar:

$ grako -o config_parser.py Config.ebnf

To convert stdin into json using the generated config_parser module:

#!/usr/bin/env python
import json
import string
import sys
from config_parser import ConfigParser

class Semantics(object):
    def file(self, ast):
        # file = {block} $
        # all blocks should have unique names within the file
        return dict(ast)
    def block(self, ast):
        # block = name '{' {assignment}+ '}'
        # all assignment statements should use unique names
        return ast[0], dict(ast[2])
    def assignment(self, ast):
        # assignment = name '=' value NEWLINE
        # value = integer | string
        return ast[0], ast[2] # name, value
    def integer(self, ast):
        return int(ast)
    def string(self, ast):
        return ast.strip() # remove leading/trailing whitespace

parser = ConfigParser(whitespace='\t\n\v\f\r ', eol_comments_re="#.*?$")
ast = parser.parse(sys.stdin.read(), rule_name='file', semantics=Semantics())
json.dump(ast, sys.stdout, indent=2, sort_keys=True)

Output

{
  "block1": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  },
  "block2": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow