Question

I am trying to parse a file with the excellent Python library pyparsing, but I am running into a lot of problems.

The file I am trying to parse is something like:

sectionOne:
  list:
  - XXitem
  - XXanotherItem
  key1: value1
  product: milk
  release: now
  subSection:
    skey : sval
    slist:
    - XXitem
  mods:
  - XXone
  - XXtwo
  version: last
sectionTwo:
  base: base-0.1
  config: config-7.0-7

As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:

  • The file can have one or more sections.
  • Each section is formed by a section name and a section content.
  • Each section has indented content.
  • Each section content can have one or more key/value pairs or a subsection.
  • Each value can be either a single word or a list of items.
  • A list of items is a group of one or more items.
  • Each item is a HYPHEN followed by a name starting with 'XX'.

I have tried to create this grammar using pyparsing but with no success.

import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")

list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))

key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)

pair = pyparsing.Group(key + value)

indentStack = [1]

section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)

section << pyparsing.Group(section_name + section_content)

parser = pyparsing.OneOrMore(section)

def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()

        print("content:")
        print(content)
        print("\n")
        result = parser.parseString(content)
        print("result1:")
        print(result)
        print("len", len(result))

        pprint.pprint(result.asList())
    except (pyparsing.ParseException, pyparsing.ParseFatalException) as err:
        # Show the offending line with a caret under the error column.
        print(err.line)
        print(" " * (err.column - 1) + "^")
        print(err)


if __name__ == '__main__':
    main()

This is the result:

result1:
  [['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
  len 2
  [
     ['sectionOne',
     [[
        ['list', ['XXitem', 'XXanotherItem']],
        ['key1', 'value1'],
        ['product', 'milk'],
        ['release', 'now'],
        ['subSection',
           [[
              ['skey', 'sval'],
              ['slist', ['XXitem']],
              ['mods', ['XXone', 'XXtwo']],
              ['version', 'last']
           ]]
        ]
     ]]
     ],
     ['sectionTwo', 
     [[
        ['base', 'base-0.1'], 
        ['config', 'config-7.0-7']
     ]]
     ]
  ]

As you can see I have two main problems:

1. Each section's content is nested twice in a list.

2. The key "version" is parsed inside "subSection" when it belongs to "sectionOne".

My real goal is to get a structure of nested Python dictionaries, keyed by the names in the file, so that I can easily extract each field, but pyparsing.Dict is still obscure to me.

Could anyone please help me?

Thanks in advance.

(Sorry for the long post.)


Solution

You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.

Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.

The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
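
To see that effect in isolation, here is a minimal sketch that leaves indentation out entirely. The extra Group in `wrapped` is only a stand-in for the nesting that indentedBlock adds. The names `wrapped`, `nested`, `flat` and the one-line sample text are made up for illustration and are not part of your grammar.

```python
import pyparsing as pp

# Hypothetical mini-grammar: "key: value" pairs on one line, no indentation.
key = pp.Word(pp.alphas) + pp.Suppress(":")
value = pp.Word(pp.alphanums + "-.")
pair = pp.Group(key + value)

# Stand-in for indentedBlock: one extra Group around all the pairs.
wrapped = pp.Group(pp.OneOrMore(pair))

text = "base: base-0.1 config: config-7.0-7"

# With the extra Group, the pairs sit one level too deep.
nested = wrapped.parseString(text).asList()

# ungroup removes that outer level again.
flat = pp.ungroup(wrapped).parseString(text).asList()

# Dict can now use the first token of each pair as its key.
result = pp.Dict(pp.ungroup(wrapped)).parseString(text)

print(nested)
print(flat)
print(result["base"])
print(result.asDict())
```

Comparing `nested` and `flat` shows the single level that ungroup strips off. Once it is gone, Dict finds a [key, value] group at each position and can register the names.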

Best of luck with pyparsing!

import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")

list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))

key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)

#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)

indentStack = [1]

section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)

#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))

#~ A: section << pyparsing.Group(section_name + section_content)
section << (section_name + section_content)

#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))

Now instead of pprint(result.asList()) you can write:

print (result.dump())

to show the Dict hierarchy:

[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7

allowing you to write statements like:

print (result.sectionTwo.base)
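
Since the stated goal was plain nested dictionaries, asDict() may also be worth a look. The sketch below uses a stand-in grammar with explicit braces instead of indentation, only to stay short and self-contained. The brace syntax, the name `demo_parser` and the sample text are assumptions for illustration, not your file format. The Dict nesting has the same shape as in the parser above, so the same call should carry over, but treat that as something to verify against your own file.

```python
import pyparsing as pp

name = pp.Word(pp.alphanums + "_-.")
pair = pp.Group(name + pp.Suppress(":") + name)

# Hypothetical section syntax "name { pairs }"; braces replace indentation here.
section = pp.Group(
    name + pp.Suppress("{") + pp.Dict(pp.OneOrMore(pair)) + pp.Suppress("}")
)
demo_parser = pp.Dict(pp.OneOrMore(section))

result = demo_parser.parseString(
    "sectionTwo { base: base-0.1 config: config-7.0-7 }"
)

# Attribute access and key access reach the same nested values.
print(result.sectionTwo.base)
print(result["sectionTwo"]["config"])

# asDict() converts the named results into ordinary nested dicts.
as_plain = result.asDict()
print(as_plain)
```

Attribute access is convenient for names that are valid Python identifiers. For keys that are not, such as ones containing a hyphen or a dot, key access or the converted dictionary is the safer route.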
Licensed under: CC-BY-SA with attribution