What is the best (foolproof) way to parse a file with the following format using Python?

https://stackoverflow.com/questions/493484

20-08-2019

Question

########################################
# some comment
# other comment
########################################

block1 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

block2 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

Solution

The best way would be to use an existing format such as JSON.
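As a rough sketch of what that could look like, assuming you control the file format: the same two blocks stored as JSON and read back with the standard json module. The JSON layout below is an assumption, not something the question specifies. Note that othervalue is kept as a string here, because JSON numbers cannot start with a leading zero.

```python
import json

# Hypothetical JSON equivalent of the file from the question.
# "032423432" stays a string: JSON does not allow numbers with a leading zero.
text = """
{
  "block1": {
    "value": "data",
    "some_value": "some other kind of data",
    "othervalue": "032423432"
  },
  "block2": {
    "value": "data",
    "some_value": "some other kind of data",
    "othervalue": "032423432"
  }
}
"""

config = json.loads(text)
print(config["block1"]["some_value"])

# Writing the structure back out is symmetrical.
round_trip = json.loads(json.dumps(config, indent=2, sort_keys=True))
```

With an existing format, comments, escaping, and error reporting are handled by a well-tested library instead of by your own code.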

Here is an example parser for your format:

from lepl import (AnyBut, Digit, Drop, Eos, Integer, Letter,
                  NON_GREEDY, Regexp, Space, Separator, Word)

# EBNF
# name = ( letter | "_" ) , { letter | "_" | digit } ;
name = Word(Letter() | '_',
            Letter() | '_' | Digit())
# words = word , space+ , word , { space+ , word } ;
# two or more space-separated words (non-greedy to allow comment at the end)
words = Word()[2::NON_GREEDY, ~Space()[1:]] > list
# value = integer | word | words  ;
value = (Integer() >> int) | Word() | words
# comment = "#" , { all characters - "\n" } , ( "\n" | EOF ) ;
comment = '#' & AnyBut('\n')[:] & ('\n' | Eos())

with Separator(~Regexp(r'\s*')):
    # statement = name , "=" , value ;
    statement = name & Drop('=') & value > tuple
    # suite     = "{" , { comment | statement } , "}" ;
    suite     = Drop('{') & (~comment | statement)[:] & Drop('}') > dict
    # block     = name , suite ;
    block     = name & suite > tuple
    # config    = { comment | block } ;
    config    = (~comment | block)[:] & Eos() > dict

from pprint import pprint

pprint(config.parse(open('input.cfg').read()))

Output:

[{'block1': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'},
  'block2': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'}}]
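In the output above, multi-word values come back as lists of words. If you would rather have plain strings, a small post-processing step can join them. This is only a sketch: join_words is a hypothetical helper, it would be applied to the dict inside the returned list, and joining with a single space does not preserve the original spacing.

```python
# One block in the shape shown in the output above.
parsed = {
    'block1': {'othervalue': 32423432,
               'some_value': ['some', 'other', 'kind', 'of', 'data'],
               'value': 'data'},
}

def join_words(blocks):
    """Return a copy of blocks in which list values are joined into one string."""
    return {
        name: {key: ' '.join(val) if isinstance(val, list) else val
               for key, val in block.items()}
        for name, block in blocks.items()
    }

flat = join_words(parsed)
print(flat['block1']['some_value'])
```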

Other suggestions

Well, the data looks pretty regular, so you could do something like this (untested):

class Block(object):
    def __init__(self, name):
        self.name = name

infile = open(...)  # insert filename here
current = None
blocks = []

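for line in infile:
    if line.lstrip().startswith('#'):
        continue  # skip comment lines
    elif line.rstrip().endswith('{'):
        # "block1 {" starts a new block named "block1"
        current = Block(line.split()[0])
    elif '=' in line:
        # split on the first '=' only, so a value may itself contain '='
        attr, value = line.strip().split('=', 1)
        try:
            value = int(value)
        except ValueError:
            pass  # not an integer: keep the string
        setattr(current, attr, value)
    elif line.rstrip().endswith('}'):
        blocks.append(current)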

The result is a list of Block instances, where block.name is the name ('block1', 'block2', etc.) and the other attributes correspond to the keys in your data. So blocks[0].value will be 'data', and so on. Note that this only handles strings and integers as values.

(There is an obvious bug here if your keys can ever include "name". You might change self.name to self._name or something similar if that can happen.)
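One way to sidestep the clash with "name" altogether is to keep each block's keys in a plain dict instead of setting them as attributes. The following is a minimal sketch along those lines, not the approach from the answer above: parse_blocks is a hypothetical helper, it assumes one statement per line and no nested blocks, and it raises ValueError with a line number on anything it does not recognize. Values are left as strings, since int() would drop the leading zero of 032423432.

```python
def parse_blocks(lines):
    """Parse 'name { key=value ... }' blocks into a dict of dicts."""
    blocks = {}
    current = None  # dict of the block being read, or None outside a block
    for lineno, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if line == '}':
            if current is None:
                raise ValueError(f'line {lineno}: unexpected closing brace')
            current = None
        elif '=' in line:
            if current is None:
                raise ValueError(f'line {lineno}: assignment outside a block')
            # split on the first '=' only, so values may contain '='
            key, value = line.split('=', 1)
            current[key.strip()] = value.strip()
        elif line.endswith('{'):
            if current is not None:
                raise ValueError(f'line {lineno}: nested blocks are not supported')
            current = blocks[line[:-1].strip()] = {}
        else:
            raise ValueError(f'line {lineno}: cannot parse {line!r}')
    if current is not None:
        raise ValueError('unterminated block at end of input')
    return blocks

sample = """\
########################################
# some comment
########################################

block1 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    name=this key no longer clashes
    }
"""

config = parse_blocks(sample.splitlines())
print(config['block1']['name'])
```

Because a file object yields lines when iterated, the same function can be called as parse_blocks(infile) on an open file.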

HTH!

If you don't really mean parsing, but rather text processing, and the input data really is that regular, then go with John's solution. If you really need some parsing (for example, because there are somewhat more complex rules for the data you are getting), then, depending on the amount of data to be parsed, I'd go with pyparsing or simpleparse. I've tried both of them, but pyparsing was actually too slow for me.

You might look into something like pyparsing.

Grako (for grammar compiler) lets you separate the input format specification (grammar) from its interpretation (semantics). Here is the grammar for your input format in Grako's variety of EBNF:

(* a file contains zero or more blocks *)
file = {block} $;
(* a named block has at least one assignment statement *)
block = name '{' {assignment}+ '}';
assignment = name '=' value NEWLINE;
name = /[a-z][a-z0-9_]*/;
value = integer | string;
NEWLINE = /\n/;
integer = /[0-9]+/;
(* string value is everything until the next newline *)
string = /[^\n]+/;

To install grako, run pip install grako. To generate the PEG parser from the grammar:

$ grako -o config_parser.py Config.ebnf

To convert stdin into JSON using the generated config_parser module:

#!/usr/bin/env python
import json
import sys
from config_parser import ConfigParser

class Semantics(object):
    def file(self, ast):
        # file = {block} $
        # all blocks should have unique names within the file
        return dict(ast)
    def block(self, ast):
        # block = name '{' {assignment}+ '}'
        # all assignment statements should use unique names
        return ast[0], dict(ast[2])
    def assignment(self, ast):
        # assignment = name '=' value NEWLINE
        # value = integer | string
        return ast[0], ast[2] # name, value
    def integer(self, ast):
        return int(ast)
    def string(self, ast):
        return ast.strip() # remove leading/trailing whitespace

parser = ConfigParser(whitespace='\t\n\v\f\r ', eol_comments_re="#.*?$")
ast = parser.parse(sys.stdin.read(), rule_name='file', semantics=Semantics())
json.dump(ast, sys.stdout, indent=2, sort_keys=True)

Output

{
  "block1": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  },
  "block2": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow