Question

I recently started using pyparsing and I'm stuck on the following: the data is organized in columns, the number of columns is not known in advance, and such a section can occur multiple times in the input. Please see the code below for an example.

# -*- coding: utf-8 -*-

from pyparsing import *
from decimal import Decimal

def convert_float(a):
    return Decimal(a[0].replace(',','.'))

def convert_int(a):
    return int(a[0])

NL = LineEnd().suppress()

dot = Literal('.')
dates = Combine(Word(nums,exact=2) + dot + Word(nums,exact=2) + dot + Word(nums,exact=4))
day_with_date = Word(alphas,exact=3).suppress() + dates

amount = ( Combine(OneOrMore(Word(nums)) + ',' + Word(nums),adjacent=False) + 
           Optional(Literal('EUR')).suppress() ).setParseAction(convert_float)
number = Word(nums).setParseAction(convert_int)

item_head = OneOrMore(Keyword('Item').suppress() + number)
item_det = Forward()
item_foot = Forward()

def defineColNumber(t):
    nbcols = len(t)  # number of "Item N" headers found in this section
    item_det << Dict(Group(day_with_date('date') + Group(nbcols*amount)('data')))
    item_foot << Keyword('TOTAL').suppress() + Group(nbcols*amount)

sec = (item_head('it*').setParseAction(defineColNumber) + 
       Group(OneOrMore(item_det))('details*') + 
       item_foot('totals*'))

parser = OneOrMore(
             sec
         )
parser.ignore(NL)

out = """
                             Item 1             Item 2             Item 3
Sat 20.04.2013     3 126 375,00 EUR     115 297,00 EUR      67 830,00 EUR      
Fri 19.04.2013     1 641 019,20 EUR      82 476,00 EUR      48 759,00 EUR      
Thu 18.04.2013       548 481,10 EUR      46 383,00 EUR      29 810,00 EUR      
Wed 17.04.2013       397 396,70 EUR      42 712,00 EUR      26 812,00 EUR 
TOTAL              8 701 732,00 EUR   1 661 563,00 EUR   1 207 176,00 EUR

                             Item 4             Item 5
Sat 20.04.2013       126 375,00 EUR     215 297,00 EUR      
Fri 19.04.2013     2 641 019,20 EUR      32 476,00 EUR      
Thu 18.04.2013       548 481,10 EUR      56 383,00 EUR      
Wed 17.04.2013       397 396,70 EUR      42 712,00 EUR
TOTAL              2 701 732,00 EUR   1 663 563,00 EUR   

"""

p = parser.parseString(out, parseAll=True)
print(p.dump())
print(p.it)
print(p.details[0]['18.04.2013'].data[2])
print(p.totals)

Currently, for example, p.it looks like [[1, 2, 3], [4, 5]]. What I need is [1, 2, 3, 4, 5], and likewise for the other parts, so that instead of p.details[0]['18.04.2013'].data[2] I could write p.details['18.04.2013'].data[2].

I'm out of ideas. Is it possible to join the results in some easy way, or do I need to modify the ParseResults with some other function?

Thanks for your help.

BTW, does this code make sense with regard to parsing dates, amounts, etc.?


Solution

This kind of parsing of tabular data is one of the original cases that pyparsing was written for. Congratulations on getting this far with parsing a non-trivial input text!

Rather than try to do any unnatural Grouping or whatnot to twist or combine the parsed data into your desired data structure, I'd just walk the parsed results as you've got them and build up a new summary structure, which I'll call summary. We are actually going to accumulate data into this dict, which strongly suggests using a defaultdict for simplified initialization of the summary when a new key is found.

from collections import defaultdict
summary = defaultdict(dict)
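
To illustrate why defaultdict(dict) simplifies the accumulation, here is a minimal sketch using made-up placeholder values (the strings are illustrative only and are not taken from the parsed data):

```python
from collections import defaultdict

# Illustrative values only; the key mimics the date strings used in the question.
summary = defaultdict(dict)

# Accessing a missing key creates an empty dict on the fly,
# so update() can be called without first checking whether the key exists.
summary['18.04.2013'].update({1: 'first section'})
summary['18.04.2013'].update({4: 'second section'})

print(dict(summary))
```

With a plain dict, the first update() call would raise a KeyError unless the key had been initialized beforehand.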

Looking at the current structure returned in p, you are getting item headers and detailed data sets gathered into the named results it and details. We can zip these together to get each section's headers and data. Then for each line in the details, we'll make a dict of the detailed values by zipping the item headers with the parsed data values. Then we'll update the summary entry that is keyed by the line's date (line.date[0]):

for items,details in zip(p.it,p.details):
    for line in details:
        summary[line.date[0]].update(dict(zip(items,line.data)))
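
The same merge can be tried without pyparsing installed. The sketch below uses plain lists and tuples as hypothetical stand-ins for p.it and p.details (the names headers_per_section and details_per_section are invented for this illustration, and only two dates from the sample input are included):

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical stand-ins for p.it and p.details: one entry per section.
headers_per_section = [[1, 2, 3], [4, 5]]
details_per_section = [
    [  # section with Items 1-3: (date, amounts per column)
        ("18.04.2013", [Decimal("548481.10"), Decimal("46383.00"), Decimal("29810.00")]),
        ("17.04.2013", [Decimal("397396.70"), Decimal("42712.00"), Decimal("26812.00")]),
    ],
    [  # section with Items 4-5
        ("18.04.2013", [Decimal("548481.10"), Decimal("56383.00")]),
        ("17.04.2013", [Decimal("397396.70"), Decimal("42712.00")]),
    ],
]

summary = defaultdict(dict)
for items, details in zip(headers_per_section, details_per_section):
    for date, data in details:
        # Pair each item number with the amount in the same column,
        # then merge the pairs into the dict for this date.
        summary[date].update(zip(items, data))

print(summary["18.04.2013"][3])  # amount for Item 3 on that date
```

Looking a value up by date and item number, as in summary["18.04.2013"][3], takes the place of the positional p.details[0]['18.04.2013'].data[2] from the question.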

Done! See what the keys are that we have accumulated:

print(list(summary.keys()))

gives (the order of the keys is not significant and may differ):

['20.04.2013', '18.04.2013', '17.04.2013', '19.04.2013']

Print the data accumulated for '18.04.2013':

print(summary['18.04.2013'])

gives:

{1: Decimal('548481.10'), 2: Decimal('46383.00'), 3: Decimal('29810.00'), 4: Decimal('548481.10'), 5: Decimal('56383.00')}
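
If a single flat list of item numbers is also wanted, as the question asks for p.it, one possible approach is itertools.chain. This is a sketch on a plain nested list that copies the shape shown in the question; that list(chain.from_iterable(p.it)) behaves the same way on the actual ParseResults is an assumption worth checking against your pyparsing version.

```python
from itertools import chain

# Stand-in for p.it as printed in the question (hypothetical name).
headers_per_section = [[1, 2, 3], [4, 5]]

# Flatten one level of nesting into a single list of item numbers.
all_items = list(chain.from_iterable(headers_per_section))
print(all_items)
```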
Licensed under: CC-BY-SA with attribution