Question

I've been slowly learning PyParsing and have found it is a great tool with a lot of potential uses, but I'm struggling because of the lack of detailed documentation. Hence, I'm stuck with a problem.

My goal is to parse a CSV file whose columns form groups of data. These groups are important for interpreting the data later in post-processing. Further, the CSV files have optional columns, which is why I really like pyparsing because of its flexibility.

I've successfully created a parser for validating and correctly parsing the header of the CSV file. However, I have two options to correctly process the rows of data.

1) I could somehow create another parser for the data, based on the ParseResults of the header, so that the data parser knows which columns to expect.

OR

2) Read the rows of data as an array and somehow retrieve the column number (not character number) of each header field from the header parser.

Below is a toy example to illustrate what I'm trying to achieve.

csv_header_1='''FirstName Surname Address Notes PurchaseOrder OrderDate'''

csv_data_1='''"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01'''.splitlines()


csv_header_2='''FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total'''

csv_data_2='''"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000'''.splitlines()

# Pyparsing header parser:

from pyparsing import Literal, Group, Optional, ParseException

print 'Create pyparsing Elements.'
firstname=Literal('FirstName').setResultsName('Firstname')
surname=Literal('Surname').setResultsName('Surname')
address=Literal('Address').setParseAction( lambda tokens: " ".join(tokens)).setResultsName('Address')
notes=Literal('Notes').setResultsName('Notes')
phone_home= Literal('PhoneHome').setResultsName('Home')
phone_mobile= Literal('PhoneMobile').setResultsName('Mobile')
customer=Group(firstname + surname + address + Optional(notes) + Optional(phone_home + phone_mobile) ).setResultsName('Customer')

purchase_order= Literal('PurchaseOrder').setResultsName('Purchase_order')
order_date= Literal('OrderDate').setResultsName('Order_date')
total= Literal('Total').setResultsName('Total')
order = Group(purchase_order + order_date + Optional(total) ).setResultsName('Order')


header=Group( customer + order ).setResultsName('Header')

print 'Parse CSV header 1.'

try:
    parsed_header = header.parseString(csv_header_1)
except ParseException, err:
    print err.line
    print " "*(err.column-1) + "^"
    print err


print 'CSV header 1 dump: ', parsed_header.dump()

try:
    parsed_header = header.parseString(csv_header_2)
except ParseException, err:
    print err.line
    print " "*(err.column-1) + "^"
    print err


print 'CSV header 2 dump: ', parsed_header.dump()

Output:

Create pyparsing Elements.
Parse CSV header 1.
CSV header 1 dump:  [[['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]]
- Header: [['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]
  - Customer: ['FirstName', 'Surname', 'Address', 'Notes']
    - Address: Address
    - Firstname: FirstName
    - Notes: Notes
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
CSV header 2 dump:  [[['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]]
- Header: [['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]
  - Customer: ['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile']
    - Address: Address
    - Firstname: FirstName
    - Home: PhoneHome
    - Mobile: PhoneMobile
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate', 'Total']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
    - Total: Total

The header parser works great, but how can I correctly parse the data rows?

I understand I could write a data parser that is based on the data type of each field, but this will not work as optional columns do not necessarily have unique data types. I need to use the header to determine how many columns there are and the data type in each column.

I can manually create the parser rules below, but I need to create the "customer" and "order" ParserElements dynamically somehow so that they correctly parse the row data. (Please note that the snippet below does not handle the double quotes.)

firstname=Word(alphas).setResultsName('Firstname')
surname=Word(alphas).setResultsName('Surname')
address=OneOrMore(Word(alphas)).setParseAction( lambda tokens: " ".join(tokens)).setResultsName('Address')
phone_home= Word(nums).setResultsName('Home')
phone_mobile= Word(nums).setResultsName('Mobile')
# customer=Group(firstname + surname + address + Optional(phone_home) + Optional(phone_mobile) ).setResultsName('Customer')

purchase_order= Word(alphanums).setResultsName('Purchase_order')
order_date= Combine(Word(nums) + "/" + Word(nums) + "/" + Word(nums)).setResultsName('Order_date')
total= Group( Suppress('$') + Word(nums) ).setResultsName('Total')
# order = Group(purchase_order + order_date + Optional(total) ).setResultsName('Order')

Any suggestions would be appreciated, thanks for your help.

Update

Below is example output I hope to get from a pyparsing parser for the row of data. The example below is only for a single row of data for each CSV example given above.

CSV data 1 dump:  [[["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "Bad customer"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Notes: "Bad customer"
    - Surname: "Smith"
  - Order: ["123ABC", 2013/10/20]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"


CSV data 2 dump:  [[["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Home: "12345678"
    - Mobile: "1234567890"
    - Surname: "Smith"
  - Order: [ "123ABC" 2013/10/20, $100]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"
    - Total: $100

This is just an example, but I'm open to a different approach as suggested by Jan and EOL.


Solution

CSV file processing

Check the documentation for the built-in csv module. There you will find DictReader, which lets you process a CSV file with a header row. It is an iterator that returns, for each record/line, a dictionary with one key per field name and the related value.

Given this data in a file "data.csv":

name;surname
Jan;Vlcinsky
Pieter;Pan
Jane;Fonda

you can then test from console:

>>> from csv import DictReader
>>> fname = "data.csv"
>>> f = open(fname)
>>> reader = DictReader(f, delimiter=";")
>>> for rec in reader:
...     print rec
...
{'surname': 'Vlcinsky', 'name': 'Jan'}
{'surname': 'Pan', 'name': 'Pieter'}
{'surname': 'Fonda', 'name': 'Jane'}
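
The session above is Python 2. As a rough Python 3 equivalent, here is a minimal sketch that uses an in-memory buffer instead of data.csv so that it is self-contained. It also shows that the header order is available through reader.fieldnames, which is one way to get the column number of each header field. The name column_index is only illustrative.

```python
from csv import DictReader
from io import StringIO

# In-memory stand-in for the "data.csv" file shown above.
buf = StringIO("name;surname\nJan;Vlcinsky\nPieter;Pan\nJane;Fonda\n")

reader = DictReader(buf, delimiter=";")

# fieldnames holds the header fields in file order, so the position
# of each field in the list is its column number (0-based).
column_index = {name: i for i, name in enumerate(reader.fieldnames)}
print(column_index)

# dict() keeps the sketch independent of the mapping type DictReader returns.
records = [dict(rec) for rec in reader]
for rec in records:
    print(rec)
```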

Using your data and emulating open files using StringIO:

from StringIO import StringIO
from csv import DictReader

data1 = """
FirstName Surname Address Notes PurchaseOrder OrderDate
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()


data2 = """
FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()

buf1 = StringIO(data1)
buf2 = StringIO(data2)

reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec

print "---next one comes---"

reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec

This prints:

{'Surname': 'Smith', 'FirstName': 'Bob', 'Notes': 'Bad customer', 'PurchaseOrder': '123ABC,', 'Address': '123 Lucky Street', 'OrderDate': '2013/10/20'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'Notes': 'Good customer', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'OrderDate': '2014/01/01'}
---next one comes---
{'Surname': 'Smith', 'FirstName': 'Bob', 'PhoneMobile': '1234567890', 'PhoneHome': '12345678', 'PurchaseOrder': '123ABC', 'Address': '123 Lucky Street', 'Total': '$100', 'OrderDate': '2013/10/20,'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'PhoneMobile': '0987654321', 'PhoneHome': '87654321', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'Total': '$1000', 'OrderDate': '2014/01/01'}

This way the parsing part is done, and the only remaining task is to create proper objects from the records later on.
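
Note that the output above still carries the stray commas from the sample data ('123ABC,' and '2013/10/20,'), and the records are flat rather than grouped. The following Python 3 sketch shows one possible way to handle both. It assumes the trailing commas are noise, and GROUPS and group_record are hypothetical names introduced here; the group membership is copied from the header grammar in the question.

```python
from csv import DictReader
from io import StringIO

# Hypothetical mapping of group name to the columns it may contain,
# taken from the question's header grammar (optional columns included).
GROUPS = {
    "Customer": ["FirstName", "Surname", "Address", "Notes",
                 "PhoneHome", "PhoneMobile"],
    "Order": ["PurchaseOrder", "OrderDate", "Total"],
}


def group_record(rec):
    """Split one flat DictReader record into nested per-group dictionaries."""
    grouped = {}
    for group, columns in GROUPS.items():
        # Keep only columns present in this file's header and
        # strip the stray trailing commas found in the sample data.
        grouped[group] = {
            col: rec[col].rstrip(",")
            for col in columns
            if rec.get(col) is not None
        }
    return grouped


data1 = '''FirstName Surname Address Notes PurchaseOrder OrderDate
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01'''

records = [group_record(rec)
           for rec in DictReader(StringIO(data1), delimiter=" ")]
for record in records:
    print(record)
```

Because the groups are looked up by column name, the same function works for a file with the other header (PhoneHome, PhoneMobile, Total), without knowing the column positions in advance.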

Playing with classes and printing

The question uses the pyparsing results as a kind of class instance. Here is an example of how we can create classes of our own.

File classes.py:

class Base():
    templ = """
    - Base:
        - ????
    """
    reprtempl = "<Base: {self.__dict__}>"
    def report(self):
        return self.templ.strip().format(self=self)
    def __repr__(self):
        return self.reprtempl.format(self=self)


class Customer(Base):
    templ = """
    - Customer:
        - Address: {self.address}
        - Firstname: {self.first_name}
        - Surname: {self.surname}
        - Notes: {self.notes}
    """
    reprtempl = "<Customer: {self.__dict__}>"

    def __init__(self, first_name, surname, address, phone_home=None, phone_mobile=None, notes=None, **kwargs):
        self.first_name = first_name
        self.surname = surname
        self.address = address
        self.notes = notes
        self.phone_home = phone_home
        self.phone_mobile = phone_mobile

class Order(Base):
    templ = """
    - Order:
        - Order_date: {self.order_date}
        - Purchase_order: {self.purchase_order}
        - Total: {self.total}
    """
    reprtempl = "<Order: {self.__dict__}>"

    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        self.total = total

if __name__ == "__main__":
    customer_dct = {"first_name": "Bob", "surname": "Smith", "address": "Sezam Street 1A",
            "phone_home": "11223344", "phone_mobile": "88990077"}
    customer = Customer(**customer_dct)
    print customer
    print customer.report()
    order_dct = {"order_date": "2014/01/01", "purchase_order": "abc211", "total": "$12"}
    order = Order(**order_dct)
    print order
    print order.report()

The Base class implements __repr__ and report and is the common base for the classes Customer and Order that follow.

The constructors use default values (for attributes we expect to be missing sometimes) and **kwargs, which makes each constructor tolerant of extra (unexpected) named parameters.

The final section, if __name__ ..., contains short test code. If you run

$ python classes.py

you will see the class instances in action.
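
To see why the default values and **kwargs matter when a whole CSV record is passed in, here is a stripped-down Python 3 sketch. The Order class below is a simplified stand-in for the one in classes.py (no Base class, no templates), and the record is made up from the sample data.

```python
class Order:
    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        # Stays None when the record has no "total" column.
        self.total = total
        # kwargs swallows unrelated fields (first_name, surname, ...),
        # so the full record can be passed without filtering it first.


# One full record, including customer fields and lacking "total".
rec = {
    "first_name": "Bob",
    "surname": "Smith",
    "address": "123 Lucky Street",
    "purchase_order": "123ABC",
    "order_date": "2013/10/20",
}

order = Order(**rec)
print(order.purchase_order, order.order_date, order.total)
```

Without **kwargs the call would raise a TypeError for the unexpected first_name argument, and without the default a missing total would do the same.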

Using classes together with csv reading

Note: The following code uses slightly modified field names, to follow the naming conventions for Python identifiers. The original field names would be usable, but then a keyword translation step would have to be added to map them onto these names (I skipped that).

from StringIO import StringIO
from csv import DictReader
from classes import Customer, Order

data1 = """
first_name surname address notes purchase_order order_date
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()


data2 = """
first_name surname address phone_home phone_mobile purchase_order order_date total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()

buf1 = StringIO(data1)
buf2 = StringIO(data2)

reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()

print "---next one comes---"

reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()
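
The keyword translation step skipped above could look roughly like this in Python 3. This is only a sketch: to_snake_case is a hypothetical helper, and it assumes all header names are plain CamelCase as in the question.

```python
import re
from csv import DictReader
from io import StringIO


def to_snake_case(name):
    """Turn a CamelCase header such as 'PurchaseOrder' into 'purchase_order'."""
    # Insert "_" before every capital letter except the first character.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


data = '''FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000'''

reader = DictReader(StringIO(data), delimiter=" ")

# Translate the keys of every record to the names the constructors expect.
records = [
    {to_snake_case(key): value for key, value in rec.items()}
    for rec in reader
]
print(records[0])
```

Each translated record can then be passed to the constructors unchanged, e.g. Customer(**rec) and Order(**rec), with the original header names left untouched in the file.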

Conclusions

  • Python's csv module provides DictReader, which returns each record as a dictionary.
  • Custom classes in Python can be constructed from a set of keyword parameters and can implement handy methods (here, e.g., report).
  • The example could be extended further, e.g. to manage relations between customer and order, but that is out of scope of this answer.