How to write the grammar for this in pyparsing: match a set of words but not containing a given pattern
Question
I am new to Python and pyparsing. I need to accomplish the following.
My sample line of text is like this:
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009
I need to extract the item description and the period.
tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)
tok_desc = Word(alphanums+"-()") but stop before tok_period
How to do this?
Solution
I would suggest looking at SkipTo as the pyparsing class that is most appropriate, since you have a good definition of the unwanted text, but will accept pretty much anything before that. Here are a couple of ways to use SkipTo:
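from pyparsing import SkipTo

text = """\
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009"""
# using tok_period as defined in the OP
# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])
# or have pyparsing search through the whole input string using searchString
# (recent pyparsing releases yield a flat [skipped, matched] pair per match;
# the answer as first posted unpacked [[td,_]], presumably a nested result
# from the release it was written for)
for td, _ in SkipTo(tok_period, include=True).searchString(text):
    print(td)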
Both for loops print the following:
12 items - Ironing Service
Washing service (3 Shirt)
OTHER TIPS
M K Saravanan, this particular parsing problem is not so hard to do with good 'ole re:
import re

text = '''
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009
This line does not match
'''
date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # str.strip is the Python 3 spelling of Python 2's string.strip
            description, period = map(str.strip, date_pat.split(line)[:2])
            print((description, period))
        except ValueError:
            # The line does not match
            pass
yields
# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')
The main workhorse here is of course the re pattern. Let's break it apart:
\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}
is the regexp for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches 3 letters, etc.
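As a quick check of what the date sub-pattern accepts, here is a minimal sketch. The names date_re, samples and results are made up for this illustration and are not part of the answer's code. Note that the pattern only checks the shape of a date, so a nonsense string such as 99 Xyz 0000 still matches.

```python
import re

# Same date sub-pattern as in the explanation above (date_re is a hypothetical name).
date_re = re.compile(r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}')

samples = ['23 Mar 2009', '3  Apr 2010', '23 March 2009', 'Mar 2009', '99 Xyz 0000']

# fullmatch requires the whole string to fit the pattern.
results = {s: bool(date_re.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
# '23 Mar 2009' True
# '3  Apr 2010' True
# '23 March 2009' False
# 'Mar 2009' False
# '99 Xyz 0000' True
```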
(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?
is a regexp surrounded by (?:...). This marks a non-capturing group: no group (e.g. match.group(2)) is assigned to this part of the regexp. This matters because date_pat.split() returns the text of every capturing group as a member of the list. By suppressing the capture, we keep the entire period 11 Mar 2009 to 10 Apr 2009 together. The question mark at the end indicates that this pattern may occur zero or one times. This allows the regexp to match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.
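To see why the (?:...) matters for split(), the following sketch compares the pattern with a variant whose optional tail is an ordinary capturing group. DATE, non_capturing and capturing are names invented for this comparison only.

```python
import re

DATE = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'
line = '12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009'

# Outer group captures the whole period; the optional " to <date>" tail is non-capturing.
non_capturing = re.compile(r'(' + DATE + r'(?:\s+to\s+' + DATE + r')?)')
# Same pattern, but the tail is a second capturing group.
capturing = re.compile(r'(' + DATE + r'(\s+to\s+' + DATE + r')?)')

print(non_capturing.split(line))
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
print(capturing.split(line))
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', ' to 10 Apr 2009', '']
```

With the extra capturing group, split() adds one more list element per match, so taking the first two elements would no longer be enough to tell the pieces apart reliably.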
text.splitlines()
splits text into a list of lines, breaking at each line boundary (here, \n).
date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')
splits the string on the date_pat regexp. Because the whole pattern is wrapped in a capturing group, the matched text is included in the returned list. Thus we get:
['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
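map(str.strip, date_pat.split(line)[:2])
strips the surrounding whitespace from the first two pieces, the description and the period. (str.strip is the Python 3 spelling; string.strip exists only in Python 2.)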
If line does not match date_pat, then date_pat.split(line) returns [line], so the assignment to description, period raises a ValueError because we can't unpack a sequence with only one element into two names. We catch this exception but simply pass on to the next line.
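If you would rather not rely on the exception, one alternative (not what the answer above does) is to use search() and test for a match explicitly. This is only a sketch: split_line is a hypothetical helper name, and it reuses the same date_pat pattern.

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')

def split_line(line):
    """Return (description, period), or None if the line contains no date."""
    m = date_pat.search(line)
    if m is None:
        return None
    # Everything before the first date is the description.
    return line[:m.start()].strip(), m.group(1)

print(split_line('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009'))
# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
print(split_line('This line does not match'))
# None
```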