How to write a grammar for this in pyparsing: match a set of words, but not containing a given pattern

https://stackoverflow.com/questions/1805309

05-07-2019

Question

I am new to Python and pyparsing. I need to accomplish the following.

My sample lines of text look like this:

12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009

I need to extract the item description and the period.

from pyparsing import Combine, Word, alphas, alphanums, nums

tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)

# pseudocode for what I want:
# tok_desc = Word(alphanums+"-()"), but stop before tok_period

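How can I do this?

Solution

I would suggest looking at SkipTo as the most suitable pyparsing class, since you have a good definition of the unwanted text but will accept pretty much anything before it. Here are a couple of ways to use SkipTo: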

text = """\
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009"""

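# using tok_period as defined in the OP
from pyparsing import SkipTo

# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])

# or have pyparsing search through the whole input string using searchString
# (the nested [[td, _]] unpacking reflects the result shape of the pyparsing
# release this answer was written against; it may differ in other releases)
for [[td,_]] in SkipTo(tok_period,include=True).searchString(text):
    print(td)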

Both for loops print the following:

12 items - Ironing Service    
Washing service (3 Shirt)
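
Since the question asks for both the description and the period, here is a minimal sketch that chains SkipTo with tok_period so each line yields both fields. The result names "desc" and "period" and the helper parse_line are illustrative choices, not part of the original answer, and the sketch assumes pyparsing is installed.

```python
from pyparsing import Combine, SkipTo, Word, alphas, nums

# Date and period expressions exactly as defined in the question.
tok_date_in_ddmmmyyyy = Combine(
    Word(nums, min=1, max=2) + " " + Word(alphas, exact=3) + " " + Word(nums, exact=4)
)
tok_period = Combine(
    (tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy) | tok_date_in_ddmmmyyyy
)

# "desc" and "period" are hypothetical result names chosen for this sketch.
line_expr = SkipTo(tok_period)("desc") + tok_period("period")


def parse_line(line):
    """Return (description, period) for one line of input."""
    result = line_expr.parseString(line)
    # SkipTo keeps the whitespace in front of the date, so strip it here.
    return result["desc"].strip(), result["period"]


text = """\
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009"""

rows = [parse_line(tx) for tx in text.splitlines()]
for row in rows:
    print(row)
```

A line that contains no date makes parseString raise a ParseException, so real input would need a try/except around parse_line.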

Other tips

M K Saravanan, this particular parsing problem is not that hard to do with the plain old re module:

import re

text='''
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009
This line does not match
'''

date_pat=re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # str.strip is the Python 3 spelling of the original string.strip
            description,period=map(str.strip,date_pat.split(line)[:2])
            print((description,period))
        except ValueError:
            # The line does not match
            pass

This yields:

# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')

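The main workhorse here is of course the re pattern. Let's take it apart:

\d{1,2}\s+[a-zA-Z]{3}\s+\d{4} is the regexp for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches 3 letters, and so on.

(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})? is a regexp surrounded by (?:...). This denotes a non-grouping regexp: no group (e.g. match.group(2)) is assigned to it. This matters because date_pat.split() returns a list in which each group is a member of the list. By suppressing the grouping, we keep the whole period 11 Mar 2009 to 10 Apr 2009 together. The trailing question mark means the pattern may occur zero or one times, which lets the regexp match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.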
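
To make the effect of (?:...) concrete, the following sketch splits the same line with two patterns: the non-grouping form used above, and a hypothetical variant in which the inner group captures. The variable names are illustrative only.

```python
import re

line = "12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009"
date = r"\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}"

# Inner group is non-capturing, as in date_pat above.
non_capturing = re.compile(r"(" + date + r"(?:\s+to\s+" + date + r")?)")
# Hypothetical variant: the inner group captures, adding a second group.
capturing = re.compile(r"(" + date + r"(\s+to\s+" + date + r")?)")

parts_nc = non_capturing.split(line)
parts_c = capturing.split(line)

print(len(parts_nc), parts_nc)
print(len(parts_c), parts_c)
```

With the capturing variant, split() adds one extra list element for the inner group, so taking the first two elements would no longer be the only bookkeeping needed.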

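text.splitlines() splits text on \n.

date_pat.split('12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009')

splits the string on the date_pat regexp. The match is included in the returned list. Thus we get:

['12 items - Ironing Service    ', '11 Mar 2009 to 10 Apr 2009', '']

map(string.strip, date_pat.split(line)[:2]) prettifies the result by stripping the surrounding whitespace (string.strip is the Python 2 spelling; on Python 3 use str.strip).

If line does not match date_pat, then date_pat.split(line) returns [line,], so

description, period = map(string.strip, date_pat.split(line)[:2])

raises a ValueError, because we can't unpack a list with only one element into a 2-tuple. We catch this exception but simply pass on to the next line.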
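
As an alternative to relying on the ValueError, one could test for a match explicitly. This is only a sketch of a different approach, not what the answer above does; the helper name split_line is hypothetical.

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')


def split_line(line):
    """Return (description, period), or None if the line contains no date."""
    match = date_pat.search(line)
    if match is None:
        return None
    # Everything before the date is the description; group 1 is the period.
    return line[:match.start()].strip(), match.group(1)


for sample in ("12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009",
               "Washing service (3 Shirt)  23 Mar 2009",
               "This line does not match"):
    print(split_line(sample))
```

Using search() keeps the no-match case visible as a None return instead of an exception, at the cost of slicing the description out by hand.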



	
		
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow