I have the following input that I want to parse into a CSV-delimited string. I can extract the SKUs with a regex, but I'm new to regex parsing and don't know how to build more complex patterns. It would be great if anyone could help me with this.

Thanks!

    charset="iso-8859-1"


    BODY {


    }



    TD {



    }



    TH {


    }



    H1 {


    }

    TABLE,IMG,A {


    }

    **PO Number:** 35102


    **Ship To:**


    Georgie Clements



    6902 Stonegate Drive

    Odessa, TX 79765



    432-363-8459


    SKU



    Product



    Qty


    JJ-Rug-Zebra-PK



    Zebra Pink Rug



    1

    JJ-Zebra-PK-Twin-4



    Zebra Pink 4 Piece Twin Comforter Set



    1



    JJ-TwinSheets-Zebra-PK



    Zebra Pink 3 Piece Twin Sheet Set



    1




    JJ-Memo-Zebra-PK



    Zebra Pink Memory Board



    1

I want the output formatted like this:

    PONumber, Shipping info, SKU, Product, Qty
    '35102', '[ShipToAddress]', 'JJ-Rug-Zebra-PK', 'Zebra Pink Rug', '1'
    '35102', '[ShipToAddress]', 'JJ-Zebra-PK-Twin-4', 'Zebra Pink 4 Piece Twin Comforter Set', '1'
    '35102', '[ShipToAddress]', 'JJ-TwinSheets-Zebra-PK', 'Zebra Pink 3 Piece Twin Sheet Set', '1'
    '35102', '[ShipToAddress]', 'JJ-Memo-Zebra-PK', 'Zebra Pink Memory Board', '1'

The current code is the following:

    import re

    pattern = re.compile(r'(\b\w*JJ-\S*)')

    pos = 0
    while True:
        match = pattern.search(msgStr, pos)
        if not match:
            break
        a = match.start()
        e = match.end()
        print(' %2d : %2d = %s' % (a, e - 1, msgStr[a:e]))
        pos = e
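For reference, the same match walk can be written more compactly with re.finditer; a minimal sketch, assuming msgStr holds the converted e-mail text as above:

    import re

    pattern = re.compile(r'(\b\w*JJ-\S*)')
    for match in pattern.finditer(msgStr):   # msgStr: the html2text output, as in the loop above
        print(' %2d : %2d = %s' % (match.start(), match.end() - 1, match.group()))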

Solution

Here's another solution, not using regular expressions:

s = "(your data as a single multiline string)"

datalines = lambda s: [ln for ln in (line.strip() for line in s.splitlines()) if ln]

_, _, po_number, _, rem = s.split('**')
shipto, data = rem.split('SKU', 1)

po_number = datalines(po_number)[0]
shipto    = '\n'.join(datalines(shipto))
data      = datalines(data)[2:]

res = [[po_number, shipto, sku, prod, qty] for sku,prod,qty in zip(*([iter(data)]*3))]

which gives the final result

    [
        ['35102', 'Georgie Clements\n6902 Stonegate Drive\nOdessa, TX 79765\n432-363-8459', 'JJ-Rug-Zebra-PK', 'Zebra Pink Rug', '1'],
        ['35102', 'Georgie Clements\n6902 Stonegate Drive\nOdessa, TX 79765\n432-363-8459', 'JJ-Zebra-PK-Twin-4', 'Zebra Pink 4 Piece Twin Comforter Set', '1'],
        ['35102', 'Georgie Clements\n6902 Stonegate Drive\nOdessa, TX 79765\n432-363-8459', 'JJ-TwinSheets-Zebra-PK', 'Zebra Pink 3 Piece Twin Sheet Set', '1'],
        ['35102', 'Georgie Clements\n6902 Stonegate Drive\nOdessa, TX 79765\n432-363-8459', 'JJ-Memo-Zebra-PK', 'Zebra Pink Memory Board', '1']
    ]

Edit: running this against a second data file returns

    [
        ['35104', 'Angelica Alvarado\n669 66th St.\nSpringfield, OR 97478\n5412322525', 'JJ-CribSheet-Cheetah-PK-PRT', 'Cheetah Pink Print Microsuede Crib Sheet', '1']
    ]

which, on inspection, appears to be correct?
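If the end goal is an actual CSV-delimited string rather than nested lists, the rows in res can be fed to Python's csv module. A minimal sketch; the header row and quoting choice are mine, not part of the original answer:

    import csv
    import io

    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(['PONumber', 'Shipping info', 'SKU', 'Product', 'Qty'])  # header as shown in the question
    writer.writerows(res)          # res is the list of rows built above
    csv_string = out.getvalue()    # the multi-line ship-to address is quoted automatically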


Final Summary: I discovered that he was using html2text to convert the HTML email to text and then trying to parse that. The solution was instead to parse the HTML directly with BeautifulSoup, taking advantage of the page structure to identify the fields he wanted.
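That BeautifulSoup approach might look roughly like the sketch below. The tag layout is only a guess, since the actual HTML of the e-mail isn't shown here; it assumes the order items sit in table rows with one cell each for SKU, product and quantity:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_body, 'html.parser')   # html_body: the raw HTML e-mail (assumed variable)

    # Hypothetical structure: each order item is a table row with one cell
    # each for SKU, product name and quantity.
    items = []
    for tr in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 3 and cells[0].startswith('JJ-'):
            items.append(cells)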

Other tips

As per the comments, this kind of input data is better suited for a stateful parsing approach as opposed to regex solutions. There are certain lines which indicate the parsing state should change to capture a new set of data.

Ideally, you would first check whether this data source is available as JSON instead of what I presume is a web scrape of HTML. A JSON source would make this process trivial, since the data would already be in an object format.

If your only option is to work with this line-by-line source, you would be best off using something like pyparsing or, if that is overkill for your needs, looping over the lines and checking each one to see whether you should start or stop collecting a particular kind of data at the next token; a sketch of such a loop follows.
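A rough sketch of that hand-rolled stateful loop, assuming the converted text is in dataStr and using the marker lines from the sample above:

    po_number = None
    address = []
    items = []
    state = None

    for line in (ln.strip() for ln in dataStr.splitlines()):
        if not line:
            continue
        if line.startswith('**PO Number:**'):
            po_number = line.split('**PO Number:**', 1)[1].strip()
        elif line.startswith('**Ship To:**'):
            state = 'address'                # start collecting address lines
        elif line == 'SKU':
            state = 'headers'                # skip the SKU/Product/Qty header lines
        elif state == 'headers' and line == 'Qty':
            state = 'items'                  # everything after 'Qty' is item data
        elif state == 'address':
            address.append(line)
        elif state == 'items':
            items.append(line)

    # items is now a flat list of SKU, product, qty values in order
    rows = [[po_number, '\n'.join(address), items[i], items[i + 1], items[i + 2]]
            for i in range(0, len(items), 3)]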

As a last resort, you could run multiple regex patterns over the entire input. The reason you have to run it over the entire input is because your data spans lines. A basic regex for capturing the SKU/Product/Qty might be:

    re.findall(r'(JJ-[\w-]+)\n+(.*?)\n+(\d+)\n', dataStr)
    #[('JJ-Rug-Zebra-PK', 'Zebra Pink Rug', '1'),
    # ('JJ-Zebra-PK-Twin-4', 'Zebra Pink 4 Piece Twin Comforter Set', '1'),
    # ('JJ-TwinSheets-Zebra-PK', 'Zebra Pink 3 Piece Twin Sheet Set', '1'),
    # ('JJ-Memo-Zebra-PK', 'Zebra Pink Memory Board', '1')]

This finds each group of three lines matching those patterns and returns a list of tuples. I really don't recommend the regex approach, but it's an option.

Other regexes:

    re.search(r'\*{2}PO Number:\*{2}\s(\d+)\n', dataStr).groups()
    #('35102',)

    re.search(r'\*{2}Ship To:\*{2}\s+(.*?)\s+SKU', dataStr, re.DOTALL).groups()
    # captures the address block between 'Ship To:' and 'SKU':
    #('Georgie Clements\n6902 Stonegate Drive\nOdessa, TX 79765\n432-363-8459',)
    # (the input's blank lines are still embedded in the actual capture)

You can see how you would just need to build individual regexes for each bit of data.
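Putting those pieces together into the rows the question asks for might look like the following sketch, reusing the patterns above and cleaning the blank lines out of the captured address:

    import re

    po    = re.search(r'\*{2}PO Number:\*{2}\s(\d+)\n', dataStr).group(1)
    ship  = re.search(r'\*{2}Ship To:\*{2}\s+(.*?)\s+SKU', dataStr, re.DOTALL).group(1)
    ship  = '\n'.join(ln.strip() for ln in ship.splitlines() if ln.strip())   # drop the blank lines
    items = re.findall(r'(JJ-[\w-]+)\n+(.*?)\n+(\d+)\n', dataStr)

    rows  = [[po, ship, sku, product, qty] for sku, product, qty in items]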

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow