Lookahead assertions and grouping with python regex

https://stackoverflow.com/questions/18282035

24-06-2022

Question

Say I have many lines of text, such as this one:

row = '   S.G. Primary School\t\t 434,612.50'

I want to find a number formatted the way accountants write it, then look backwards and pull the word or words preceding that number. I have this for the number:

test = re.search(r"""(?=((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$))""",row)
   S.G. Primary School       434,612.50
test.groups()
('434,612.50', '434', ',612', '.50')

Which looks correct. I have the full number and the parts of it (all of which I want). But I cannot figure out how to get the word (or phrase) before the number with a look ahead assertion.
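For reference, a lookahead is zero-width: the groups inside it are still captured, but the overall match consumes nothing. The sketch below compares the pattern above with the same pattern minus the `(?=...)` wrapper. The names `NUMBER`, `with_lookahead`, and `plain` are illustrative only and not part of the original code.

```python
import re

row = '   S.G. Primary School\t\t 434,612.50'

# The accounting-number pattern from the question, without the lookahead wrapper.
NUMBER = r"((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)"

with_lookahead = re.search("(?=" + NUMBER + ")", row)
plain = re.search(NUMBER, row)

# Both versions capture the same groups...
print(with_lookahead.groups())
print(plain.groups())

# ...but the lookahead consumes nothing, so its overall match is empty.
print(repr(with_lookahead.group(0)))
print(repr(plain.group(0)))
```

Because the captured groups are identical, the lookahead wrapper adds nothing when the goal is only to extract the number and its parts.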

I tried:

test = re.search(r"""([A-Za-z ].*) (?=((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$))""",row)
('   S.G. Primary School\t\t', '434,612.50', '434', ',612', '.50')

I spent 4 hours this week rereading regex docs and I still don't know if I am getting anywhere. Examples don't seem to work for me. I cannot use \w+ because I want the labels to be only text and spaces, but I also want to start counting backwards from the start of a matching number. That sounds like a "positive lookahead assertion" with the general format of "\w+(?=\d)" but that doesn't work for me.
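One likely reason `\w+(?=\d)` fails on this row: a lookahead is tested at the exact position where the preceding match ends, and here the label is followed by whitespace rather than a digit. In addition, `\w` itself matches digits. A minimal sketch, assuming the label is the run of letters just before the whitespace that precedes the number (the `label` name is illustrative):

```python
import re

row = '   S.G. Primary School\t\t 434,612.50'

# \w also matches digits, and the lookahead demands a digit right after the match,
# so the only place this can succeed is inside the number itself.
inside_number = re.search(r"\w+(?=\d)", row)
print(inside_number.group())

# Allowing the whitespace inside the lookahead lets a letters-only run match.
label = re.search(r"[A-Za-z]+(?=\s+\d)", row)
print(label.group())
```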

Also - I am confused about the proper way to assign MULTIPLE lookahead assertions that ALL need to be true before the match returns:

r"""([A-Za-z ]*)(.*?)([\d,.]+)(?=[A-Za-z ]*)(?=[\d,.])"""

Is that any different from the following?

r"""([A-Za-z ]*)(?=[A-Za-z ]*)(.*?)([\d,.]+)(?=[\d,.])"""

Both yield the same result in this example:

('   S', '.G. Primary School\t\t ', '434,612.5')
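A probable explanation for the identical results: `(?=[A-Za-z ]*)` can match the empty string, so it succeeds at every position and constrains nothing wherever it is placed. Consecutive lookaheads are all tested at the same position, and each one must succeed before matching continues. A small sketch of both points (the variable names and the `"4,5"` and `"12,3"` sample strings are made up for illustration):

```python
import re

row = '   S.G. Primary School\t\t 434,612.50'

first = r"""([A-Za-z ]*)(.*?)([\d,.]+)(?=[A-Za-z ]*)(?=[\d,.])"""
second = r"""([A-Za-z ]*)(?=[A-Za-z ]*)(.*?)([\d,.]+)(?=[\d,.])"""

# Same groups from both patterns.
groups_first = re.search(first, row).groups()
groups_second = re.search(second, row).groups()
print(groups_first)
print(groups_second)

# A lookahead that may match nothing succeeds everywhere:
# one empty match per position in the string, including the end.
always_true = re.findall(r"(?=[A-Za-z ]*)", "4,5")
print(len(always_true))

# Two adjacent lookaheads act as a logical AND at one position:
# the next character must be a digit AND the one after it must be a comma.
both = re.search(r"(?=\d)(?=.,)\d", "12,3")
print(both.group(), both.start())
```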

UPDATE

Here are three examples for which I am struggling to find a regex answer:

import re
rows = ['   S.G. Primary School\t\t 434,612.50',
       '   S.G. Bad Primary School\t\t 434,612.50',
       '   N.3#=42^2492q\t\t\t 434,612.50']

for row in rows:
    test = re.search(r"""(?!\s)([A-Za-z]{0,25}) ?([a-zA-Z]{6,25}).*?(?=(?:(?:-?\d{1,3})(?:,\d{3})*(?:\.\d\d)?$|^\.\d\d$))((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)""",row)
    if test is not None:
        print(test.groups())
    else:
        print(test)

This returns:

('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Bad', 'Primary', '434,612.50', '434', ',612', '.50')
None

I would like the result to be:

('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('', '434,612.50', '434', ',612', '.50')

And I would like the code to be adjustable so that I could also return:

('School', '434,612.50', '434', ',612', '.50')
('School', '434,612.50', '434', ',612', '.50')
('', '434,612.50', '434', ',612', '.50')

with modifications.

UPDATE

Based on Casimir's answer, this returns better data, but I do not understand how to get multiple-word phrases preceding the number:

test = re.search(r'([A-Za-z][A-Za-z_.]*){1,2}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)
('School', '434,612.50', '434', ',612', '.50')
('School', '434,612.50', '434', ',612', '.50')
('q', '434,612.50', '434', ',612', '.50')

and I don't know why

test = re.search(r'([A-Za-z_.]*){1,2}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)

gives the error "nothing to repeat". All I've done is change

([A-Za-z][A-Za-z_.]*){1,2}

to

([A-Za-z_.]*){1,2}

in the first group.

Perhaps:

test = re.search(r'([A-Za-z][A-Za-z_.]*){0,}\s+([A-Za-z][A-Za-z_.]*){0,}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)

is better, because I get the first word and the second word back, but I am not sure how to combine them and make them optional:

('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('q', None, '434,612.50', '434', ',612', '.50')

UPDATE

I've taken Casimir's answer, slightly modified ({0,2} changed to {0,1}), and tested it with findall:

import re
rows = ['   S.G. Primary School\t\t 434,612.50 S.G. Primary School\t\t 434,612.50',
       '   S.G. Bad Primary School\t\t 434,612.50 Bad Primary School\t\t 434,612.50',
       '   N.3#=42^2492q\t\t\t 434,612.50  N.3#=42^2492q\t\t\t 434,612.50  N.3#=42^2492q\t\t\t 434,612.50 ']

for row in rows:
    test = re.findall(r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+((-?\d{1,3})(?:,\d{3})*(?:\.\d\d)?$|^\.\d\d$)",row)
    test = re.findall(r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)",row)
    print(test)

But the first test returns this (when the second test statement is commented out):

[('Primary School', '434,612.50', '434')]
[('Primary School', '434,612.50', '434')]
[]

And the second test statement returns this, a list of results - what I want, sorta:

[('Primary School', '434,612.50'), ('Primary School', '434,612.50')]
[('Primary School', '434,612.50'), ('Primary School', '434,612.50')]
[('q', '434,612.50'), ('q', '434,612.50'), ('q', '434,612.50')]

But the statements are so similar, I don't know why one is missing the multiple numbers / labels in the list.
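A likely cause of the difference is the `$` anchor in the first statement: `$` only matches at the end of the string (or just before a final newline), so only the last number in each row can qualify, and a trailing space after the last number rules out every match. The extra third element in each tuple comes from the inner capturing group `(-?\d{1,3})`. The sketch below uses a shortened, made-up `text` value to show the effect; `anchored` and `unanchored` are illustrative names for the two patterns above.

```python
import re

text = 'Primary School\t 434,612.50 Primary School\t 434,612.50'

anchored = r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+((-?\d{1,3})(?:,\d{3})*(?:\.\d\d)?$|^\.\d\d$)"
unanchored = r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)"

# With `$`, only the number that ends the string can match.
only_last = re.findall(anchored, text)
print(only_last)

# Without `$`, every label/number pair is returned.
every_pair = re.findall(unanchored, text)
print(every_pair)

# Trailing whitespace after the last number defeats `$` entirely.
none_found = re.findall(anchored, text + ' ')
print(none_found)
```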

Solution

You don't need lookahead at all:

(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,2})?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)

With the {0,n} quantifier you can control how many words are captured (up to n + 1). To capture all the words, replace it with *. To capture one word at most, remove the whole non-capturing group:

(?i)([a-z][a-z_.]*)?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)

If you want exactly 3 words:

(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){2})\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)
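To see the effect of the quantifier, the sketch below runs the same pattern with three different quantifiers against one of the rows from the question. `HEAD`, `TAIL`, and `labels` are illustrative names; the pattern pieces are copied from the patterns above.

```python
import re

row = '   S.G. Bad Primary School\t\t 434,612.50'

# The pattern is split around the quantifier so it can be swapped out.
HEAD = r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*)"
TAIL = r")?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)"

labels = {}
for quantifier in ("{0,1}", "{0,2}", "*"):
    match = re.search(HEAD + quantifier + TAIL, row)
    labels[quantifier] = match.group(1)
    print(quantifier, repr(match.group(1)))
```

Note that "S.G." counts as a word here because the character class `[a-z_.]` includes the dot.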

To avoid capturing trailing letters from a "non-word" (like the "q"), require the label to start at the beginning of the string or right after whitespace:

(?i)((?:^|(?<=\s))[a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,2})?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)

Pattern details:

(?i)                      # make the pattern case insensitive
(                         # open the first capturing group
    (?:^|(?<=\s))         # beginning of the string, or a lookbehind for whitespace
    [a-z][a-z_.]*         # a letter and zero or more chars from [a-z_.]
    (?:                   # open a non-capturing group
        \s+               # one or more whitespace characters
        [a-z][a-z_.]*     # a letter and zero or more chars from [a-z_.]
    ){0,2}                # repeat the non-capturing group zero to two times
)?                        # close the capturing group and make it optional
\s+                       # one or more whitespace characters
(                         # open a capturing group
    -?                    # - sign optional
    \d{1,3}               # between 1 and 3 digits
    (?:,\d{3})*           # a group (zero or more times) with a , and 3 digits
    (?:\.\d\d)?           # an optional group with a . and 2 digits
)                         # close the second capturing group.
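
As a quick check, here is a minimal sketch that applies this last pattern to the three rows from the question. An optional group that does not participate in the match comes back as None from `group()`, so the sketch normalises it to an empty string, which is an assumption about the desired output. `PATTERN`, `results`, and `label` are illustrative names.

```python
import re

rows = ['   S.G. Primary School\t\t 434,612.50',
        '   S.G. Bad Primary School\t\t 434,612.50',
        '   N.3#=42^2492q\t\t\t 434,612.50']

# The pattern with the lookbehind guard, split over two lines for readability.
PATTERN = re.compile(
    r"(?i)((?:^|(?<=\s))[a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,2})?"
    r"\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)"
)

results = []
for row in rows:
    match = PATTERN.search(row)
    # The label group is optional; it is None when no word precedes the number.
    label = match.group(1) or ''
    results.append((label, match.group(2)))
    print(results[-1])
```

With {0,2} the first row yields the three-word label "S.G. Primary School"; switching to {0,1} would limit it to "Primary School".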

Licensed under: CC-BY-SA with attribution
