Say I have many lines of text, such as this one:
row = ' S.G. Primary School\t\t 434,612.50'
And I want to find a number that is formatted the way accountants write it, then look backwards and pull the word or words preceding that number. I have this for the number:
test = re.search(r"""(?=((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$))""",row)
S.G. Primary School 434,612.50
test.groups()
('434,612.50', '434', ',612', '.50')
Which looks correct. I have the full number and its parts (all of which I want). But I cannot figure out how to get the word (or phrase) before the number with a lookahead assertion.
I tried:
test = re.search(r"""([A-Za-z ].*) (?=((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$))""",row)
(' S.G. Primary School\t\t', '434,612.50', '434', ',612', '.50')
I spent 4 hours this week rereading regex docs and I still don't know if I am getting anywhere. Examples don't seem to work for me. I cannot use \w+ because I want the labels to be only text and spaces, but I also want to start counting backwards from the start of a matching number. That sounds like a "positive lookahead assertion" with the general format of "\w+(?=\d)" but that doesn't work for me.
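For reference, a minimal sketch of why the generic `\w+(?=\d)` form falls short on this row, assuming the label and the number are always separated by whitespace. A lookahead is zero-width, so `(?=\d)` demands a digit immediately after the word characters; moving the whitespace and the number pattern into the lookahead lets a letters-only label end before the gap. The names `naive` and `label` are illustrative only.

```python
import re

row = ' S.G. Primary School\t\t 434,612.50'

# \w+(?=\d) needs a digit directly after the word characters, so the only
# place it can succeed is inside the number itself.
naive = re.search(r"\w+(?=\d)", row)
print(naive.group())   # 43

# Put the separating whitespace and the number inside the lookahead, so the
# match itself is just the letters-only word that precedes the number.
label = re.search(r"[A-Za-z]+(?=\s+-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?$)", row)
print(label.group())   # School
```

The lookahead only checks that the number follows; it does not consume it, so the number still has to be captured separately if its parts are needed.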
Also - I am confused about the proper way to assign MULTIPLE lookahead assertions that ALL need to be true before the match returns:
is
r"""([A-Za-z ]*)(.*?)([\d,.]+)(?=[A-Za-z ]*)(?=[\d,.])"""
any different from
r"""([A-Za-z ]*)(?=[A-Za-z ]*)(.*?)([\d,.]+)(?=[\d,.])"""
because both yield the same result in this example:
(' S', '.G. Primary School\t\t ', '434,612.5')
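A small sketch of how stacked lookaheads behave, under the assumption that the goal is "all conditions must hold at the same position". Each lookahead is tested at the point where it appears in the pattern and consumes nothing. A lookahead such as `(?=[A-Za-z ]*)` can match the empty string, so it always succeeds and never changes the result, wherever it is placed. The names `p1`, `p2`, `p3`, and `has_both` are illustrative.

```python
import re

row = ' S.G. Primary School\t\t 434,612.50'

p1 = r"([A-Za-z ]*)(.*?)([\d,.]+)(?=[A-Za-z ]*)(?=[\d,.])"
p2 = r"([A-Za-z ]*)(?=[A-Za-z ]*)(.*?)([\d,.]+)(?=[\d,.])"
# Same pattern with the always-true lookahead removed entirely.
p3 = r"([A-Za-z ]*)(.*?)([\d,.]+)(?=[\d,.])"

results = [re.search(p, row).groups() for p in (p1, p2, p3)]
print(results[0] == results[1] == results[2])   # True

# Two lookaheads that can actually fail: both must succeed at position 0.
has_both = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)")
print(has_both.search(row) is not None)          # True
print(has_both.search('434,612.50') is None)     # True (no letters)
print(has_both.search('Primary School') is None) # True (no digits)
```

The remaining `(?=[\d,.])` is what trims the last digit: it requires one more number character after the captured run, which is why the third group ends at `.5`.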
UPDATE
Here are three examples for which I am struggling to find a regex answer:
import re
rows = [' S.G. Primary School\t\t 434,612.50',
' S.G. Bad Primary School\t\t 434,612.50',
' N.3#=42^2492q\t\t\t 434,612.50']
for row in rows:
    test = re.search(r"""(?!\s)([A-Za-z]{0,25}) ?([a-zA-Z]{6,25}).*?(?=(?:(?:-?\d{1,3})(?:,\d{3})*(?:\.\d\d)?$|^\.\d\d$))((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)""", row)
    if test is not None:
        print(test.groups())
    else:
        print(test)
This returns:
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Bad', 'Primary', '434,612.50', '434', ',612', '.50')
None
I would like the result to be:
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('', '434,612.50', '434', ',612', '.50')
And I would like the code to be adjustable so that I could also return:
('School', '434,612.50', '434', ',612', '.50')
('School', '434,612.50', '434', ',612', '.50')
('', '434,612.50', '434', ',612', '.50')
with modifications.
UPDATE
Based on Casimir's answer, this returns better data, but I do not understand how to get multi-word phrases preceding the number:
test = re.search(r'([A-Za-z][A-Za-z_.]*){1,2}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)
('School', '434,612.50', '434', ',612', '.50')
('School', '434,612.50', '434', ',612', '.50')
('q', '434,612.50', '434', ',612', '.50')
and I don't know why
test = re.search(r'([A-Za-z_.]*){1,2}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)
Gives an error: nothing to repeat. All I've done is change
([A-Za-z][A-Za-z_.]*){1,2}
to
([A-Za-z_.]*){1,2}
in the first group.
Perhaps:
test = re.search(r'([A-Za-z][A-Za-z_.]*){0,}\s+([A-Za-z][A-Za-z_.]*){0,}\s+((-?\d{1,3})(,\d{3})*(\.\d\d)?$|^\.\d\d$)',row)
is better, because I get the first word and the second word back, but I am not sure how to combine them into one group and make them optional:
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('Primary', 'School', '434,612.50', '434', ',612', '.50')
('q', None, '434,612.50', '434', ',612', '.50')
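On combining the words, a hedged sketch of one detail that seems relevant: when a quantifier is applied to a capturing group, the group keeps only its last repetition, whereas a quantifier placed inside the capturing group yields one capture spanning the whole phrase. The simplified `number` pattern and the names `last_only` and `phrase` are assumptions made for illustration, not the patterns used above.

```python
import re

row = ' S.G. Primary School\t\t 434,612.50'
number = r"(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)$"

# Quantifier outside the capturing group: two words are matched, but the
# group remembers only the final repetition.
last_only = re.search(r"(?:\s*([A-Za-z]+)){1,2}\s+" + number, row)
print(last_only.groups())   # ('School', '434,612.50')

# Quantifier inside the capturing group: the optional second word is part
# of the same capture, so the phrase comes back as one string.
phrase = re.search(r"([A-Za-z]+(?:\s+[A-Za-z]+){0,1})\s+" + number, row)
print(phrase.groups())      # ('Primary School', '434,612.50')
```

Adding `?` after the outer capturing group would make the whole phrase optional.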
UPDATE
I've taken Casimir's answer, slightly modified ({0,2} changed to {0,1}), and tested it with a findall version:
import re
rows = [' S.G. Primary School\t\t 434,612.50 S.G. Primary School\t\t 434,612.50',
' S.G. Bad Primary School\t\t 434,612.50 Bad Primary School\t\t 434,612.50',
' N.3#=42^2492q\t\t\t 434,612.50 N.3#=42^2492q\t\t\t 434,612.50 N.3#=42^2492q\t\t\t 434,612.50 ']
for row in rows:
    # first version
    test = re.findall(r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+((-?\d{1,3})(?:,\d{3})*(?:\.\d\d)?$|^\.\d\d$)", row)
    # second version (overwrites the first unless commented out)
    test = re.findall(r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)", row)
    print(test)
But the first test returns this (when the second test statement is commented out):
[('Primary School', '434,612.50', '434')]
[('Primary School', '434,612.50', '434')]
[]
And the second test statement returns this, a list of results - what I want, sorta:
[('Primary School', '434,612.50'), ('Primary School', '434,612.50')]
[('Primary School', '434,612.50'), ('Primary School', '434,612.50')]
[('q', '434,612.50'), ('q', '434,612.50'), ('q', '434,612.50')]
But the statements are so similar, I don't know why one is missing the multiple numbers / labels in the list.
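A minimal sketch isolating the one visible difference between the two statements, which is the `$` anchor after the number. It uses a simplified version of the first pattern (the `^\.\d\d$` alternative and the inner group are dropped), and the names `label`, `number`, `anchored`, and `unanchored` are illustrative. With `$`, a number only matches when it ends the string, so at most the last label/number pair can be found, and a trailing space prevents any match.

```python
import re

row = ' S.G. Primary School\t\t 434,612.50 S.G. Primary School\t\t 434,612.50'
label = r"(?i)([a-z][a-z_.]*(?:\s+[a-z][a-z_.]*){0,1})?\s+"
number = r"(-?\d{1,3}(?:,\d{3})*(?:\.\d\d)?)"

# With $, only a number sitting at the very end of the string can match.
anchored = re.findall(label + number + r"$", row)
print(anchored)     # [('Primary School', '434,612.50')]

# Without $, every label/number pair in the row is returned.
unanchored = re.findall(label + number, row)
print(unanchored)   # [('Primary School', '434,612.50'), ('Primary School', '434,612.50')]

# A trailing space after the last number means $ can never be reached.
trailing = re.findall(label + number + r"$", ' q\t\t 434,612.50 ')
print(trailing)     # []
```

This is consistent with the empty list for the third row, whose string ends in a space.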