Find a US street address in text (preferably using Python regex)

https://stackoverflow.com/questions/18368086

26-06-2022

Question

Disclaimer: I have read this thread very carefully: "Street Address search in a string - Python or Ruby", along with many other resources.

Nothing works for me so far.

In more detail, here is what I am looking for:

The rules are relaxed, and I am definitely not asking for perfect code that covers all cases; just a few simple, basic ones, assuming that the address is in this format:

a) Street number (1...N digits);

b) Street name : one or more words capitalized;

b-2) (optional) ideally the street name could be prefixed with an abbreviation: "S.", "N.", "E.", "W."

c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters

d) Street "type": one of ("st.", "ave.", "way");

e) City name : 1 or more Capitalized words;

f) (optional) state abbreviation (2 letters)

g) (optional) zip which is any 5 digits.

None of the above needs to be a valid thing (e.g. an existing city or zip).

I am trying expressions like these so far:

pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)

>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")

This doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow, such as a state abbreviation or a street "type" ("st.", "ave.")?

Anyhow, here is an example of what I am hoping to get. Given:

def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass  # placeholder body

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',

    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver avenue in Ottawa? \nThanks!!!',

    'This was written in 1999 in Montreal',

    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",

    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))

I would like to get:

'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"

Could you please help?

Solution 2

\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?

In this regex, you have one space too many: the literal space after the second comma comes right before ( \w+){1,5}, which itself already begins with a space. With that extra space removed, the regex matches your example.
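To make the extra-space point concrete, here is a small check of the question's pattern against its own example, before and after dropping the space. The names `TEXT`, `original` and `fixed` are only illustrative.

```python
import re

TEXT = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"

# Pattern exactly as posted: ", " followed by "( \w+)" demands two spaces in a row.
original = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE)

# Same pattern with the space after the second comma removed.
fixed = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE)

print(original.search(TEXT))        # None: the sample never contains two spaces in a row
print(fixed.search(TEXT).group(0))  # the whole sample address
```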

I don't think you can assume that a "unit 123" or similar will be there; there might also be several such parts (e.g. "building A, apt 3"). Note that in your initial regex the . can also match a comma, which could lead to very long (and unwanted) matches. You should probably accept several such groups, with a limit on their number and length (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}).
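As a rough illustration of the difference, the sketch below compares the greedy (.*) with the bounded group on a made-up text containing two addresses. The sample text and the names `TAIL`, `greedy` and `bounded` are assumptions for the demo, not part of the question.

```python
import re

# Hypothetical sample: two addresses separated by a longer sentence.
TEXT = ("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444. "
        "Our other office is located at 5 Oak st, Los Gatos, CA, 95030")

# Shared tail: city words, state abbreviation, zip (extra space already removed).
TAIL = r',( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?'

# ", (.*)" may swallow commas, so it can run on into the second address.
greedy = re.compile(r'\d{1,4}( \w+){1,5}, (.*)' + TAIL, re.IGNORECASE)

# Up to five extra comma-separated parts, each at most 20 non-comma characters.
bounded = re.compile(r'\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5}' + TAIL, re.IGNORECASE)

print(greedy.search(TEXT).group(0))   # runs through to "95030"
print(bounded.search(TEXT).group(0))  # stops at "94444"
```

The bounded form also makes the unit part optional, so an address with no "unit 123" at all can still match.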

In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
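In the spirit of testing, here is one possible sketch of the ex_addr function from the question, checked only against the question's own samples. It rests on several assumptions that go beyond the stated rules: "street", "avenue" and "Lane" are added to the street types because the samples use them, " in " is accepted as a separator before the city, the state is treated as just another capitalized word after the city, and the optional unit part is left out. The name `ADDRESS` is illustrative.

```python
import re

# Verbose pattern; the letters refer to the rules listed in the question.
ADDRESS = re.compile(r"""
    \b\d+\s+                                   # a) street number
    (?:[NSEW]\.\s+)?                           # b-2) optional direction prefix
    (?:[A-Z][a-z]+\s+)+                        # b) capitalized street-name words
    (?:st\.|street|ave\.|avenue|way|[Ll]ane)   # d) street type (extended: assumption)
    (?:,\s*|\s+in\s+)                          # separator seen in the samples (assumption)
    [A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*       # e, f) city and/or state words
    (?:,?\s+\d{5}\b)?                          # g) optional 5-digit zip
""", re.VERBOSE)


def ex_addr(text):
    """Return the first address-like substring of text, or None."""
    m = ADDRESS.search(text)
    return m.group(0) if m else None


print(ex_addr("Cool cafe at 420 Funny Lane, Cupertino CA is way too cool"))
print(ex_addr("This was written in 1999 in Montreal"))
```

Because it leans on capitalization, the pattern is deliberately compiled without re.IGNORECASE; real-world text will need more street types and looser rules.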

Other suggestions

I just ran across this on GitHub, as I am having a similar problem. It appears to work and to be more robust than your current solution.

https://github.com/madisonmay/CommonRegex

Looking at the code, the regex for street addresses accounts for many more scenarios:

'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
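For a quick feel of what this pattern does, the sketch below compiles the regex exactly as quoted and runs it on three of the question's samples. Compiling with re.IGNORECASE and the name `street_address` are assumptions made here; check the repository for how the library actually uses the pattern.

```python
import re

# The quoted pattern, split across lines only for readability.
street_address = re.compile(
    r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq'
    r'|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)'
    r'\W?(?=\s|$)',
    re.IGNORECASE)  # assumption: case-insensitive matching

samples = [
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'This was written in 1999 in Montreal',
    'Cool cafe at 420 Funny Lane, Cupertino CA is way too cool',
]
for s in samples:
    m = street_address.search(s)
    print(m.group(0) if m else None)
```

Two things stand out: the pattern covers only the street part (number, name, type), not city, state or zip, and "Lane" is not among its street types, so the third sample finds nothing. The optional \W also pulls the trailing comma into the first match.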

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow