select
    addr,
    -- everything before the trailing run of non-digit text
    regexp_substr(addr, '^(.*?)\s\D+$', 1, 1, '', 1) street_number,
    -- the trailing non-digit text (street name and type)
    regexp_substr(addr, '^.*?\s+(\D*?)\s*$', 1, 1, '', 1) street_name
from t1
where -- don't show rejected
    regexp_like(addr, '\d.*\s(street|road|rd|ave|avenue|hwy)\s*$', 'i')
How do I split street values into atomic parts in PL/SQL?
21-04-2022
Question
My task is to convert non-atomic (Australian) street addresses into atomic values. Currently the street number and street name are stored together in a single field. Samples are:
24 George street -----------> 24 | George street
55 park rd -----------> 55 | park rd
102a gordon road -----------> 102a | gordon road
unit 5/46 addison ave -----------> unit 5/46 | addison ave
flat 2-9/87 north avenue -----------> flat 2-9/87 | north avenue
suit 5 lvl2/55 prince hwy -----------> suit 5 lvl2/55 | prince hwy
shop 5 Big Shopping Centre -----------> Rejected
Suit 2 Level 100 -----------> Rejected
Additional data (how the program should behave):
Darling street -----------> Rejected
City road -----------> Rejected
Result produced by the suggested code:
Darling street -----------> Darling | Street
City road -----------> City | road
In these cases the code should not process the address; it should throw an exception instead.
What is the best way of splitting the addresses?
Solution 2
Other suggestions
I assume you have already seen the answers to this question along the same lines.
@kaᵠ already pointed out that the program doesn't know anything about the data, so it has no context. This will always be the case. With that in mind, the first thing you need to determine is what level of accuracy you need. If you need 70% accuracy, then you can do that with simple REGEX. (Is regex EVER really simple?)
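For the regex-only route, here is a minimal sketch of how the split could be wrapped in PL/SQL so that an address that fails the pattern raises an exception instead of being split. It reuses the patterns from the query posted for this question; the procedure name split_street, the error number -20001, and the variable sizes are my own assumptions, not anything from the original code.

```sql
-- Sketch only: split_street and the -20001 error code are hypothetical choices.
create or replace procedure split_street (
    p_addr          in  varchar2,
    p_street_number out varchar2,
    p_street_name   out varchar2
) is
begin
    -- Reject NULLs and anything lacking a digit followed by a trailing street type.
    if p_addr is null
       or not regexp_like(p_addr, '\d.*\s(street|road|rd|ave|avenue|hwy)\s*$', 'i')
    then
        raise_application_error(-20001, 'Cannot split address: ' || p_addr);
    end if;

    -- Everything before the trailing run of non-digit text is the number part.
    p_street_number := regexp_substr(p_addr, '^(.*?)\s\D+$', 1, 1, null, 1);
    -- The trailing non-digit text is the street name and type.
    p_street_name   := regexp_substr(p_addr, '^.*?\s+(\D*?)\s*$', 1, 1, null, 1);
end split_street;
/

declare
    l_number varchar2(100);
    l_name   varchar2(100);
begin
    split_street('unit 5/46 addison ave', l_number, l_name);
    dbms_output.put_line(l_number || ' | ' || l_name);

    -- No digit in this address, so the check above raises ORA-20001.
    split_street('Darling street', l_number, l_name);
end;
/
```

The check only encodes the street types listed in the pattern, so an address with an unlisted type would be rejected as well; extend the alternation if that is too strict for your data.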
If you need certainty that the addresses you extracted from the input are actually real and valid, you need a list or table to compare against. That data would come from a source like Australia Post (or USPS in the United States).
So, use your regex to extract "guesses" and then verify those against a master list; the ones that match are good. Without the master list, you can't be sure whether you got it right or got it wrong.
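A rough sketch of that guess-then-verify step, assuming a reference table that I am calling street_master with a street_name column (both names are hypothetical; the real structure depends on whatever source you license the data from):

```sql
-- street_master(street_name) is a hypothetical reference table of known streets.
with guesses as (
    select addr,
           regexp_substr(addr, '^(.*?)\s\D+$', 1, 1, '', 1)      as street_number,
           regexp_substr(addr, '^.*?\s+(\D*?)\s*$', 1, 1, '', 1) as street_name
    from   t1
)
select g.addr,
       g.street_number,
       g.street_name,
       -- A guess counts as verified only if the master list knows the street.
       case
           when exists (select 1
                        from   street_master m
                        where  upper(m.street_name) = upper(g.street_name))
           then 'verified'
           else 'unverified'
       end as status
from   guesses g;
```

The comparison here is a plain case-insensitive match, so "park rd" would not match a master entry stored as "PARK ROAD". In practice you would probably normalise street-type abbreviations on both sides before comparing.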
I have actually been working on this exact same issue at SmartyStreets (except I only deal with US addresses) and have come up with a number of different solutions: different ways to determine the beginning and end of the address string, as well as how to deal with false positives, or primary numbers that look just like a postal code. You can go pure REGEX, or you can also use tables containing the postal codes, states, and street names. This gets you very close to extracting the atomic data with high accuracy.