Question

My task is to convert non-atomic Australian street addresses into atomic ones. Currently the street number and the street name are stored together in a single field, and I need to split them. Samples:

24 George street        -----------> 24         |   George street    
55 park rd              -----------> 55         |   park rd  
102a gordon road        -----------> 102a       |   gordon road
unit 5/46 addison ave   -----------> unit 5/46  |   addison ave 
flat 2-9/87 north avenue-----------> flat 2-9/87|   north avenue
suit 5 lvl2/55 prince hwy-------> suit 5 lvl2/55|   prince hwy
shop 5 Big Shopping Centre  ------> Rejected
Suit 2 Level 100          -------> Rejected

Additional data (how the program should behave):

Darling street ------------------> Rejected
City road -----------------------> Rejected

What the suggested code actually produced:

Darling street ------------>   Darling     |    Street
City road   --------------->   City        |     road

In this case the code should not process the address; it should throw an exception instead.

What is the best way of splitting the addresses?


Solution 2

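select
   addr,
   -- group 1: shortest prefix that is followed by whitespace and a digit-free tail
   regexp_substr(addr, '^(.*?)\s\D+$', 1, 1, '', 1) street_number,
   -- group 1: the digit-free tail, without surrounding whitespace
   regexp_substr(addr, '^.*?\s+(\D*?)\s*$', 1, 1, '', 1) street_name
from t1
where -- exclude rejected rows: require a digit and a known street type at the end
   regexp_like(addr, '\d.*\s(street|road|rd|ave|avenue|hwy)\s*$', 'i')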

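Since the question asks for an exception on rejected input, here is a rough Python sketch of the same idea as the query above. It reuses the query's patterns and street-type list; the function name `split_address` and the choice of `ValueError` are assumptions made for illustration, not part of the original answer.

```python
import re
from typing import Tuple

# Street types copied from the SQL filter above; extend as needed.
STREET_TYPES = r"(?:street|road|rd|ave|avenue|hwy)"

# Acceptance rule: a digit somewhere, then whitespace and a known
# street type at the very end (case-insensitive).
ACCEPT = re.compile(r"\d.*\s" + STREET_TYPES + r"\s*$", re.IGNORECASE)

# Group 1: shortest prefix followed by whitespace and a digit-free tail.
# Group 2: that digit-free tail, without trailing whitespace.
SPLIT = re.compile(r"^(.*?)\s+(\D*?)\s*$")


def split_address(addr: str) -> Tuple[str, str]:
    """Return (street_number, street_name) or raise ValueError (hypothetical helper)."""
    addr = addr.strip()
    if not ACCEPT.search(addr):
        raise ValueError(f"rejected address: {addr!r}")
    match = SPLIT.match(addr)
    if match is None:  # defensive; not expected once ACCEPT has passed
        raise ValueError(f"cannot split address: {addr!r}")
    return match.group(1), match.group(2)
```

With the sample data, `split_address("unit 5/46 addison ave")` yields `("unit 5/46", "addison ave")`, while `"Darling street"` and `"shop 5 Big Shopping Centre"` raise `ValueError` because they lack either a digit or a recognised street type.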

Other suggestions

I assume you have already seen the answers to this question along the same lines.

@kaᵠ has already pointed out that the program knows nothing about the data, so it has no context. This will always be the case. With that in mind, the first thing you need to determine is what level of accuracy you need. If 70% accuracy is enough, you can get there with simple REGEX. (Is regex EVER really simple?)

If you need certainty that the addresses you extracted from the input are actually real and valid, you need a list or table to compare against. That data would come from a source like Australia Post (or USPS in the United States).

So, use your regex to extract "guesses" and then verify those against a master list; the ones that match are good. Without the master list, you can't be sure whether you got it right or wrong.
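The guess-then-verify idea might look roughly like the Python sketch below. Everything named here is an assumption for illustration: `KNOWN_STREETS` stands in for a real reference table (for example one obtained from Australia Post), the abbreviation map is not an official standard, and `verified_split` is a hypothetical helper.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical stand-in for a master list; a real one would be loaded
# from a file or database supplied by the postal authority.
KNOWN_STREETS: Set[str] = {"george street", "park road", "gordon road"}

# Illustrative abbreviation map, not an official standard.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue", "hwy": "highway"}

# Regex "guess": a prefix, whitespace, then a digit-free tail.
GUESS = re.compile(r"^(.*?)\s+(\D*?)\s*$")


def normalise(street_name: str) -> str:
    """Lower-case the name and expand a trailing abbreviation."""
    words = street_name.lower().split()
    if words:
        words[-1] = ABBREVIATIONS.get(words[-1], words[-1])
    return " ".join(words)


def verified_split(addr: str) -> Optional[Tuple[str, str]]:
    """Return (number, name) only if the guessed name is in the master list."""
    match = GUESS.match(addr.strip())
    if match is None:
        return None
    number, name = match.group(1), match.group(2)
    if not any(ch.isdigit() for ch in number):
        return None  # the "number" part contains no digit at all
    if normalise(name) not in KNOWN_STREETS:
        return None  # the guess could not be confirmed
    return number, name
```

Here `verified_split("55 park rd")` returns `("55", "park rd")`, whereas `"Darling street"` returns `None`: the regex alone would guess `Darling | street`, but neither the digit check nor the lookup confirms it.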

I have actually been working on this exact same issue at SmartyStreets (except I only deal with US addresses) and have come up with a number of different solutions: different ways to determine the beginning and end of the address string, as well as how to deal with false positives, or primary numbers that look just like a postal code. You can go pure REGEX, or you can also use tables containing the postal codes, states, and street names. This enables you to get very close to extracting the atomic data with high accuracy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow