Question

I know it can't be perfect, but I'm not very good with regex and I'm having trouble getting a better match rate.

I have a file with over 9 million rows, and the addresses are very inconsistent. I was hoping to get some help from people here who are better at this than I am. Any help would be greatly appreciated.

This is what I have so far. I thought the best approach would be to match the pattern from the end of the string, since APT, BX, PO BOX, etc. could be at the start of the string.

/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/

These are the patterns I can see. The long runs of spaces are exactly as they appear in the file. I tried splitting on two or more spaces, and the regex I have so far relies on the same separator.

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS ZIP         CITY STATE

ADDRESS        CITY STATE

ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

P O BOX #             ADDRESS        CITY STATE

APT DIGIT#         ADDRESS CITY STATE 

SPACE DIGIT    ADDRESS      CITY STATE

UNIT #         ADDRESS     CITY STATE

SP DIGIT          ADDRESS      CITY STATE

DIGITS-DIGITS ADDRESS       CITY STATE

BX DIGIT       ADDRESS         CITY STATE

ADDRESS     APT #      CITY STATE

ADDRESS       UNIT #     CITY STATE

ADDRESS   P O BOX   DIGIT     CITY STATE

P O B O X    DIGIT      CITY STATE

P O BOX DIGIT    CITY      STATE

ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY STATE
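
The split-on-two-or-more-spaces idea, read from the end of the row, could be sketched roughly as follows. This is only an illustration in Python: the function name `split_from_end` and the returned keys are made up for the example, and it assumes the state is always a two-letter token at the very end of the row.

```python
import re

# Assumption: a state is exactly two letters.
STATE_RE = re.compile(r"^[A-Za-z]{2}$")

def split_from_end(line):
    """Split a row on runs of 2+ spaces and read city/state from the right."""
    chunks = re.split(r"\s{2,}", line.strip())
    last = chunks.pop()
    if STATE_RE.match(last) and chunks:
        # "CITY      STATE": the state stands alone, the city is the chunk before it
        state, city = last, chunks.pop()
    else:
        # "CITY STATE": the state is the last word of the final chunk
        city, _, state = last.rpartition(" ")
    # Whatever is left (names, unit, street, zip) is returned unparsed
    return {"rest": chunks, "city": city, "state": state}
```

Rows where the street and the city are separated by a single space (for example "APT DIGIT#         ADDRESS CITY STATE") cannot be told apart this way; the street ends up inside the city value.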

Solution

This is a rather complex problem that sadly won't have a simple solution.

You could try the following regex, which is admittedly far from perfect:

^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)

The capture groups are: group 1 = address; group 2 = zip; group 3 = city; group 4 = state
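
For anyone who wants to try the pattern outside of PCRE, here is a rough Python port. The only change to the regex is the named-group syntax, since Python's `re` module expects `(?P<name>...)`; the helper name `parse_row` is just for this example. I have only checked it against the two sample rows shown in the comments, so treat it as a starting point and not as a verified equivalent.

```python
import re

# Same pattern as above, with (?<name>...) rewritten as (?P<name>...)
ADDRESS_RE = re.compile(
    r"^.*?(?P<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?"
    r"(?P<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)"
    r"\s{2,}(?P<city>\b(?:\w|\s(?=\S))+\b)"
    r"\s{1,}(?P<state>\b\w{2,3}\b)(?:$|\r|\n)"
)

def parse_row(line):
    """Return the named groups for one row, or None if the row does not match."""
    m = ADDRESS_RE.search(line)
    if m is None:
        return None
    # Groups that did not participate (usually zip) come back as ""
    return m.groupdict(default="")

# parse_row("ADDRESS        CITY st")
#   address "ADDRESS", zip "", city "CITY", state "st"
# parse_row("F_NAME L_NAME          ADDRESS 12345         CITY st")
#   address "ADDRESS 12345", zip "12345", city "CITY", state "st"
```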

Input. Note that I changed STATE to st, ZIP to 12345, and the PO box digits to actual digits:

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
ADDRESS        CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
P O BOX # 1234            ADDRESS        CITY st
APT DIGIT#         ADDRESS CITY st
SPACE DIGIT    ADDRESS      CITY st
UNIT #         ADDRESS     CITY st
SP DIGIT          ADDRESS      CITY st
DIGITS-DIGITS ADDRESS       CITY st
BX DIGIT       ADDRESS         CITY st
ADDRESS     APT #      CITY st
ADDRESS       UNIT #     CITY st
ADDRESS   P O BOX   3245     CITY st
P O B O X    123      CITY st
P O BOX 345    CITY      st
ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st

Matches

[0] => Array
(
    [0] => F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
    [1] => ADDRESS        CITY st
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [3] => APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [4] => P O BOX # 1234            ADDRESS        CITY st
    [5] => APT DIGIT#         ADDRESS CITY st
    [6] => SPACE DIGIT    ADDRESS      CITY st
    [7] => UNIT #         ADDRESS     CITY st
    [8] => SP DIGIT          ADDRESS      CITY st
    [9] => DIGITS-DIGITS ADDRESS       CITY st
    [10] => BX DIGIT       ADDRESS         CITY st
    [11] => ADDRESS     APT #      CITY st
    [12] => ADDRESS       UNIT #     CITY st
    [13] => ADDRESS   P O BOX   DIGIT     CITY st
    [14] => P O B O X    123      CITY st
    [15] => P O BOX 345    CITY      st
    [16] => ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st
)

[address] => Array
(
    [0] => ADDRESS 12345
    [1] => ADDRESS
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [4] => ADDRESS
    [5] => APT DIGIT#
    [6] => ADDRESS
    [7] => ADDRESS
    [8] => ADDRESS
    [9] => DIGITS-DIGITS ADDRESS
    [10] => ADDRESS
    [11] => APT #
    [12] => UNIT #
    [13] => DIGIT
    [14] => 123
    [15] => P O BOX 345
    [16] => SPACE/SP/SPC/UNIT DIGIT
)

[zip] => Array
(
    [0] => 12345
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => 
    [6] => 
    [7] => 
    [8] => 
    [9] => 
    [10] => 
    [11] => 
    [12] => 
    [13] => 
    [14] => 
    [15] => 
    [16] => 
)

[city] => Array
(
    [0] => CITY
    [1] => CITY
    [2] => CITY
    [3] => CITY
    [4] => CITY
    [5] => ADDRESS CITY
    [6] => CITY
    [7] => CITY
    [8] => CITY
    [9] => CITY
    [10] => CITY
    [11] => CITY
    [12] => CITY
    [13] => CITY
    [14] => CITY
    [15] => CITY
    [16] => CITY
)


[state] => Array
(
    [0] => st
    [1] => st
    [2] => st
    [3] => st
    [4] => st
    [5] => st
    [6] => st
    [7] => st
    [8] => st
    [9] => st
    [10] => st
    [11] => st
    [12] => st
    [13] => st
    [14] => st
    [15] => st
    [16] => st
)

I also recommend having a look at question 11160192.

Other suggestions

I think Denomales' answer is quite sufficient for your needs, but I'm going to expand my comment above into an answer, since there are some relevant points specific to your question.

Are they US addresses? You could try an API or tool to extract the addresses en masse. One such tool was demonstrated in another recent Stack Overflow answer, which had a small list of addresses to match.

For disclosure, I work at SmartyStreets and helped develop this. While it isn't designed specifically for spreadsheet or tabular address data, it was designed for non-uniform input like freeform text. You can even feed millions of rows into the service in pieces.

Perhaps this will be helpful, since it also validates the addresses after it finds them in the text. Addresses are really gnarly, as you're discovering, and a dedicated tool can sometimes be the best way to handle them. I'm not saying this is the correct answer for your case, but hopefully it is still informative.
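
Feeding a multi-million-row file to any external tool "in pieces" mostly comes down to reading it in batches. A minimal Python sketch of that step is below. It does not call any real API: `batched_lines` is a hypothetical helper, and the batch size and what you do with each batch are assumptions you would adapt to whichever tool you pick.

```python
from itertools import islice

def batched_lines(lines, size):
    """Yield lists of up to `size` items from any iterable of lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage: stream a large file without loading it all at once.
# with open("addresses.txt") as f:
#     for batch in batched_lines(f, 10000):
#         ...  # hand this batch to the tool or service of your choice
```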

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow