Understanding Pyparsing for street addresses

https://stackoverflow.com/questions/23523365

17-07-2023

Question

While searching for ways to build a better address locator for processing a single-field address table, I came across the Pyparsing module. On the Examples page there is a script called "streetAddressParser" (author unknown), which I've copied in full below. Although I've read the documentation and looked at O'Reilly's recursive descent parser tutorials, I'm still confused by the code for this address parser. I'm aware that this parser would be just one component of an address locator application, but my Python experience is limited to GIS scripting and I'm struggling to understand certain parts of this code.

First, what is the purpose of defining numbers as "Zero One Two Three...Eleven Twelve Thirteen...Ten Twenty Thirty..."? If we know an address field starts with integers representing the street number why not just extract that as the first token?

Second, why does this script use so many bitwise operators (^, |, ~)? Is this because of performance gains or are they treated differently in the Pyparsing module? Could other operators be used in place of them and produce the same result?

I'm grateful for any guidance offered and I appreciate your patience in reading this. Thank you!

from pyparsing import *

# define number as a set of words
units = oneOf("Zero One Two Three Four Five Six Seven Eight Nine Ten "  # trailing space keeps "Ten" and "Eleven" separate
          "Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen",
          caseless=True)
tens = oneOf("Ten Twenty Thirty Forty Fourty Fifty Sixty Seventy Eighty Ninety",caseless=True)
hundred = CaselessLiteral("Hundred")
thousand = CaselessLiteral("Thousand")
OPT_DASH = Optional("-")
numberword = ((( units + OPT_DASH + Optional(thousand) + OPT_DASH + 
                  Optional(units + OPT_DASH + hundred) + OPT_DASH + 
                  Optional(tens)) ^ tens ) 
               + OPT_DASH + Optional(units) )

# number can be any of the forms 123, 21B, 222-A or 23 1/2
housenumber = originalTextFor( numberword | Combine(Word(nums) + 
                    Optional(OPT_DASH + oneOf(list(alphas))+FollowedBy(White()))) + 
                    Optional(OPT_DASH + "1/2")
                    )
numberSuffix = oneOf("st th nd rd").setName("numberSuffix")
streetnumber = originalTextFor( Word(nums) + 
                 Optional(OPT_DASH + "1/2") +
                 Optional(numberSuffix) )

# just a basic word of alpha characters, Maple, Main, etc.
name = ~numberSuffix + Word(alphas)

# types of streets - extend as desired
type_ = Combine( MatchFirst(map(Keyword,"Street St Boulevard Blvd Lane Ln Road Rd Avenue Ave "
                        "Circle Cir Cove Cv Drive Dr Parkway Pkwy Court Ct Square Sq "
                        "Loop Lp".split())) + Optional(".").suppress())

# street name 
nsew = Combine(oneOf("N S E W North South East West NW NE SW SE") + Optional("."))
streetName = (Combine( Optional(nsew) + streetnumber + 
                        Optional("1/2") + 
                        Optional(numberSuffix), joinString=" ", adjacent=False )
                ^ Combine(~numberSuffix + OneOrMore(~type_ + Combine(Word(alphas) + Optional("."))), joinString=" ", adjacent=False) 
                ^ Combine("Avenue" + Word(alphas), joinString=" ", adjacent=False)).setName("streetName")

# PO Box handling
acronym = lambda s : Regex(r"\.?\s*".join(s)+r"\.?")
poBoxRef = ((acronym("PO") | acronym("APO") | acronym("AFP")) + 
             Optional(CaselessLiteral("BOX"))) + Word(alphanums)("boxnumber")

# basic street address
streetReference = streetName.setResultsName("name") + Optional(type_).setResultsName("type")
direct = housenumber.setResultsName("number") + streetReference
intersection = ( streetReference.setResultsName("crossStreet") + 
                 ( '@' | Keyword("and",caseless=True)) +
                 streetReference.setResultsName("street") )
streetAddress = ( poBoxRef("street")
                  ^ direct.setResultsName("street")
                  ^ streetReference.setResultsName("street")
                  ^ intersection )

tests = """\
    3120 De la Cruz Boulevard
    100 South Street
    123 Main
    221B Baker Street
    10 Downing St
    1600 Pennsylvania Ave
    33 1/2 W 42nd St.
    454 N 38 1/2
    21A Deer Run Drive
    256K Memory Lane
    12-1/2 Lincoln
    23N W Loop South
    23 N W Loop South
    25 Main St
    2500 14th St
    12 Bennet Pkwy
    Pearl St
    Bennet Rd and Main St
    19th St
    1500 Deer Creek Lane
    186 Avenue A
    2081 N Webb Rd
    2081 N. Webb Rd
    1515 West 22nd Street
    2029 Stierlin Court
    P.O. Box 33170
    The Landmark @ One Market, Suite 200
    One Market, Suite 200
    One Market
    One Union Square
    One Union Square, Apt 22-C
    """.split("\n")

# how to add Apt, Suite, etc.
suiteRef = (
            oneOf("Suite Ste Apt Apartment Room Rm #", caseless=True) + 
            Optional(".") + 
            Word(alphanums+'-')("suitenumber"))
streetAddress = streetAddress + Optional(Suppress(',') + suiteRef("suite"))

for t in map(str.strip,tests):
    if t:
        #~ print("1234567890"*3)
        print(t)
        addr = streetAddress.parseString(t, parseAll=True)
        #~ # use this version for testing
        #~ addr = streetAddress.parseString(t)
        print("Number:", addr.street.number)
        print("Street:", addr.street.name)
        print("Type:", addr.street.type)
        if addr.street.boxnumber:
            print("Box:", addr.street.boxnumber)
        print(addr.dump())
        print()

Answer

In some addresses the primary number is spelled out as a word, as you can see from a few of the test addresses near the end of the list (One Market, One Union Square). Your statement, "If we know an address field starts with integers representing the street number...", is a big "if". Many, many addresses do not start with a number.
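The operators are not doing any bit arithmetic here. Pyparsing overloads Python's operators so that parser expressions can be combined into a grammar: + builds a sequence (And), | tries the alternatives in order and takes the first one that matches (MatchFirst), ^ tries all the alternatives and takes the longest match (Or), and ~ is a negative lookahead (NotAny). This is about readability rather than performance. You could write And([...]), MatchFirst([...]), Or([...]) and NotAny(...) explicitly and get the same result.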
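
To make the operator question concrete, here is a minimal pure-Python sketch of how a library can give +, |, ^ and ~ parser-combining meanings through operator overloading. It is not pyparsing's implementation: the class names (Parser, Lit, Seq, First, Longest, Not) and the parse() signature are hypothetical and exist only for this illustration.

```python
# Hypothetical mini combinator library; NOT pyparsing's actual code.
# parse(text, pos) returns (tokens, new_pos) on success or None on failure.

class Parser:
    def parse(self, text, pos=0):
        raise NotImplementedError

    def __add__(self, other):      # a + b : a followed by b
        return Seq(self, other)

    def __or__(self, other):       # a | b : first alternative that matches
        return First(self, other)

    def __xor__(self, other):      # a ^ b : alternative that consumes the most text
        return Longest(self, other)

    def __invert__(self):          # ~a : succeed only if a does NOT match here
        return Not(self)


class Lit(Parser):
    """Match a fixed string."""
    def __init__(self, s):
        self.s = s

    def parse(self, text, pos=0):
        if text.startswith(self.s, pos):
            return [self.s], pos + len(self.s)
        return None


class Seq(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def parse(self, text, pos=0):
        ra = self.a.parse(text, pos)
        if ra is None:
            return None
        rb = self.b.parse(text, ra[1])
        if rb is None:
            return None
        return ra[0] + rb[0], rb[1]


class First(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def parse(self, text, pos=0):
        ra = self.a.parse(text, pos)
        return ra if ra is not None else self.b.parse(text, pos)


class Longest(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def parse(self, text, pos=0):
        results = [r for r in (self.a.parse(text, pos), self.b.parse(text, pos))
                   if r is not None]
        return max(results, key=lambda r: r[1]) if results else None


class Not(Parser):
    """Negative lookahead: consumes nothing."""
    def __init__(self, a):
        self.a = a

    def parse(self, text, pos=0):
        return None if self.a.parse(text, pos) is not None else ([], pos)


first_match = Lit("St") | Lit("Street")     # stops at the first alternative that fits
longest_match = Lit("St") ^ Lit("Street")   # compares both alternatives
guarded = ~Lit("St") + Lit("Main")          # "Main", but only if not starting with "St"

print(first_match.parse("Street"))    # (['St'], 2)
print(longest_match.parse("Street"))  # (['Street'], 6)
print(guarded.parse("Main"))          # (['Main'], 4)
```

The difference between | and ^ is why the address script uses ^ for streetName and streetAddress: several alternatives can match the start of the same input, and the longest match is the one wanted. The ~numberSuffix in name works like the guarded example, rejecting a word that begins with a suffix such as "st" or "th".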

It's refreshing to see a parser that attempts to parse street addresses without using a regular expression... also see this page about some of the challenges of parsing freeform addresses.

However, it's worth noting that this parser looks like it will miss a wide variety of addresses. It doesn't seem to consider some of the special address formats common in Utah, Wisconsin, and rural areas. It also is missing a significant number of secondary designators and street suffixes.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow