Question

Using Ruby (I'm new to it) and regex, I'm trying to strip the street number from a street address. The easy cases are not a problem, but I need some help with addresses like this one:

'6223 1/2 S FIGUEROA ST' ==> 'S FIGUEROA ST'

Thanks for the help!!

UPDATE(s):

'6223 1/2 2ND ST' ==> '2ND ST'

and from @pesto '221B Baker Street' ==> 'Baker Street'


Solution

This will strip anything at the front of the string until it hits a letter:

street_name = address.gsub(/^[^a-zA-Z]*/, '')

If it's possible to have something like "221B Baker Street", then you have to use something more complex. This should work:

street_name = address.gsub(/^((\d[a-zA-Z])|[^a-zA-Z])*/, '')
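Note that both patterns above stop at the first letter, so for '6223 1/2 2ND ST' they leave 'ND ST' and 'D ST' respectively rather than '2ND ST'. Here is a minimal sketch of one possible alternative. It assumes the address starts with a house number (digits plus at most one trailing letter), optionally followed by a fraction such as 1/2. The names `HOUSE_NUMBER` and `strip_house_number` are made up for illustration and are not part of the answer above:

```ruby
# Hypothetical pattern: a leading house number such as "6223" or "221B",
# then an optional fraction such as "1/2". \A anchors at the start of the
# string (in Ruby, ^ anchors at the start of each line).
HOUSE_NUMBER = %r{\A\s*\d+[a-zA-Z]?\s+(?:\d+/\d+\s+)?}

# Hypothetical helper: remove the house number and return the rest.
def strip_house_number(address)
  address.sub(HOUSE_NUMBER, '')
end

[
  '6223 1/2 S FIGUEROA ST',
  '6223 1/2 2ND ST',
  '221B Baker Street'
].each do |address|
  puts strip_house_number(address)
end
```

Because only the first token (and an optional fraction) is removed, a numeric street name such as '2ND ST' is kept. Addresses that do not start with a number are returned unchanged.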

OTHER TIPS

Group matching:

.*\d\s(.*)

If you need to also take into account apartment numbers:

.*\d.*?\s(.*)

Which would take care of 123A Street Name

That should strip the numbers at the front (and the space), so long as there are no other numbers in the string. Just capture the first group, (.*).
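In Ruby, the first capture group can be read with `String#[]`. This is a minimal sketch using the second pattern; the helper name `street_after_number` is hypothetical:

```ruby
# Hypothetical helper: returns capture group 1 of the pattern above,
# or nil when the pattern does not match.
def street_after_number(address)
  address[/.*\d.*?\s(.*)/, 1]
end

puts street_after_number('6223 1/2 S FIGUEROA ST') # S FIGUEROA ST
puts street_after_number('123A Street Name')       # Street Name
puts street_after_number('221B Baker Street')      # Baker Street
```

As the caveat about other numbers suggests, '6223 1/2 2ND ST' comes back as just 'ST', because the greedy `.*` runs up to the last digit in the string.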

There's another set of Stack Overflow answers on this: Parse usable Street Address, City, State, Zip from a string

I think the Google/Yahoo decoder approach is best, but that depends on how many addresses you need to process and how often. Otherwise, the selected answer would probably be the best.

Can street names be numbers as well? E.g.

1234 45TH ST

or even

1234 45 ST

You could deal with the first case above, but the second is difficult.

I would split the address on spaces, skip any leading components that do not contain a letter and then join the remainder. I do not know Ruby, but here is a Perl example which also highlights the problem with my approach:

#!/usr/bin/perl

use strict;
use warnings;

my @addrs = (
    '6223 1/2 S FIGUEROA ST',
    '1234 45TH ST',
    '1234 45 ST',
);

for my $addr ( @addrs ) {
    my @parts = split / /, $addr;

    while ( @parts ) {
        my $part = shift @parts;
        if ( $part =~ /[A-Z]/ ) {
            print join(' ', $part, @parts), "\n";
            last;
        }
    }
}

C:\Temp> skip
S FIGUEROA ST
45TH ST
ST
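Since the question is about Ruby, here is a rough Ruby equivalent of the Perl script above. The helper name `street_from_parts` is made up, and unlike the Perl version it also treats lowercase letters as letters, which is an assumption on my part:

```ruby
# Hypothetical helper: split on whitespace, skip leading parts that contain
# no letter, and join whatever is left.
def street_from_parts(address)
  address.split.drop_while { |part| part !~ /[A-Za-z]/ }.join(' ')
end

['6223 1/2 S FIGUEROA ST', '1234 45TH ST', '1234 45 ST'].each do |address|
  puts street_from_parts(address)
end
```

It prints the same three lines as the Perl version, including the bare 'ST' for '1234 45 ST', so the problem with purely numeric street names remains.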

Ouch! Parsing an address by itself can be extremely nasty unless you're working with standardized addresses. The reason for this is that the "primary number", which is often called the house number, can be at various locations within the string, for example:

  1. RR 2 Box 15 (RR can also be Rural Route, HC, HCR, etc.)
  2. PO Box 17
  3. 12B-7A
  4. NW95E235
  5. etc.

It's not a trivial undertaking. Depending upon the needs of your application, your best bet for getting accurate information is to use an address verification web service. There are a handful of providers that offer this capability.

In the interest of full disclosure, I'm the founder of SmartyStreets. We have an address verification web service API that will validate and standardize your address to make sure it's real and allow you to get the primary/house number portion. You're more than welcome to contact me personally with questions.

/[^\d]+$/ will also match the same thing, except without using a capture group.
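A quick way to try this pattern in Ruby is sketched below; the helper name `trailing_non_digits` is hypothetical. The match is the trailing run of non-digit characters, so it includes the space after the last digit and needs a `strip`:

```ruby
# Hypothetical helper: returns the trailing run of non-digit characters,
# or nil when the string ends in a digit.
def trailing_non_digits(address)
  address[/[^\d]+$/]
end

p trailing_non_digits('6223 1/2 S FIGUEROA ST')       # " S FIGUEROA ST"
p trailing_non_digits('6223 1/2 S FIGUEROA ST').strip # "S FIGUEROA ST"
p trailing_non_digits('6223 1/2 2ND ST')              # "ND ST"
```

As the last line shows, a street name that itself contains a digit is cut off after that digit.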

For future reference, a great tool to help with regex is http://www.rubular.com/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow