How can I parse an address into its individual components?

Question 1

Okay, so the Address field was quite painful to parse. However, I did manage to parse the data based on my particular requirements.

The Address always has a <br> between the Street & City.

So I did the following:

var splitBasedOnHTML = Regex.Split(column[2], @"\b<br>");

The column list holds my address at index two. After that call, the Unit and Street end up at index zero of the result, and the City, State, and Zip at index one.

So I did another split to break up the City, State, and Zip, like this:

var splitBasedOnSpace = splitBasedOnHTML[1].Split(' ');

After that, I end up with the following:

6313 SW 203rd Ave // splitBasedOnHTML[0]
Portland, // splitBasedOnSpace[0]
OR // splitBasedOnSpace[1]
97224 // splitBasedOnSpace[2]

So I simply mapped my properties to those individual array indexes.

This solution assumes that the Unit is a part of the Street, which was an acceptable sacrifice because the data is being imported into another website and can be corrected by particular people later on.

That is how I solved the parsing issue. This solution may not be viable for everyone in this boat, but hopefully it is a nice alternative or points in a good direction. Here is what the method looks like:

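    // Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
    public static Map AddressMapper(IList<string> column)
    {
        var map = new Map();

        // Split at the <br> that separates Street from City.
        var splitBasedOnHTML = Regex.Split(column[2], @"\b<br>");

        // Assumes a single-word city, e.g. "Portland, OR 97224".
        var splitBasedOnSpace = splitBasedOnHTML[1].Split(' ');

        map.Street = splitBasedOnHTML[0];
        map.City = splitBasedOnSpace[0].TrimEnd(','); // drop the trailing comma
        map.State = splitBasedOnSpace[1];
        map.Zip = splitBasedOnSpace[2];

        return map;
    }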

Question 2

Parsing addresses is hard; really hard. There is no truly uniform format for addresses, especially across country borders. It's highly unlikely that you will be able to do this using a single regex.

See this other post for a few examples and a more in-depth explanation: How to parse freeform street/postal address out of text, and into components
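
To make the point concrete, here is a small sketch in JavaScript. The pattern `US_STYLE` and the sample strings are illustrative assumptions (the German address is a placeholder), not something taken from the linked post. It only shows how quickly a pattern written for one layout stops matching once the layout changes.

```javascript
// Hypothetical pattern for "number street, city, ST zip" (US-style only).
const US_STYLE = /^(\d+)\s+(.+?),\s*(.+?),\s*([A-Z]{2})\s+(\d{5})$/;

// Illustrative samples in three different national layouts.
const samples = [
  '6313 SW 203rd Ave, Portland, OR 97224', // fits the pattern
  '10 Downing Street, London SW1A 2AA',    // UK: no state, alphanumeric postcode
  'Musterstraße 12, 10115 Berlin',         // Germany: number after street, postcode before city
];

// true where the pattern matches, false where it does not.
const results = samples.map((s) => US_STYLE.test(s));
console.log(results);
```

Only the first sample matches. Each additional layout would need its own pattern, which is why a single regex rarely survives contact with real-world data.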

Question 3

There's a limit to what can be done with regular expressions. However, here's an example that assumes your addresses always respect this format. If you cannot ensure that your addresses will respect a specific format (enforced by your domain), you will have to rely on a more complex solution, like what's discussed in the other answer.

Also have a look at: Parse usable Street Address, City, State, Zip from a string

EDIT: I'm sorry, I forgot this was a C# question... but you get the picture.

var parseAddress = (function (rx) {
  return function parseAddress(html) { 
      var matches = html.match(rx);
      return {
          unit: matches[1],
          street: matches[2],
          city: matches[3],
          state: matches[4],
          zip: matches[5]
      };
  };
})(/^(\d*)\s*(.+?)\s*<br>\s*(.+?),\s*(.+?)\s*(\d+)$/);

parseAddress('6313 SW 203rd Ave <br> Portland, OR 97224');
//Object {unit: "6313", street: "SW 203rd Ave", city: "Portland", state: "OR", zip: "97224"}
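
One caveat with the snippet above: `String.prototype.match` returns `null` when the input does not fit the pattern, so reading `matches[1]` would throw. Below is a minimal sketch of a guarded variant, assuming the same `<br>`-separated format. The names `ADDRESS_RX` and `tryParseAddress` are hypothetical, and the named capture groups require an ES2018-capable engine.

```javascript
// Same pattern as above, with named groups for readability.
const ADDRESS_RX =
  /^(?<unit>\d*)\s*(?<street>.+?)\s*<br>\s*(?<city>.+?),\s*(?<state>.+?)\s*(?<zip>\d+)$/;

// Hypothetical helper: returns null instead of throwing on unexpected input.
function tryParseAddress(html) {
  const m = ADDRESS_RX.exec(html.trim());
  if (m === null) {
    return null; // input does not follow the assumed format
  }
  const { unit, street, city, state, zip } = m.groups;
  return { unit, street, city, state, zip };
}

console.log(tryParseAddress('6313 SW 203rd Ave <br> Portland, OR 97224'));
console.log(tryParseAddress('no line break here')); // null
```

Returning `null` lets the caller decide what to do with rows that do not parse (skip them, log them, or queue them for manual review) instead of aborting the whole import.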

Question 4

If you strip the HTML tags first, the powerful open-source library libpostal fits this use case very nicely. Libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data, and there are bindings for different programming languages. The goal of the project is to understand location-based strings in every language, everywhere.

I have created a simple Docker image with the Python binding, pypostal, that you can spin up and try very easily: pypostal-docker