Question

I'm designing a web app (using Google Maps) that will allow users to search for residential postal addresses in my database.

That is, users will provide addresses and I'll store them; later, other users will type in an address to see if that address is in my database.

But addresses are notoriously hard to normalize; I can't figure out how best to store/query them. (Especially since Google's Geocoder doesn't let me store the results of the geocoder.)

What's the best approach?


Solution

This is a problem that can be attacked both by lat/long matching (use R-trees for quick 2-D nearest-neighbour searches; they come as standard in MongoDB and are certainly available in PostgreSQL, among others)
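To make the lat/long side concrete, here is a minimal sketch of the question a spatial index answers: "which stored point is nearest to this query point?" The coordinates are hypothetical, and the brute-force scan below is only for illustration; an R-tree or 2dsphere index answers the same question without scanning every row.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest(query, points):
    """Brute-force nearest neighbour; a spatial index does this in
    roughly O(log n) instead of O(n)."""
    return min(points, key=lambda p: haversine_km(query[0], query[1], p[0], p[1]))

# Hypothetical stored addresses as (lat, long) pairs
stored = [(40.7484, -73.9857), (40.7527, -73.9772), (40.6892, -74.0445)]
print(nearest((40.7480, -73.9850), stored))  # → (40.7484, -73.9857)
```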

There's also text matching, described here: SO: What are ways to match street addresses in SQL Server?

There seem to be third-party products available as well: SO: I need an address matching algorithm

If you want to combine these two approaches, look for the term "data fusion", which covers a quite disparate collection of methods that essentially give higher weight to answers that are more certain, and base the final answer on the aggregated certainty.
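As a toy illustration of the idea (not any particular data-fusion algorithm), each source reports a match score and a confidence, and the aggregate is a confidence-weighted average:

```python
def fuse(scores):
    """Combine (match_score, confidence) pairs into one weighted score.
    A toy version of confidence-weighted fusion: more-certain sources
    pull the aggregate answer harder."""
    total_weight = sum(conf for _, conf in scores)
    if total_weight == 0:
        return 0.0
    return sum(score * conf for score, conf in scores) / total_weight

# Hypothetical: geocode proximity says 0.9 match with high confidence,
# fuzzy text matching says 0.4 with low confidence.
print(round(fuse([(0.9, 0.8), (0.4, 0.2)]), 3))  # → 0.8
```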

A description of some Harvard Graduate School of Design GIS research could be of interest as well: http://www.gsd.harvard.edu/gis/manual/geocoding/

There's also a list of world cities with their corresponding coordinates: http://www.maxmind.com/en/worldcities

OTHER TIPS

Here's what I'd considered:

1) Geocode the address on input, store the lat/long. When the user does a search, geocode the address and compare lat/longs to see if I have that exact lat/long in my database.

But there are problems with this.

  • Storing the results of the Google Geocoder is a violation of their terms of use.
  • There's a good reason for that; Google constantly updates their geocodes, so a given address's lat/long may change over time.
  • I'd be performing an exact comparison on floating-point numbers, which is fragile; two geocoding runs can return slightly different coordinates for the same address.
  • What about multiple apartments within a building? They'll all have the same lat/long, but they're different addresses.
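The floating-point problem above is easy to demonstrate. The coordinates here are hypothetical, but the pattern is real: two runs of a geocoder can disagree in the last decimal places, so exact equality misses a match that a small tolerance would catch.

```python
# Two geocoder runs for the same address may return very slightly
# different coordinates (hypothetical values).
a = (40.748817, -73.985428)   # result from first run
b = (40.748819, -73.985431)   # result after a data update

print(a == b)  # False: exact comparison misses the match

def close_enough(p, q, tol=1e-4):  # ~11 m of latitude
    return abs(p[0] - q[0]) < tol and abs(p[1] - q[1]) < tol

print(close_enough(a, b))  # True
```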

2) Geocode the address on input, but don't store the lat/long; store the address components, and compare those.

This seems better, but there are still problems:

  • Still violates Geocoder terms of use?
  • ... because Google might change its results. Maybe the address components are less likely to change, but they could still change as people report data errors to Google. (Certainly at least the zip code could change.)
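Comparing stored components can be sketched as reducing them to a normalized key. The field names below are hypothetical, loosely modelled on the pieces a geocoder typically returns; the point is only that trivial differences (case, whitespace) must not defeat the comparison.

```python
def normalize(components):
    """Reduce address components to a comparable key.
    Field names are hypothetical, not any geocoder's actual schema."""
    keys = ("street_number", "route", "locality", "postal_code")
    return tuple(str(components.get(k, "")).strip().lower() for k in keys)

stored = {"street_number": "1600", "route": "Amphitheatre Parkway",
          "locality": "Mountain View", "postal_code": "94043"}
query = {"street_number": "1600", "route": "amphitheatre parkway ",
         "locality": "Mountain View", "postal_code": "94043"}

print(normalize(stored) == normalize(query))  # True
```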

3) Geocode the address, store the lat/long, but don't search for the lat/long exactly. Search within a small radius around the resulting point, looking for possible matches. Compare those possible matches by address components.
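A rough sketch of that two-stage search, assuming a hypothetical row layout: first a coarse spatial filter (which a spatial index would do server-side), then a component comparison so that two apartments sharing a lat/long don't collide.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def candidates_within(lat, lon, rows, radius_m=50):
    """Step 1: coarse spatial filter around the geocoded point."""
    return [row for row in rows
            if haversine_m(lat, lon, row["lat"], row["lon"]) <= radius_m]

def match(query_components, rows):
    """Step 2: confirm by comparing address components."""
    return [r for r in rows if r["components"] == query_components]

# Two apartments in the same building: identical lat/long, different units.
rows = [
    {"lat": 40.74880, "lon": -73.98540, "components": {"number": "350", "unit": "4A"}},
    {"lat": 40.74880, "lon": -73.98540, "components": {"number": "350", "unit": "4B"}},
]
near = candidates_within(40.74882, -73.98541, rows)
print(len(near), len(match({"number": "350", "unit": "4A"}, near)))  # → 2 1
```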

This might be the best answer, except that it still violates Google's Geocoder terms of use.

4) Geocode the address on input, get the address components, but just use them to store a parsed normalized postal address in the database.

Add some hand-rolled code to split normalized addresses into even smaller fields (street name, street type, prefix, postfix, ...). When the user runs the search, run the same normalization code, then search by fields.
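A deliberately minimal sketch of such a parser, for US-style street addresses only, with a hypothetical abbreviation table. Real addresses break simple patterns quickly, which is exactly why rolling your own is painful:

```python
import re

# Tiny, incomplete abbreviation map; a real one has hundreds of entries.
STREET_TYPES = {"st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard"}

PATTERN = re.compile(
    r"^(?P<number>\d+)\s+(?:(?P<prefix>[NSEW])\s+)?(?P<name>.+?)\s+(?P<type>\w+)\.?$",
    re.IGNORECASE)

def parse(address):
    """Split an address like '123 N Main St' into normalized fields.
    Returns None when the pattern doesn't match."""
    m = PATTERN.match(address.strip())
    if not m:
        return None
    street_type = m.group("type").lower().rstrip(".")
    return {
        "number": m.group("number"),
        "prefix": (m.group("prefix") or "").upper(),
        "name": m.group("name").lower(),
        "type": STREET_TYPES.get(street_type, street_type),
    }

print(parse("123 N Main St"))
# → {'number': '123', 'prefix': 'N', 'name': 'main', 'type': 'street'}
```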

I guess this would work, but rolling my own address parser seems like a recipe for pain. It seems like it just can't possibly be right. (I can't be the first person to need to solve this problem, can I?)

You could perhaps use geocoder.us to supplement or replace your use of Google's geocoder. It does a nice job of parsing out the address components; that might help with normalization. There's also a newer version that might be worth looking at to see how it works.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow