Optimizing National ZIP+4 database for fast address lookup

https://stackoverflow.com/questions/17091323

31-05-2022
|

Domanda

I've just obtained a large set of text files (8 GB total) containing all of the address ranges within the U.S. The set consists of:

929 ZIP+4 files, each containing postal addresses of unique three-digit zip code. For example, file 606 would only contain addresses that have a five-digit zip code that begins with 606. The total number of records among these files are approximately 30 million.
City State file, containing a comprehensive list of zip codes and their corresponding city and state.

The City State Key can be used to join the City State file to the ZIP+4 files.

Given the size of the database and my lack of experience, I wanted to get some insight before beginning this endeavor. Should the ZIP+4 files be merged into one monster file and then indexed using zip code, or left separated by three-digit zip code so that the three-digit zip code file name can be used as a block matching criteria? If it is the latter, then wouldn't this be a hierarchical database model? Can I accommodate relationships with the City State file using a hierarchical model?

The above description of the data set is a vast simplification, but for the purposes of this question, a detailed description is unnecessary. A complete description can be found here.

I'm using Python and have not decided on an RDBMS yet. Any help would be much appreciated!

Soluzione

If you're going to use a RDBMS, you will eventually have the contents of all 929 files in one database, most probably in more than table. I cannot tell you much more about the design of such a database since you don't provide enough detail about the contents of each of those files. The exact layout will be a normalized form of your 30 million rows in probably a handful of tables. The performance of modern RDBMS is good enough to handle data of that scale if (and only if) your indexes are properly set.

There is very little reason not to put that data into a RDBMS. The only reason I could think of is to eliminate the need of such a subsystem entirely, e.g. to simplify deployment of your solution. If you actually consider doing that then yes, a set of 929 files could act as a hierarchical database. The main difference to a RDBMS solution is that with such a set of flat files you can only reasonably query your data by one key - that being your zip code (or any part thereof).

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow