How to model data for a CouchDB geocoder

https://stackoverflow.com/questions/22483663

16-06-2023

Question

I am working on a CouchDB-based geocoding application using a large national dataset that is supplied relationally. There are some 250 million records split over 9 tables (the ER diagram can be viewed at http://bit.ly/1dlgZBt). I am quite new to NoSQL document databases, and CouchDB in particular, and am considering how to model this. I have currently loaded the data into a CouchDB database per table with a type field indicating which kind of record it is. The _id attribute is set to the primary key for tables [A] and [C]; for everything else it is auto-generated by Couch. I plan on setting up Lucene with Couch for indexing and full-text search. The X and Y point coordinates are all stored in table [A], but to find these I will need to search using data in [Table E], [Tables B, C & D combined] and/or [Table I], with the option of filtering results based on data in [Table F].

My original intention was to create a single CouchDB database which would combine all of these tables into a single structure with [Table A] as the root and all related tables nested under this. I would then build my various search indexes on this and also set up a spatial index using GeoCouch for reverse geocoding. However, I have read articles that suggest view collation as an alternative approach.

An important factor here I guess is reads vs writes. The plan is that this data will never be updated, only read. Data is released every quarter at which time the existing DB would be blown away and a new DB created.

I would welcome any suggestions for how best to set up and organise this from any experienced Couch or related document database users.

Many thanks in advance for any assistance.

Solution

guygrange,

While I am far from an expert in document database design, the key thing to recognize about document DBs is that queries are fast when all of the necessary information is kept in a single document. Hence, you need to look at your queries and how you expect to access this data. For example, I can easily imagine that a geocoding application does not need everything from each table for its most frequent queries. So, to save on bandwidth, you would make a main document that holds the information you most frequently care about, along with a key for the rest of the appropriate data. Then you could fetch the remaining data with that key and merge the dictionaries for easy management in your client code.
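To make the main-document-plus-key idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the field names (`label`, `details_id`, `classification`), the ID scheme, and the sample values are hypothetical and not taken from your dataset, and a plain dict stands in for the CouchDB database so the snippet runs on its own.

```python
# Hypothetical documents: IDs, field names and values are illustrative only.
# A plain dict stands in for the CouchDB database so the sketch is self-contained.
db = {
    "addr:1001": {
        "_id": "addr:1001",
        "type": "address",
        "x": 530000.0,
        "y": 180000.0,
        "label": "1 Example Street, Exampletown",
        # Key pointing at the rarely needed remainder of the record.
        "details_id": "addr:1001:details",
    },
    "addr:1001:details": {
        "_id": "addr:1001:details",
        "type": "address_details",
        "classification": "residential",
        "organisation": None,
    },
}


def fetch_main(doc_id):
    """Return the small document that serves the most frequent queries."""
    return db[doc_id]


def fetch_full(doc_id):
    """Fetch the main document, follow its key, and merge the two dicts."""
    main = fetch_main(doc_id)
    details = db[main["details_id"]]
    # Main-document fields win on conflict, so _id and type stay those of the main doc.
    return {**details, **main}


if __name__ == "__main__":
    print(fetch_main("addr:1001")["label"])
    print(fetch_full("addr:1001")["classification"])
```

Against a real CouchDB, each `db[...]` lookup would become a GET by document ID, so the frequent path costs one small request and only the occasional full lookup pays for the second fetch.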

Anon, Andrew

Licensed under: CC-BY-SA with attribution
