Question

I want to implement a taxonomy structure (geo terms) for my node.js application with NoSQL database. I had a similar taxonomy structure with MySQL but it's time to move forward and learn something new so I decided to try a different approach and use NoSQL (document-oriented) for my test app. The taxonomy structure is simple - there're five different levels: country (i.e. United Kingdom) → region (England) → county (Merseyside) → city/town/village (Liverpool) → part of the city (Toxteth).

The obvious choice is to use a tree structure but the devil is in the detail - historically some cities and towns belonged to other counties. The idea was to tag persons who were born in certain cities or towns with those terms and to filter them later by geo tags so I have to respect the fact Liverpool or Manchester (among others) were part of Lancashire at the time some people were born. Otherwise the result any user gets with my geo filter will be incorrect.

Example: John Doe was born in Blackburn (Lancashire) back in 1957. Paul Brown was born in 1960 in Liverpool (Lancashire, now Merseyside). Georgia Doe (nee Jones) was born in Wirral (Cheshire, now Merseyside) 5 years later. Their son Ringo was born in Liverpool (Merseyside by that time) in 1982.

John is Lancastrian by birth, Paul is Lancastrian and Merseysider, Georgia is from Cheshire and Merseyside at the same time, Ringo is from Merseyside. So they should be categorized accordingly when I search by county. But with simple one-to-many structure that follows modern structure of the country they'll never be filtered as they should be.

How to implement the collection respecting the complexity of its structure with NoSQL (first of all document-oriented) solutions? I googled it and did some research over stack* but still had no clue what to do next with it. There's a few possible ways to solve it in my opinion:

  1. Use SQL-like data structure:

    {
        {'name': 'United Kingdom', 'unique_id': 1},
        {'name': 'England', 'unique_id': 2, 'parents': [1]},
        {'name': 'Merseyside', 'unique_id': 3, 'parents': [2]},
        {'name': 'Lancashire', 'unique_id': 4, 'parents': [2]},
        {'name': 'Liverpool', 'unique_id': 5, 'parents': [3, 4]},
    }
    
  2. Use tree structure with some references:

    {    
        {'name': 'United Kingdom', 'unique_id': 1
            {'name': 'England', 'unique_id': 2]
                {'name': 'Merseyside', 'unique_id': 3]
                    {'name': 'Liverpool', 'unique_id': 5, 'alternate_parents': [4]},
                },
                {'name': 'Lancashire', 'unique_id': 4},
            },
        },
    }
    
  3. Use tree structure with no references (one-to-many) and add "alternate parent" tag to a document manually:

    {    
        {'name': 'United Kingdom', 'unique_id': 1
            {'name': 'England', 'unique_id': 2]
                {'name': 'Merseyside', 'unique_id': 3]
                    {'name': 'Liverpool', 'unique_id': 5},
                },
                {'name': 'Lancashire', 'unique_id': 4},
            },
        },
    }
    
  4. Stick with SQL.

  5. Try to implement database-less taxonomy.

Give me advice on that matter please. I'm a newby with any NoSQL (currently I've designed no such databases) so there's a real design issue for me.

And I'm new to stack* so feel free to correct me if I did anything wrong with this post :) Thank you!

EDIT I've chosen @Jonathan answer as a solution. I think it suits better for my needs (there'll be other documents to store in my database and tag them with those terms) especially with mapReduce functionality suggested by @Valentyn.

But if there's no document collections needed for your app a graph database (based on relationships not documents) suggested by @Philipp is probably the best solution possible.

Was it helpful?

Solution 2

Firstly, picking between NoSQL and a SQL database is hard if you're not familiar with the basic principles. If this is the only data you are storing, go with a relational (SQL). If there is more data (which I assume) and it requires more of a interwoven schema, stick with NoSQL hands down.

I would take the relational route on this to keep it from getting too complex... start several collections; one for countries, region and so on. Don't get discouraged from doing relational (SQL) type schemas in a NoSQL database; most of the time they are the best solution.

Then, in each of the sub-groups, have a field which names the parent.

For example:

{
    {'name': 'United Kingdom'},
    {'name': 'United States'}
}

{
    {'name': 'England', 'parent': 'United Kingdom'},
    {'name': 'California', 'parent': 'United States'}
}

That way, your data-set doesn't get so nested that the returned data is unmanageable. Then you can grab the countries and the corresponding regions... etc with ease.

Best of luck!

EDIT: Answering OP's questions:

(Firstly, I'd recommend MongoDB - it's a great solution all around.)

  1. Because when you start working with MongoDB, you'll realize that it stores data side by side on the hard drive. If you edit a huge record like that, it will most likely be pushed to the back of the disk, making your hard drive similar to Swiss cheese. Once you get to that point, you'll have to do a repair to condense it once more. Also, this way the data is more easily separated in your application, that way, if you need to do something with the data, you won't have to apply it to the entire object. I am assuming that you will have a large dataset since there are many different locations in the world.

  2. Don't worry too much about that kind of thing. You can use ID's for the parent and match the children with the ID if you plan on changing names a lot. I just did it this way because I assumed you wouldn't need to change a location database.

  3. Rather than an array, I would use a nested document to store multiple parents. That way, it can be more easily queried and indexed. I would use the following method:

    {
        {
            'name': 'England,
            'parent': {
                1: 1,
                568: 1
            }
         }
     }
    

So that way you can employ your idea of indexes and find where db.region.$.568 = 1

OTHER TIPS

Because of a comment you made, I assume that you mean "MongoDB" when you say "NoSQL". There are a lot of other database technologies commonly referred to as NoSQL which are completely different, but this one seems to be the one you mean.

  1. is not a good idea, because to get the whole taxonomy chain you will need to do multiple database queries, which should generally be avoided.

  2. and 3. A single document which is a huge tree is not a good idea either, because MongoDB has a limit of 16MB per document. When you create huge, monolithic documents, you might hit that limit.

I think that MongoDB might not be the best solution for your use-case. Did you consider using a graph database? MongoDB is optimized for self-contained documents which stand on their own. But the focus of graph databases is on datasets where you have a lot of entities which are defined by their relations to other entities. This looks a lot like your use-case.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top