Retrieving an incorrectly stored tree data structure

https://softwareengineering.stackexchange.com/questions/307483

11-12-2020
|

Question

Way back in the hazy origins of our platform it was decided that we were going to need some hierarchical data structures stored in the RDBMS. The relationships between nodes were stored via a 'parent_id' column that referenced another row in the same table. Although at face value this may seem slightly sensible, the reality has turned out very differently.

I have now been tasked with implementing some functionality that requires traversing the hierarchy - Specifically, I need to build a list of all of the descendents of a node for a given node. This would be trivial if the relationships were stored as 'parent -> children' but as the relationships are stored as 'child -> (unique) parent' I can't figure out a performant way of doing this (the naive approach is O(n^2) I believe).

Having wrestled with this for a few hours, the current thinking is that we should refactor the database but if anyone has experienced this before it'd be good to hear about your solution. If not, let this be a warning to anyone trying to save a tree in this fashion if you ever plan on having to traverse it!

Solution

It's not necessarily as bad as you think and may even be the most optimal representation if you have an index on parent_id.

Let's say you're interested in the descendants of B above. In that case, you would first query for nodes which have parent_id matching the ID of B. That gives you D,E. Now query for nodes which have the parent ID of either of these. That gives you H,I,J. Do it once more for H,I,J, we get nothing, and so we're done.

Assuming the queries are in constant time (hash index, e.g.), that's actually optimal from an algorithmic complexity standpoint (worst-case having linear complexity). Each query to find the child nodes of a parent node in that case would be O(1). Even if the queries are logarithmic, it's actually optimal in terms of the number of queries (logN) unless you actually memoized the entire list of descendants in each node (which would be explosive in terms of memory use) and definitely not as bad as quadratic complexity as you think.

For an RDBMS, it actually seems very optimal to me to store things in this bottom-up fashion, especially if the tree is n-ary and not merely binary. Don't worry, be happy. I think the only thing storing child IDs in the nodes would save you is the need for an index, but it would amount to the same number of queries and pretty much the same amount of work to get a list of all descendants.

If you need more speed, and the tree fits, you can transfer the tree to memory, linearly building the tree out of nodes in the memory of the client and then just do the search in memory, updating it when the database is modified. But you might not even need that.

OTHER TIPS

With an index on parent_id, I actually think this is a good way to organize the relationship, and we have been using it that way in many places. Also, I see no obvious better way, as the relationship is 1:N, so if you try to store parent->child, you need an extra table, and you end up with the same logic again.

The core point is having the index on parent_id; with that, access is fast and easy, and you could build a real tree in memory if you need it continuously.

There are a number of ways to store a hierarchy in a relational database. These include

Adjacency List. This is your current implementation.
Path Enumeration.
Nested Set
Closure Table

Each has a different insert, delete, neighbour-read and set-read characteristics. The adjacency list has the lowest overhead for rapidly changing hierarchies. It is less good for enumerating sub trees, however. The others pre-calculate each node's descendents in different ways, which makes lookup fast at the cost of additional write amplification.

Depending on your read / write ratio one of these may help.

I have found Joe Celko's book "Trees and Hierarchies in SQL for Smarties" to be most readable and informative.

There's a discussion on Stackoverflow which discusses this widely.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange