Processing every leaf under a node in a tree efficiently

https://softwareengineering.stackexchange.com/questions/204987

29-09-2020
|

Question

Short version: In a tree (non-binary) with many levels of children, where each node can have multiple leaves, what is the best way to tally leaves that meet a certain condition given a node?

Long, convoluted version: Say you have Folders, which can contain more Folders. In production, my system has about 15 layers of folders at the deepest. Each folder may have documents.

When a user navigates to a folder, they need to know how many documents are in that folder, total. They also need to know how many of those documents have been "flagged". Documents can be flagged if, after a certain time period, they have not been updated. This is determined based on history entries - take the last history entry for the folder, if it's bigger than X, then it's flagged.

Right now I'm iterating over all folders > subfolders > etc > children to check the history entry and tally recursively. This is getting pretty slow now so I need a new approach.

I've come up with two options but want to know if there's a better approach, or which one of these is best.

option 1: Whenever adding a document or removing a document, iterate up the parent graph and increment / decrement a "DocumentCount" property that's stored in the Folder entity. This means that adding / removing take a minimal performance hit. But the problem here is that I actually need to count total number of documents, as well as total number of documents that are flagged (by checking history entries), so that means that I'll need another field and another process to call when this status changes. This introduces even more complexity as checking whether or not something is flagged is a method call, not a property.

option 2: Have a timer job run through and update the "documentCount" and "unopenedDocumentCount" properties of each folder in the system, would be lots of CPU cycles and seems wasteful.

option 3: Just thought of this - maybe iterating over every document and seeing if the document is a direct or distant child of the target node before evaluating would be better than iterating via recursion. I feel like this would be better for nodes high-up in the tree, but iterating over every child for a low-level node would be ridiculous. Hmmm...

I'm tempted to implement option 1.. what other options are there?

Solution

If you know the condition(s) you want to tally when you add the node to the tree, you can design your add routine to recur down through the tree, instead of just stepping down, and then unwind the recursions and update the tallies in the parent nodes.

Similarly, as you update nodes, you recur down to them, and then unwind the recursions and update the parents.

If you don't know the condition(s) beforehand, you're screwed. You have no choice but to traverse the tree in whatever order makes sense. From what you've said, it sounds as though whatever you're doing at a node is a function of the whatever of the left and right subtrees, which makes a postorder traverse look like the right answer.

Note: If the above does not make sense to you, read Knuth vol. 1.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange