Question

In RavenDb, I have to store hierarchical data and I need to query it recursively. The performance is the biggest concern here.

What I have is similar to the following one:

public class Category
{
    public int Id { get; set; }
    public string Name { get; set; }
    public Category Parent { get; set; }
}

In this case, if I store the parent category inside the document itself, it will hard for me to manage the data as I will duplicating the categories all over the place.

So, to make that easy, I can store this as below:

public class Category
{
    public int Id { get; set; }
    public int? ParentId { get; set; }
    public string Name { get; set; }
}

But in that case I'm not sure how the performance will be here as I will have millions of records and I need to create the category tree from this reference.

Is there a certain decision in RavenDb on how to model this type of data when the performance is the biggest concern?

Was it helpful?

Solution

Hierarchies are usually best modeled in one document that defines the hierarchy. In your situation that would be to define the categories tree, where the categories themselves can be represented by standalone documents (and thus hold Name, Description etc, and allow for other collection to reference them), or not.

Modeled from code a Category document would look something like this:

public class Category
{
    public string Id { get; set; }
    public string Name { get; set; }
    // other meta-data that you want to store per category, like image etc
}

And the hierarchy tree document can be serialized from a class like the following, where this class can have methods for making nodes in it easily accessible:

public class CategoriesHierarchyTree
{
    public class Node
    {
       public string CategoryId { get; set; }
       public List<Node> Children { get; set; }
    }

    public List<Node> RootCategories { get; private set; }

    // various methods for looking up and updating tree structure
}

This approach of hierarchy-tree has several important advantages:

  1. One transactional scope - when the tree changes, the tree changes in one transaction, always. You cannot get affected by multiple concurrent changes to the tree since you can leverage optimistic concurrency when editing this one document. Using the approach you propose it is impossible to guarantee that therefore harder to guarantee the completeness and correctness of the hierarchy tree over time. If you think of a hierarchy as a tree, it actually makes a lot of sense to have each change lock the entire tree until it completes. The hierarchy tree is one entity.
  2. Caching - the entire hierarchy can be quickly and efficiently cached, even using aggressive caching which will minimize the times the server is accessed with queries on hierarchy.
  3. All operations are done entirely in-memory - since its one document, aka object, all queries on the hierarchy (whose the parent of, list of children etc) are made entirely in-memory and effectively cost close to nothing to perform. Using an index with Recurse() to answer such queries is order of magnitude costlier (network costs and computational). You mention performance is the biggest concern - so this is a winner.
  4. Multiple parents per category, no denormalization - if a category document is saved outside the hierarchy tree, like demonstrated above, you can effectively put a category under multiple parents without the need to denormalize. All category data is in one place, in a document outside of the tree, and the tree only holds a reference to the category.

I will highly recommend going with this approach. It is a bit of a shift from the relational mindset, but its so worth it, even when the tree grows big.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top