How can two hierarchies be efficiently merged in SQL Server?

https://stackoverflow.com/questions/7058306

23-12-2020
|

题

I have two tables with hierarchyid fields, one of which is a staging table with new data that needs to be merged into the other (that is, a set of nodes that need to be added to the main tree, some of which might already be there).

In addition to the hierarchyid column that defines the tree structure (parent/child relationships). each table has a separate column which contains a node identifier that uniquely identifies each node. That is, the way to tell if a node from the staging table is already in the main table is via the node ID, not via the hierarchyid columns.

Imperatively, the processing that needs to be performed would look something like this:

For each row, RS, in the staging table:
    If there is not already a row with the same Id as RS in the main table:
         Find the parent, PS, of the staging row
         Find the row, PM, in the main table that has the same node ID as PS
         Create a new child, RM of row PM
         Set PM's ID equal to the ID of RS

Importantly, this approach will only work if the tree in the staging table is sorted/traversed in breadth-first order - this is so that when RS is encountered, it is guaranteed that its parent PS already has a corresponding row in the main table.

So far the only way I can see to achieve this in SQL server is to use a cursor over the staging table (which is already sorted) and call a stored procedure for each row that essentially does exactly what is described above, complete with a SELECT MAX() to find the highest hierarchyid that already exists as a child of PM, so that the child can be added uniquely.

This is an awfully inefficient approach, though, and waaaay too slow for my purposes. Is there any better way?

For background, this is kind of a feasibility check I'm doing. I need to figure out whether I can perform this operation quickly inside SQL Server. If it turns out I can't I will have to do it another way, outside the database. The merging of the trees is inherent to (actually, in a sense kind of is) the problem domain, so structuring the data differently or taking a wider view and trying to somehow avoid performing this operation altogether is not an option.

Update

As requested, here's a concrete example.

Tables "staging" and "main" both have the same two columns:

   hierarchy_id of type hierarchyid
   node_id of type bigint

Initial contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4

staging:

 hierarchy_id    node_id
 /1/             1
 /1/1/           3
 /1/2/           5
 /1/1/1/         6

Desired contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4
 /1/4/           5
 /1/2/1/         6

Note that the node in the staging table with hierarchy_id /1/1/ corresponds to that with hiearchy_id /1/2/ in the target table (which is why the node_id is important - can't just copy hierarchy_id values across). Also note that the new node with node_id 6 is added as a child of the correct parent, the one with node_id 3, which is why the hierarchy_id is important - it defines the tree structure (parent/child relationships) for any new nodes. Any solution needs to take account of both aspects.

解决方案

Modeling your hierarchy in this way will lead to problems. The hierarchy_id column violates first normal form and the process for merging is going to be prone to update anomalies if you don't serialize/bottleneck access.

You should consider a table with just node_id and parent_id, see how it trivializes your merge problem

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

You would use recursive queries with this and you might be surprised how efficient the execution plans turn out. If you must have the flattened hierarchy column you can probably create an indexed view with a recursive query.

其他提示

We have been working on a product that required a similar solution. After a lot of research on this and other approaches, we concluded that hierarchyID method is not for us.

So as a direct answer to your question: There is no better way to do this using this approach.

Take a look at Nested Set Models and at Adjacency List Model.

Both of these are far more elegant and efficient solutions to this particular design challenge.

Edit: As an after-thought, in-case you are not married to SQL - this problem can be much better solved using a non-relational database. We could not go that way since no one has enough expertise in designing non-relational databases, but in case SQL is optional then you can use your current approach in a much nicer, more efficient way in MongoDB for example.

Here is a solution that moves the rows from source @S to target @T one level at a time. To simplify a bit I added a root node just to always have a parent present that is used when creating the new HierarcyID.

I have never used HierarchyID so there might absolutely be more efficient ways to do this but it should at least be more efficient than doing it one row at a time.

-- Target table
declare @T table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @T values
('/',             0), -- Needed for simplicity
('/1/',           1),
('/1/1/',         2),
('/1/2/',         3),
('/1/3/',         4)

-- Source table
declare @S table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @S values
('/',               0),
('/1/',             1),
('/1/1/',           3),
('/1/2/',           5),
('/1/1/1/',         6)

declare @lvl int = 1

-- Move rows from @S to @T for each level
while exists(select *
             from @S
             where hierarchy_id.GetLevel() = @lvl)
begin

  insert into @T
  select T.hierarchy_id.GetDescendant(C.MaxID, null),
         S.node_id
  from (select S1.node_id, S2.node_id as ParentID
        from @S as S1
          inner join @S as S2
            on S1.hierarchy_id.GetAncestor(1) = S2.hierarchy_id
        where S1.hierarchy_id.GetLevel() = @lvl and
              S1.node_id not in (select node_id
                                 from @T)
       ) as S
    inner join @T as T
      on S.ParentID = T.node_id
    outer apply (select max(hierarchy_id) as MaxID
                 from @T as T2
                 where T.hierarchy_id = T2.hierarchy_id.GetAncestor(1)) as C       

    set @lvl = @lvl + 1
end

select *, hierarchy_id.ToString()
from @T
where hierarchy_id <> hierarchyid::GetRoot()

Result:

hierarchy_id  node_id  (No column name)
------------  -------  ----------------
0x58          1        /1/
0x5AC0        2        /1/1/
0x5B40        3        /1/2/
0x5B56        6        /1/2/1/
0x5BC0        4        /1/3/
0x5C20        5        /1/4/

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow