Range sum query - tree representation efficiency

https://cs.stackexchange.com/questions/130547

29-09-2020
|

Pergunta

I was reading about possible solutions to the well known problem:

Given array A with length N create a structure that enables

Answering what is the sum $\sum_{k=i}^{j} A[k]$

Updating $A[k]$

I've seen most solutions use binary index tree but was curios whether it's possible to just use a regular tree that is built using similar qualities.

So given $A = [5, 4, 7, 9, 1]$

I try to construct a tree by creating a tree node for each value that has a start and end (which are just the index in the beginning.

To construct the tree I push all the starting node into a queue $Q$

while not Q.empty():
  next <- []
  for i in range(Q.size()):
    f <- Q.front()
    Q.pop()
    if Q.empty():
      if marker:
        parent <- make_parent(f, marker)
        next.push(parent)
      else:
        marker <- f
    else:
      f2 <- Q.front()
      Q.pop()
      parent <- make_parent(f, marker)
      next.push(parent)
   
  for n in next:
     Q.push(n)

After the this ends marker will hold the root

(I have working c++ code but I tried to provide something more abstract and simple)

and to get a sum of range I perform the following (assuming I have an array Nodes that holds all the leaves) and that the query starts with the root of the tree that we created above

sumRangeNode(int i, int j, Node* n) 
  if i == j 
    return Nodes[i]
  if n == null 
    return 0
  if j < n->start || i > n->end 
    return 0;
                
  if i <= n->start && j >= n->end 
    return n->val
        
  return sumRangeNode(i, j, n->left) + sumRangeNode(i, j, n->right)

The question is does it still have the $\log(N)$ complexity, I've tried to reason about it but struggled with:

The fact that I might be building a tree with "stragglers" like the $1$ in the example
The fact that I recursively explore right and left. Intuition tells me that because there are "enough" cases where the descent is stopped it's OK but couldn't find a way to formalize/prove it.

Solução

This approach does work. A "Binary-Indexed-Tree" is just an implicit balanced binary tree with some extra "compression". See this answer for the details on how and why this is done, but essentially the purpose is just to run in fewer cycles.

Your approach works fine asymptotically and will probably even perform fine in practice.

First, is the resulting tree produced by your queue always guaranteed to be "short"? The answer is yes. The easiest way to see this is notice that your queue's use is naturally into "epochs", where each epoch builds roughly one level of the tree.

First, as a simplification, don't bother tracking the marked separately from the others. Instead, just repeat the loop until Q has only one element, and then let that be your root. This simplifies your state and makes it easier to follow.

while q.size() > 1:
   next = []
   while q.size() > 0:
       a := q.pop()
       if q.empty():
           next.push(a)
       else:
           b := q.pop()
           next.push(make_parent(a, b))
   for v in next:
       q.push(v)
root := q.pop()

If the queue contains all $n$ before an iteration, then afterwards it will contain either $\lfloor n / 2 \rfloor$ or $\lfloor n / 2 \rfloor + 1$ nodes depending on whether $n$ is is even or odd.

In addition, the maximum height of the nodes in the array will increase by at most one, since each node is only paired with one other (a careful analysis will show it always increases by exactly one, but we don't actually care).

The first observation tells us that the loop will run for $O(\log n)$ iterations. The second therefore tells us that the resulting tree will have height at most $O(\log n)$ since the height of the trees in the queue increase by at most 1 every iteration.

So the tree may not be perfectly "balanced", but its height is still bounded by $O(\log n)$ and thus any operation that runs in time bounded by the tree's height will run in $O(\log n)$ time.

The fact that I recursively explore right and left. Intuition tells me that because there are "enough" cases where the descent is stopped it's OK but couldn't find a way to formalize/prove it.

This is actually straightforward. Suppose your tree has nodes in some levels that look the below, where the [--...--] indicate the range of values in the original array that the node represents.

Corresponding queries will be drawn with (...) instead. There are (generically) only a few kinds to worry about:


    [---------------45-----------------]
    [------------30----------][---15---]
  (- - Q1 - -)
           (- - Q2 - -)
                        (- - Q3 - -)
                        (- - - - - Q4 - - - -)
  (- - - - - - - - Q5 - - - - - - - - - - -)

Note that Q2 and Q4 have a symmetric forms that I'm omitting.

Q1 only traverses the left half
Q2 only traverses the left half
Q3 traverses both halves
Q4 traverses both halves
Q5 covers the entire range, so it won't traverse either

Thus we only really need to analyze Q3 and Q4.

Q4 totally covers the right half, so that "traversal" is guaranteed to be constant. So only the left half is traverses to any non-constant depth.

Thus the only "interesting" case is the Q3-type query, which only partially crosses the left child range and the right child range.

Note that the type-Q3 query extends to the left edge of the right child, and to the right edge of the left child.

This means that in the subtrees, there will be no type-Q3 queries, which can't extend to either edge. All queries will either completely miss each node, completely cover them, or extend past an edge.

So there will be at most one type-Q3 query in the entire tree (note that type Q5 also check both children, but that they are "fast").

Since there can only be one of them, you can essentially ignore them in the analysis - at most $O(\log n) + O(\log n) = O(\log n)$ nodes will be visited.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange