Question

Consider unlabeled, rooted binary trees. We can compress such trees: whenever there are pointers to subtrees $T$ and $T'$ with $T = T'$ (interpreting $=$ as structural equality), we store (w.l.o.g.) $T$ and replace all pointers to $T'$ with pointers to $T$. See uli's answer for an example.

Give an algorithm that takes a tree in the above sense as input and computes the (minimal) number of nodes that remain after compression. The algorithm should run in time $\cal{O}(n\log n)$ (in the uniform cost model) with $n$ the number of nodes in the input.

This has been an exam question and I have not been able to come up with a nice solution, nor have I seen one.

Solution

Yes, you can perform this compression in $O(n \log n)$ time, but it is not easy :) We first make some observations and then present the algorithm. We assume the input tree is not yet compressed; this is not strictly necessary, but it simplifies the analysis.

Firstly, we characterize 'structural equality' inductively. Let $T$ and $T'$ be two (sub)trees. If $T$ and $T'$ are both the null trees (having no vertices at all), they are structurally equivalent. If $T$ and $T'$ are both not null trees, then they are structurally equivalent iff their left children are structurally equivalent and their right children are structurally equivalent. 'Structural equivalence' is the minimal fixed point over these definitions.

For example, any two leaf nodes are structurally equivalent: both their children are null trees, and null trees are structurally equivalent.
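The inductive definition translates directly into code. Here is a minimal Python sketch (the names `Node` and `equivalent` are mine; `None` plays the role of the null tree):

```python
class Node:
    """A vertex of an unlabeled rooted binary tree; None is the null tree."""
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def equivalent(t, u):
    """Structural equivalence, following the inductive definition."""
    if t is None and u is None:      # two null trees are equivalent
        return True
    if t is None or u is None:       # a null tree never equals a non-null tree
        return False
    # Otherwise: equivalent iff both pairs of children are equivalent.
    return equivalent(t.left, u.left) and equivalent(t.right, u.right)
```

Note that comparing subtrees pairwise with this direct check can cost quadratic time overall, which is exactly what the algorithm has to avoid.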

As it is rather annoying to say 'their left children are structurally equivalent and so are their right children', we will often say 'their children are structurally equivalent' and intend the same. Also note we sometimes say 'this vertex' when we mean 'the subtree rooted at this vertex'.

The above definition immediately hints at how to perform the compression: if we know the structural equivalence of all subtrees of depth at most $d$, then we can easily compute the structural equivalence of the subtrees of depth $d+1$. We have to do this computation in a smart way to avoid an $O(n^2)$ running time.

The algorithm will assign identifiers to every vertex during its execution. An identifier is a number in the set $\{ 1, 2, 3, \dots, n \}$. Identifiers are unique and never change: we therefore assume we set some (global) variable to 1 at the start of the algorithm, and every time we assign an identifier to some vertex, we assign the current value of that variable to the vertex and increment the value of that variable.

We first transform the input tree into (at most $n$) lists containing vertices of equal depth, together with a pointer to their parent. Here depth is measured bottom-up: a leaf has depth 0, and the depth of an internal vertex is one more than the maximum depth of its children (i.e., the height of its subtree). This is easily done in $O(n)$ time.

We first compress all the leaves (these are exactly the vertices in the list of depth 0) into a single vertex and assign this vertex an identifier. Compressing two vertices is done by redirecting the parent of either vertex to point to the other vertex instead.

We make two observations: firstly, any vertex has children of strictly smaller depth, and secondly, if we have performed compression on all vertices of depth smaller than $d$ (and have given them identifiers), then two vertices of depth $d$ are structurally equivalent and can be compressed iff the identifiers of their children coincide. This last observation follows from the following argument: two vertices are structurally equivalent iff their children are structurally equivalent, and after compression this means their pointers are pointing to the same children, which in turn means the identifiers of their children are equal.

We iterate through all the lists with nodes of equal depth from small depth to large depth. For every level we create a list of integer pairs, where every pair corresponds to the identifiers of the children of some vertex on that level. We have that two vertices in that level are structurally equivalent iff their corresponding integer pairs are equal. Using lexicographic ordering, we can sort these and obtain the sets of integer pairs that are equal. We compress these sets to single vertices as above and give them identifiers.
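The level-by-level procedure can be sketched in Python. This is only a sketch under the stated assumptions (uncompressed input, one vertex object per node); the names `Node` and `count_after_compression` are mine, and since the question only asks for the number of remaining nodes, I hand out identifiers without physically redirecting parent pointers:

```python
from collections import defaultdict

class Node:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def count_after_compression(root):
    """Number of vertices left after compression, via level-wise sorting."""
    if root is None:
        return 0

    # Bucket the vertices by depth, measured bottom-up (leaves at depth 0);
    # every vertex's children then lie in strictly lower buckets.
    levels = defaultdict(list)
    def depth(v):
        if v is None:
            return -1
        d = 1 + max(depth(v.left), depth(v.right))
        levels[d].append(v)
        return d
    depth(root)

    ident = {None: 0}   # identifier 0 stands for the null tree
    count = 0           # identifiers handed out = distinct subtrees so far
    for d in sorted(levels):
        # The pair of child identifiers characterizes a vertex at this depth;
        # sorting the pairs lexicographically groups equivalent vertices.
        pair = lambda v: (ident[v.left], ident[v.right])
        prev = None
        for v in sorted(levels[d], key=pair):
            if pair(v) != prev:
                count += 1          # new equivalence class: fresh identifier
                prev = pair(v)
            ident[v] = count
    return count
```

For instance, a complete tree with four leaves compresses to three vertices: one shared leaf, one shared inner vertex, and the root.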

The above observations prove that this approach works and results in the compressed tree. The total running time is $O(n)$ plus the time needed to sort the lists we create. As the total number of integer pairs we create is $n$, this gives us that the total running time is $O(n \log n)$, as required. Counting how many nodes we have left at the end of the procedure is trivial (just look at how many identifiers we have handed out).

OTHER TIPS

Compressing a non-mutable data structure so that it does not duplicate any structurally equal subterm is known as hash consing. This is an important technique in memory management in functional programming. Hash consing is a sort of systematic memoization for data structures.

We're going to hash-cons the tree and count the nodes after hash consing. Hash consing a data structure of size $n$ can always be done in $O(n \log n)$ operations; counting the number of nodes at the end is linear in the number of nodes.

I will consider trees as having the following structure (written here in Haskell syntax):

data Tree = Leaf
          | Node Tree Tree

For each constructor, we need to maintain a mapping from its possible arguments to the result of applying the constructor to these arguments. Leaves are trivial. For nodes, we maintain a finite partial map $\mathtt{nodes} : T \times T \to N$ where $T$ is the set of tree identifiers and $N$ is the set of node identifiers; $T = N \uplus \{\ell\}$ where $\ell$ is the sole leaf identifier. (In concrete terms, an identifier is a pointer to a memory block.)

We can use a logarithmic-time data structure for $\mathtt{nodes}$, such as a balanced binary search tree. Below, I'll write $\mathtt{lookup\ nodes}$ for the operation that looks up a key in the $\mathtt{nodes}$ data structure, and $\mathtt{insert\ nodes}$ for the operation that adds a value under a fresh key and returns that key.

Now we traverse the tree and add the nodes as we go along. Although I'm writing in Haskell-like pseudocode, I'll treat nodes as a global mutable variable; we'll only ever be adding to it, but the insertions need to be threaded throughout. The add function recurses on a tree, adding its subtrees to the nodes map, and returns the identifier of the root.

add Leaf = $\ell$
add (Node t1 t2) =
    let p1 = add t1
    let p2 = add t2
    case lookup nodes (p1,p2) of
      Nothing -> insert nodes (p1,p2)
      Just p -> p

The number of insert calls, which is also the final size of the nodes data structure, is the number of nodes after maximum compression. (Add one for the empty tree if needed.)
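The same hash consing can be sketched in Python, with a plain dict playing the role of the $\mathtt{nodes}$ map (hashing instead of a balanced search tree, so expected rather than worst-case logarithmic cost per operation). The `Node` class and the integer identifier scheme are my own choices; `None` stands in for `Leaf`:

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def hash_cons_count(root):
    """Count the nodes remaining after hash consing."""
    LEAF = 0      # plays the role of the sole leaf identifier
    nodes = {}    # finite partial map: (left id, right id) -> node id

    def add(t):
        # Recurse on the tree, adding subtrees to nodes; return the root's id.
        if t is None:
            return LEAF
        key = (add(t.left), add(t.right))
        if key not in nodes:
            nodes[key] = len(nodes) + 1   # fresh identifier
        return nodes[key]

    add(root)
    # The number of insertions equals the number of nodes after compression.
    return len(nodes)
```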

Here is another idea that aims at (injectively) encoding the structure of trees into numbers, rather than just labelling them arbitrarily. For that, we use that any number's prime factorisation is unique.

For our purposes, let $E$ denote an empty position in the tree, and $N(l,r)$ a node with left subtree $l$ and right subtree $r$. $N(E,E)$ would be a leaf. Now, let

$\qquad \displaystyle \begin{align*} f(E) &= 0 \\ f(N(l,r)) &= 2^{f(l)}\cdot 3^{f(r)} \end{align*}$

Using $f$, we can compute the set of all subtrees contained in a tree bottom-up; in every node, we merge the sets of encodings obtained from the children and add a new number (which can be computed in constant time from the children's encodings).
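A small Python sketch of this bottom-up computation (the names are mine; `None` encodes the empty position $E$):

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def distinct_subtrees(root):
    """Collect the prime-power encodings of all subtrees, bottom-up."""
    codes = set()

    def f(t):
        if t is None:                             # f(E) = 0
            return 0
        code = 2 ** f(t.left) * 3 ** f(t.right)   # f(N(l, r)) = 2^f(l) * 3^f(r)
        codes.add(code)
        return code

    f(root)
    return len(codes)   # number of nodes after compression
```

Note that the encodings grow extremely fast, so Python's arbitrary-precision integers keep this correct but not constant-time per node.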

The assumption that these encodings can be manipulated in constant time is a stretch on real machines; there, one would prefer something similar to Cantor's pairing function to exponentiation.

The runtime of this algorithm depends on the structure of the tree: on balanced trees it is $\cal{O}(n \log n)$ with any set implementation that supports union in linear time. For general trees, a simple analysis requires union in logarithmic time. Maybe a more sophisticated analysis can help, though; note that the usual worst-case tree, the linear list, admits $\cal{O}(n \log n)$ time here, so it is not clear what the worst case might be.

As pictures are not allowed in comments:

[figure: an example input tree and two possible compressed versions]

top left: an input tree

top right: the subtrees rooted in nodes 5 and 7 are isomorphic too.

lower left and right: the compressed trees are not uniquely defined.

Note that in this case the size of the tree has gone down from $7+5|T|$ to $6+|T|$.

Edit: I read the question as saying that $T$ and $T'$ are children of the same parent, and I took the definition of compression to be recursive as well, meaning you can compress two previously compressed subtrees. If that is not the actual question, then my answer may not work.

$O(n \log n)$ begs for a $T(n) = 2T(n/2) + cn$ divide and conquer solution. Recursively compress nodes and compute the number of descendants in each subtree after compression. Here's some python-esque pseudo code.

def Comp(T):
   if T == null:
     return 0
   leftCount = Comp(T.left)
   rightCount = Comp(T.right)
   if leftCount == rightCount and hasSameStructure(T.left, T.right):
     T.right = T.left
     return leftCount + 1
   return leftCount + rightCount + 1

Where hasSameStructure() is a function that compares two already compressed subtrees in linear time to see if they have the exact same structure. Writing a linear time recursive function that traverses each and checks if the one subtree has a left child every time the other one does etc. shouldn't be hard.

Let $n_\ell$ and $n_r$ be the sizes of the left and right subtrees respectively (after compression). Then the running time is $$ T(n) = T(n_\ell) + T(n_r) + O(1) \mbox{ if $n_\ell \neq n_r$} $$ and $$ T(n) = 2T(n/2) + O(n) \mbox{ otherwise.} $$

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange