Find whether a tree is a subtree of other

https://stackoverflow.com/questions/1017821

06-07-2019
|

Question

There are two binary trees T1 and T2 which store character data, duplicates allowed.
How can I find whether T2 is a subtree of T1 ? .
T1 has millions of nodes and T2 has hundreds of nodes.

Solution

Traverse T1. If the current node is equal to the root node of T2, traverse both of the trees (T2 and the current subtree of T1) at the same time. Compare the current node. If they are always equal, T2 is a subtree of T1.

OTHER TIPS

I suggest an algorithm, that uses preprocessing:

1) Pre-process the T1 tree, keeping the list of all possible subtree roots (the cache list will have a millions of entries);

2) Sort the cached list of roots by the descending order of data, kept in root. You may choose more elegant sorting function, for example, parse a character tree into string.

3) Use binary search to locate the necessary sub-tree. You may quickly reject subtrees, with the number of nodes, not equal to the T2 number of nodes (or with the different depth).

Note that steps 1) and 2) must be done only once, then you can test many sub-tree candidates, using the same preprocessed result.

If your trees are not sorted in any way, I don't see any other way than to do a brute-force search: walk through tree T1 and check to see if you reach a node which matches the first node of tree T2. If not, continue traversing T1. If so, check if the next nodes match, until you find the end of T2, in which case you have a hit: your tree T2 is indeed a subtree of T1.

If you know the depth of every single node of T1 (from leaf to node), you could skip any nodes which are not as deep as the subtree you are looking for. This could help you eliminate a lot of needless comparisons. Say that T1 and T2 are well balanced, then tree T1 will have a total depth of 20 (2**20 > 1,000,000) and tree T2 will have a depth of 7 (2**7 > 100). You'll just have to walk the 13 first layers of T1 (8192 nodes -- or is that 14 layers and 16384 nodes?) and will be able to skip about 90% of T1...

However, if by subtree you mean that leaf nodes of T2 are also leaf nodes of T1, then you could do a first traversal of T1 and compute the depth of every node (distance from leaf to node) and then only check the subtrees which have the same depth as T2.

If you are memory/storage bound (i.e. can't pre-manipulate and store the trees in alternate forms), you probably won't be able to do anything better than the brute force search suggested by some other answers (traverse P1 looking for matching root of P2, traverse both to determine whether the node is in-fact the root of a matching sub-tree, continue with original traversal if it isn't a match). This search operates in O(n * m) where n is the size of P1 and m is the size of P2. With depth checks and other potential optimizations depending on the tree data you have available, this man be optimized a bit, but it's still generally O(n * m). Depending on your specific circumstances, this may be the only reasonable approach.

If you're not memory/storage bound and don't mind a bit of complexity, I believe this could be improved to O(n + m) by reducing to a longest common substring problem with the help of a generalized suffix tree. Some discussion on this for a similar problem can be found here. Maybe when I have more time, I'll come back and edit with more specifics on an implementation.

If given the root of both trees, and given that the nodes are of the same type, why is then just ascertaining that the root of T2 is in T1 not sufficient?

I am assuming that "given a tree T" means given a pointer to the root of T and the data type of the node.

Regards.

I am not sure, whether my idea is correct. Nevertheless, for your persual.

Conduct a postorder walk in Tree 1 & Tree 2, call it P1 and P2.
Compare P1 & P2. If they are different. The tree is not subtree. Exit.
If they are same, or P1 contained in P2. You can decide which one is sub tree.

I think we need to go by brute force, but why do you need to match the trees after you find your root of T2 in T1. Its not same as when we are not supposed to find if the tree are identical.(Then only we need to compare the whole tree)

You are given trees T1 and T2,pointers, not the values.

If Node T2 (which is a pointer), is present in T1 tree.

Then the Tree T2 is subtree of T1.

Edit:

Only if its said that T2 is actually a different Tree by object wise, and we need to find out that if there is a Subtree in T1, which is identical to T2.

Then this wont work.

And we have no other option than to go for the solutions already discussed here.

Let us say we have T1 as parent tree and T2 as a tree which might be a subtree of T1. Do the following. Assumption made is T1 and T2 are binary tree without any balancing factor.

1) Search the root of T2 in T1. If not found T2 is not a subtree. Searching the element in BT will take O(n) time.

2) If the element is found, do pre-order traversal of T1 from the node root element of T2 is found. This will take o(n) time. Do pre-order traversal of T2 as well. Will take O(n) time. Result of pre-order traversal can be stored into a stack. Insertion in stack will take only O(1).

3) If size of two stacks is not equal, T2 is not a subtree.

4) Pop one element from each stack and check for equality. If mismatch occurs, T2 is not a subtree.

5) If all elments matched T2 is a subtree.

I assume that your tree are immutable trees so you never change any subtree (you don't do set-car! in Scheme parlance), but just you are constructing new trees from leaves or from existing trees.

Then I would advise to keep in every node (or subtree) an hash code of that node. In C parlance, declare the tree-s to be

 struct tree_st {
   const unsigned hash;
   const bool isleaf;
   union {
     const char*leafstring; // when isleaf is true
     struct { // when isleaf is false
        const struct tree_st* left;
        const struct tree_st* right;
     };
   };
 };

then compute the hash at construction time, and when comparing nodes for equality first compare their hash for equality; most of the time the hash code would be different (and you won't bother comparing content).

Here is a possible leaf construction function:

struct tree_st* make_leaf (const char*string) {
   assert (string != NULL);
   struct tree_st* t = malloc(sizeof(struct tree_st));
   if (!t) { perror("malloc"); exit(EXIT_FAILURE); };
   t->hash = hash_of_string(string);
   t->isleaf = true;
   t->leafstring = string;
   return t;
}

The function to compute an hash code is

unsigned tree_hash(const struct tree_st *t) {
  return (t==NULL)?0:t->hash;
}

The function to construct a node from two subtrees sleft & sright is

struct tree_st*make_node (const struct tree_st* sleft,
                          const struct tree_st* sright) {
   struct tree_st* t = malloc(sizeof(struct tree_st));
   if (!t) { perror("malloc"); exit(EXIT_FAILURE); };
   /// some hashing composition, e.g.
   unsigned h = (tree_hash(sleft)*313) ^ (tree_hash(sright)*617);
   t->hash = h;
   t->left = sleft;
   t->right = sright;
   return t;
 }

The compare function (of two trees tx & ty) take advantage that if the hashcodes are different the comparands are different

bool equal_tree (const struct tree_st* tx, const struct tree_st* ty) {
  if (tx==ty) return true;
  if (tree_hash(tx) != tree_hash(ty)) return false;
  if (!tx || !ty) return false;
  if (tx->isleaf != ty->isleaf) return false;
  if (tx->isleaf) return !strcmp(tx->leafstring, ty->leafstring);
  else return equal_tree(tx->left, ty->left) 
              && equal_tree(tx->right, ty->right);

}

Most of the time the tree_hash(tx) != tree_hash(ty) test would succeed and you won't have to recurse.

Read about hash consing.

Once you have such an efficient hash-based equal_tree function you could use the techniques mentioned in other answers (or in handbooks).

One of the plain way is to write is_equal() method for tree and do the following,

bool contains_subtree(TNode*other) {
    // optimization
    if(nchildren < other->nchildren) return false;
    if(height < other->height) return false;

    // go for real check
    return is_equal(other) || (left != NULL && left->contains_subtree(other)) || (right != NULL && right->contains_subtree(other));
}

Note that is_equal() can be optimized by using hashcode for the tree. It can be done in simple way by taking height of the tree or number of children or range of the values as hashcode.

bool is_equal(TNode*other) {
    if(x != other->x) return false;
    if(height != other->height) return false;
    if(nchildren != other->nchildren) return false;
    if(hashcode() != other->hashcode()) return false;
    // do other checking for example check if the children are equal ..
}

When the tree is similar to a linked list, it will take O(n) time. We can also use some heuristic while choosing the children to compare.

bool contains_subtree(TNode*other) {
    // optimization
    if(nchildren < other->nchildren) return false;
    if(height < other->height) return false;

    // go for real check
    if(is_equal(other)) return true;
    if(left == NULL || right == NULL) {
          return (left != NULL && left->contains_subtree(other)) || (right != NULL && right->contains_subtree(other));
    }
    if(left->nchildren < right->nchildren) { // find in smaller child tree first
          return (left->contains_subtree(other)) || right->contains_subtree(other);
    } else {
          return (right->contains_subtree(other)) || left->contains_subtree(other);
    }
}

Another way is to serialize both tree as string and find if the second string(serialized from T2) is sub-string of the first string(serialized from T1).

The following code serializes in pre-order.

   void serialize(ostream&strm) {
            strm << x << '(';
            if(left)
                    left->serialize(strm);
            strm << ',';
            if(right)
                    right->serialize(strm);
            strm << ')';
    }

And we can use some optimized algorithm, for example, Knuth–Morris–Pratt algorithm to find(possibly in O(n) time) the existence of the sub-string and eventually find if a tree is a sub-tree of other .

Again the string can be compressed efficiently with Burrows–Wheeler_transform. And it is possible to bzgrep to search sub-string in the compressed data.

Another way is to sort the sub-trees in the tree by height and number of children.

bool compare(TNode*other) {
    if(height != other->height)
        return height < other->height;

    return nchildren < other->nchildren;
}

Note that there will be O(n^2) sub-trees. To reduce the number we can use some range based on the height. For example, we can only be interested about the sub-trees of height 1000 to 1500.

When the sorted data is generated it is possible to do binary search in it and find if it is subset in O(lg n) time(considering that there is no duplicate in sorted data) .

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow