Question

I have been reading a paper Finding Maximal Pairs with Bounded Gap:

An in there, there is a sentence (page 6 second paragraph):

The “smaller-half trick” is used in several methods for finding tandem repeats, e.g. [2, 5, 26]. It says that the sum over all nodes v in an arbitrary binary tree of size n of terms that are $O(n_{1})$, where $n_{1}\leq n_{2}$ are the numbers of leaves in the subtrees rooted at the two children of v, is $O(n\log n)$

Although it sounds very straightforward, I just cannot see why this is true. Could someone with some experience regarding this trick please explain why this is true? I read several other papers but just don't see it.

Was it helpful?

Solution

We prove by induction on $n$ that your sum (with $O(n_1)$ replaced with $n_1$, only taken over non-leaves) is at most $n\log_2 n$. We use the convention $0\log_2 0 = 0$. The base case $n$ = 0 trivially holds, since the sum is empty. Now suppose we have a node with subtrees of sizes $n_1 \leq n_2$; this implies that $n = n_1 + n_2 + 1$. The sum is at most $$ S = n_1\log n_1 + n_2\log n_2 + n_1. $$ Let's concentrate on the first two terms: $$ \begin{align*} n_1\log n_1 + n_2\log n_2 &= (n_1 + n_2) \left[ \frac{n_1}{n_1+n_2} \log n_1 + \frac{n_2}{n_1+n2} \log n_2 \right] \\ &= (n_1 + n_2) \left[ \frac{n_1}{n_1+n_2} \log \frac{n_1}{n_1+n_2} + \frac{n_2}{n_1+n_2} \log \frac{n_2}{n_1+n_2} + \log (n_1+n_2) \right] \\ &= (n_1+n_2) [\log(n_1+n_2) - h(\tfrac{n_1}{n_1+n_2})], \end{align*} $$ where $h(p)$ is the binary entropy function. Let $p = n_1/(n_1+n_2) \leq 1/2$. Then we can bound $S$ by $$ n \log n + (n_1+n_2)(p-h(p)). $$ One checks that $p \leq h(p)$ for $0 \leq p \leq 1/2$, completing the proof. (In fact, $2p \leq h(p)$, so we can get the better bound $\tfrac{1}{2}n\log_2 n$, which is nearly tight for complete binary trees.)

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top