Where does the base come from in a subsequence count?
-
29-09-2020 - |
Question
If sequence X
has m
symbols, then X
has 2m possible subsequences.
Where does the base (2
) come from?
The origin of this question is the following paragraph:
We are given two ordered sequences of symbols, and we wish to find a longest common subsequence of $X$ and $Y$.
A subsequence of $X$ is just $X$ with some (or possibly all or none) of its elements removed. For example, one subsequence of
A; B; C; D; E; F; G
would beB; C; E; G
.The length of a longest common subsequence of $X$ and $Y$ gives one measure of how similar these two sequences are. For example, if the two sequences are base pairs in DNA strands, then we might consider them similar if they have a long common subsequence.
If $X$ has $m$ symbols and $Y$ has $n$ symbols, then $X$ and $Y$ have $2^m$ and $2^n$ possible subsequences respectively.
Solution
You can think of a subsequence $Y$ of $X$ as a characteristic vector $\chi_Y$ that has $m$ elements, one per element of $X$. The $i$-th element of $\chi_Y$ is $1$ if the $i$-th element of $X$ is also in $Y$ and $0$ otherwise.
It is easy to see that this is a bijection: every subsequence $Y$ has a distinct characteristic vector $\chi_Y$, and every characteristic vector $\chi_Y$ is associated to a distinct sequence $Y$.
This means that the number of subsequences $Y$ is equal to the number of vectors $\chi_Y$. Since each vector $\chi_Y$ has $m$ elements, each of which can have one of $2$ possible values, this number is $2^m$.
A concrete example of a sequence $X$, a subsequence $Y$, and its characteristic vector $\chi_Y$ is the following: