How do Markov Chains work and what is memorylessness?

Question 1

You can visualize Markov chains like a frog hopping from lily pad to lily pad on a pond. The frog does not remember which lily pad(s) it has previously visited. It also has a given probability for leaping from lily pad Ai to lily pad Aj, for all possible combinations of i and j. The Markov chain allows you to calculate the probability of the frog being on a certain lily pad at any given moment.

If the frog was a vegetarian and nibbled on the lily pad each time it landed on it, then the probability of it landing on lily pad Ai from lily pad Aj would also depend on how many times Ai was visited previously. Then, you would not be able to use a Markov chain to model the behavior and thus predict the location of the frog.

Question 2

The idea of memorylessness is fundamental to the success of Markov chains. It does not mean that we don't care about the past. On contrary, it means that we retain only the most relevant information from the past for predicting the future and use that information to define the present. This nice article provides a good background on the subject http://www.americanscientist.org/issues/pub/first-links-in-the-markov-chain

There is a trade-off between the accuracy of your description of the past and the size of the associated state space. Say, there are three pubs in the neighborhood and every evening you choose one. If you choose those pubs randomly, this is not a Markov chain (or a trivial, zero-order one) – the outcome is random. More precisely, it is an independent random variable (modeling dependency was fundamental to Markov ideas underlying Markov chains).

In your choice of pubs you can factor in your last choice, i.e., which pub you went to the night before. For example, you might want to avoid going to the same pub two days in a row. While in reality this implies remembering where you have been yesterday (and thus remembering the past!), at your modeling level, your unit of time is one day, so your current state is the pub you went to yesterday. This is your classical (first-order) Markov model with three states and 3 by 3 transition matrix that provides conditional probabilities for each permutation (if yesterday you went to pub I, what is the change that today you “hop” to pub J).

However, you can define a model where you “remember” the last two days. In this second-order Markov model “present” state will include the knowledge of the pub from last night and from the night before. Now you have 9 possible states describing your present state, and therefore you have a 9 by 9 transition matrix. Thankfully, this matrix is not fully populated.

To understand why, consider a slightly different setup when you are so well-organized that you make firm plans for your pub choices both for today and tomorrow based on the last two visits. Then, you can select any possible permutations of pubs visited for two days in a row. The result is a fully populated 9 by 9 matrix that maps your choices for the last two days into those for the next two days. However, in our original problem we make the decision every day, so our future state is constrained by what happened today: at the next time step (tomorrow) today becomes yesterday, but it will be still a part of your definition of "today" at that time step, and relevant to what happens the following day. The situation is analogous to moving averages, or receding horizon procedures. As a result, from a given state, you can only move to three possible states (indicating your today’s choice of pubs), which means that each row of your transition matrix will have only three non-zero entries.

Let us tally up the number of parameters that characterize each problem: the zero- order Markov model with three states has two independent parameters (probabilities of hitting the first and the second pub, as the probability of visiting the third pub is the complement of the first two). The first-order Markov model has a fully populated 3 by 3 matrix with each column summing up to one (again, indicating that one of the pubs will always be visited at any given day), so we end up with six independent parameters. The second-order Markov model has 9 by 9 matrix with each row having only 3 non-zero entries and all columns adding to one, so we have 18 independent parameters. We can continue defining higher-order models, and our state space will grow accordingly.

Importantly, we can further refine the concept by identifying important features of the past and use only those features to define the present, i.e. compress the information about the past. This is what I referred to in the beginning. For example, instead of remembering all history we can only keep track only of some memorable events that impact our choice, and use this “sufficient statistics” to construct the model.

It all boils down to the way you define relevant variables (state space), and the Markov concepts naturally follow from the underlying fundamental mathematics concepts. First-order (linear) relationships (and associated linear algebra operations) are at the core of most current mathematical applications. You can look at a polynomial equation on n-th with a single variable, or you can define an equivalent first-order (linear) system of n equations by defining auxiliary variables. Similarly, in classical mechanics you can either use second-order Lagrange equations or choose canonical coordinates that lead to (first-order) Hamiltonian formulation http://en.wikipedia.org/wiki/Hamiltonian_mechanics

Finally, a note on the steady-state vs. transient solutions of Markov problems. An overwhelming amount of practical applications (e.g., Page rank) relies on finding steady-state solutions. Indeed, the presence of such convergence to a steady state was the original motivation for A. Markov for creating his chains in an effort to extend the application of central limit theorem to dependent variables. The transient effects (such as hitting times) of Markov processes are significantly less studied and more obscure. However, it is perfectly valid to consider Markov prediction of the outcomes at the specific point in the future (and not only the converged, “equilibrium” solution).