I guess that the additional time is due to the copying of matrices within the vector. With the times you give, one pass through the data takes 20 or 170 ms, which is of the right order of magnitude for a lot of copying.
Remember that, even though the overhead of copying due to reallocations of the vector is amortized linear, every inserted matrix is copied twice on average: once during insertion, and roughly once more over the course of all reallocations. In conjunction with the cache-clobbering effect of copying a large amount of data, this can account for the additional runtime.
Now you might say: but I'm also copying the matrices when I pass them to the recursive call, so shouldn't I expect the first algorithm to take at most three times the time of the second one?
The answer is that any recursive descent is perfectly cache friendly as long as its cache utilization is not hampered by data accesses on the heap. Thus, almost all the copying done in the recursive descent does not even reach the L2 cache. If you clobber your entire cache from time to time by doing a vector reallocation, you will resume with an entirely cold cache afterwards.