I guess that the additional time is due to the copying of matrices within the vector. With the times you give, one pass through the data takes 20 or 170 ms, which is of the right order of magnitude for a lot of copying.
Remember that, even though the overhead of copying due to reallocations of the vector is amortized linear, every inserted matrix is copied twice on average: once during insertion, and roughly once more over the course of all reallocations. In conjunction with the cache-clobbering effect of copying a large amount of data, this can account for the additional runtime.
Now you might say: but I'm also copying the matrices when I pass them to the recursive call, so shouldn't I expect the first algorithm to take at most three times the time of the second one?
The answer is that any recursive descent is perfectly cache friendly as long as its cache utilization is not hampered by data accesses on the heap. Thus, almost all the copying done in the recursive descent does not even reach the L2 cache. If you clobber your entire cache from time to time by doing a vector reallocation, you will resume with an entirely cold cache afterwards.