Question

This question is rather vague and I don't really need an answer to it, but I am very curious about what the answer might be, so I'll ask it anyway.

I have an algorithm which generates a huge number of matrices. It later runs a second algorithm over them which generates a solution. I ran it 100 times, and it took an average of ~17 seconds.

The second variant does nearly exactly the same thing, the only difference being that the second algorithm is run over each matrix as soon as it is generated, so the matrices never need to be stored anywhere. This variant obviously needs much less space, which is why I made it, but it also needs an average of only ~2 seconds for the same problem.

I didn't expect it to run faster, especially not that much.

The code is quite big, so I will try to outline the difference in something resembling pseudo-code:

void recursiveFill(vector<Matrix> &cache, Matrix permutation) {
  while (!stopCondition) {
    // generate next matrix from the current permutation
    if (success)
      cache.push_back(permutation); // copies the matrix into the cache
    else
      recursiveFill(cache, permutation);
    // some more code
  }
}

void recursiveCheck(Matrix permutation) {
  while (!stopCondition) {
    // alter the matrix some
    if (success)
      checkAlgorithm(permutation); // check immediately, no storage needed
    else
      recursiveCheck(permutation);
    // some more code
  }
}

After the recursive fill, a loop runs checkAlgorithm over all elements in the cache. Everything I didn't include in the code is identical in both algorithms. I guessed that storing into the vector is what eats up all the time, but if I recall correctly, the size of a C++ vector doubles each time it is overfilled, so a reallocation shouldn't happen too often. Any ideas?


Solution

I guess that the additional time is due to the copying of matrices within the vector. With the times you give, one pass through the data takes 20 or 170 ms, which is the right order of magnitude for a lot of copying.

Remember that, even though the total overhead of copying due to reallocations of the vector is linear, every inserted matrix is copied twice on average: once during insertion, and once again during a later reallocation. Combined with the cache-clobbering effect of copying a large amount of data, this can produce the additional runtime.

Now you might say: but I'm also copying the matrices when I pass them to the recursive call, so shouldn't I expect the first algorithm to take at most three times as long as the second one?
The answer is that a recursive descent is almost perfectly cache friendly as long as it is not hampered by competing uses of the cache for data on the heap. Thus, almost all the copying done during the recursive descent never even reaches the L2 cache. If you clobber your entire cache from time to time by doing a vector reallocation, you resume with an entirely cold cache afterwards.

Other tips

The culprit here is probably temporal locality. Your CPU cache is only so big, so when you save everything off after each run and come back to it later, it has left your CPU caches in the meantime and takes longer (tens to hundreds of cycles) to access. With the second method, the data is right there in L1 (or possibly even in registers) and takes only a cycle or two to access.

In optimization, you generally want to think like the Wu-Tang Clan: Cache Rules Everything Around Me.

People have benchmarked this effect, and copies within cache are often much cheaper than dereferences into main memory.

Strictly speaking, a vector doesn't have to double at each growth; it just needs to grow geometrically to provide the required amortized constant time.

In this case, if you have a sufficiently large number of matrices, the growth and the required data copies could still be the issue. Or it could be swapping while allocating enough memory. The only way to know for sure is to profile on the system where you observe this difference.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow