Problem

I made a post some time ago asking about a good design for LRU caching (in C++). You can find the question, the answers and some code there:

Better understanding the LRU algorithm

I have now tried to multi-thread this code (using pthread) and got some really unexpected results. Before even attempting to use locking, I created a system in which each thread accesses its own cache (see code). I run this code on a 4-core processor. I tried running it with 1 thread and with 4 threads. With 1 thread I do 1 million lookups in the cache; with 4 threads, each thread does 250K lookups. I was expecting a time reduction with 4 threads but got the opposite: 1 thread runs in 2.2 seconds, 4 threads run in more than 6 seconds?? I just can't make sense of this result.

Is something wrong with my code? Can this be explained somehow (e.g., thread management takes time)? It would be great to have feedback from experts. Thanks a lot.

I compile this code with: c++ -o cache cache.cpp -std=c++0x -O3 -lpthread

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>

#include <list>

#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map> 

#include <stdint.h>
#include <iostream>

typedef uint32_t data_key_t;

using namespace std;
//using namespace std::tr1;

class TileData
{
public:
    data_key_t theKey;
    float *data;
    static const uint32_t tileSize = 32;
    static const uint32_t tileDataBlockSize;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        float *data = new float [tileSize * tileSize * tileSize];
    }
    ~TileData()
    { 
        /* std::cerr << "delete " << theKey << std::endl; */
        if (data) delete [] data;   
    }
};

typedef shared_ptr<TileData> TileDataPtr;   // automatic memory management!

TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
    return shared_ptr<TileData>(new TileData(theKey));
}

class CacheLRU
{
public:
    list<TileDataPtr> linkedList;
    unordered_map<data_key_t, TileDataPtr> hashMap; 
    CacheLRU() : cacheHit(0), cacheMiss(0) {}
    TileDataPtr getData(data_key_t theKey)
    {
        unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
        if (iter != hashMap.end()) {
            TileDataPtr ret = iter->second;
            linkedList.remove(ret);
            linkedList.push_front(ret);
            ++cacheHit;
            return ret;
        }
        else {
            ++cacheMiss;
            TileDataPtr ret = loadDataFromDisk(theKey);
            linkedList.push_front(ret);
            hashMap.insert(make_pair<data_key_t, TileDataPtr>(theKey, ret));
            if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
                const TileDataPtr dropMe = linkedList.back();
                hashMap.erase(dropMe->theKey);
                linkedList.remove(dropMe);
            }
            return ret;
        }

    }
    static const uint32_t MAX_LRU_CACHE_SIZE = 100;
    uint32_t cacheMiss, cacheHit;
};

int numThreads = 1;

void *testCache(void *data)
{
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    double t = clock();
    printf("Starting thread, lookups %d\n", (int)(1000000.f / numThreads));
    CacheLRU *cache = new CacheLRU;
    for (uint32_t i = 0; i < (int)(1000000.f / numThreads); ++i) {
        int key = random() % 300;
        TileDataPtr tileDataPtr = cache->getData(key);
    }
    std::cerr << "Time (sec): " << (clock() - t) / CLOCKS_PER_SEC << std::endl;
    delete cache;
    return NULL;
}

int main()
{
    int i;
    pthread_t thr[numThreads];
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);
#if 0
    CacheLRU *c1 = new CacheLRU;
    (*testCache)(c1);
#else
    for (int i = 0; i < numThreads; ++i) {
        pthread_create(&thr[i], NULL, testCache, (void*)NULL);
        //pthread_detach(thr[i]);
    }

    for (int i = 0; i < numThreads; ++i) {
        pthread_join(thr[i], NULL);
        //pthread_detach(thr[i]);
    }
#endif  

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
           tv2.tv_usec - tv1.tv_usec);

    return 0;
}

Solution

A thousand apologies: while continuing to debug the code, I realised I had made a really bad beginner's mistake. If you look at this code:

TileData(const data_key_t &key) : theKey(key), data(NULL)
{
    float *data = new float [tileSize * tileSize * tileSize];
}

from the TileData class: data is supposed to be a member variable of the class, but the constructor declares a new local float* that shadows it (so the member is left NULL)... So the right code should be:

class TileData
{
public:
    float *data;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        data = new float [tileSize * tileSize * tileSize];
        numAlloc++;   // allocation counter declared elsewhere in my code (not shown in this snippet)
    }
};

I am so sorry about that! It's a mistake I have made in the past, and I guess prototyping is great, but it sometimes leads to such stupid mistakes. I ran the code with 1 and 4 threads and now see the speedup: 1 thread takes about 2.3 seconds, 4 threads take 0.92 seconds. Thanks all for your help, and sorry if I made you lose your time ;-)

Other tips

I don't have a concrete answer yet. I can think of several possibilities. One is that testCache() is using random(), which is almost certainly implemented with a single global mutex. Thus all of your threads are competing for the mutex, whose state is now ping-ponging between the cores' caches. (That's assuming that random() is actually thread-safe on your system.)
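One way to test that hypothesis is to give each thread its own generator state, for instance with the POSIX rand_r() function, so no lock is shared. A minimal sketch (lookupLoop is a hypothetical stand-in for testCache, not code from the question):

#include <pthread.h>
#include <cstdio>
#include <cstdlib>

static void *lookupLoop(void *arg)
{
    unsigned int seed = (unsigned int)(size_t)arg;  // per-thread seed: no global RNG state
    long sum = 0;
    for (int i = 0; i < 250000; ++i)
        sum += rand_r(&seed) % 300;                 // rand_r() only touches the local seed
    printf("thread %zu done (sum=%ld)\n", (size_t)arg, sum);
    return NULL;
}

int main()
{
    pthread_t thr[4];
    for (size_t i = 0; i < 4; ++i)
        pthread_create(&thr[i], NULL, lookupLoop, (void *)i);
    for (size_t i = 0; i < 4; ++i)
        pthread_join(thr[i], NULL);
    return 0;
}

If this scales with the number of threads while the original code does not, the shared random() state is a likely culprit.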

Next, testCache() is accessing a CacheLRU which is implemented with unordered_maps and shared_ptrs. The unordered_maps, in particular, might be implemented with some kind of global mutex underneath that is causing all of your threads to compete for access.

To really diagnose what is going on here you should do something much simpler inside of testCache(). (First try just taking the sqrt() of an input variable 250K times (vs. 1M times). Then try linearly accessing a C array of size 250K (or 1M). Slowly build up to the complex thing you are currently doing.)
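For example, a stripped-down worker along these lines (a sketch, not from the original answer) does nothing but arithmetic, so with 4 threads each doing 250K iterations it should run close to 4x faster than 1 thread doing 1M, if thread creation and scheduling are not the bottleneck:

#include <cmath>

static void *sqrtLoop(void *arg)
{
    double x = *(double *)arg;        // input value supplied by the caller
    volatile double acc = 0.0;        // volatile so the compiler cannot drop the loop
    for (int i = 0; i < 250000; ++i)
        acc += sqrt(x + i);
    return NULL;
}

Once that behaves as expected, reintroduce the array access, then the unordered_map, then the shared_ptr, and see which step stops scaling.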

Another possibility has to do with the pthread_join. pthread_join doesn't return until all the threads are done, so if one is taking longer than the others, you are measuring the slowest one. Your computation here seems balanced, but perhaps your OS is doing something unexpected? (Like mapping several threads to one core (perhaps because you have a hyper-threaded processor?), or one thread moving from one core to another in the middle of the run (perhaps because the OS thinks it is smart when it is not).)
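On Linux with glibc you can rule out thread migration by pinning each thread to its own core with pthread_setaffinity_np(), a non-portable GNU extension. A sketch, assuming a 4-core machine:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pinToCore(pthread_t thr, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);                                 // start with an empty CPU set
    CPU_SET(core, &set);                            // allow only the given core
    pthread_setaffinity_np(thr, sizeof(set), &set); // apply the mask to the thread
}

// usage: after pthread_create(&thr[i], NULL, testCache, NULL), call pinToCore(thr[i], i);

If the 4-thread time improves with pinning, the scheduler was moving threads around (or packing them onto hyper-threaded siblings).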

This will be a bit of a "build it up" answer. I'm running your code on a Fedora 16 Linux system with a 4-core AMD CPU and 16GB of RAM.

I can confirm that I'm seeing similar "slower with more threads" behaviour. I removed the random function, which doesn't improve things at all.

I'm going to make some other minor changes.
