Problem

I made a post some time ago asking about a good design for LRU caching (in C++). You can find the question, the answers and some code there:

Better understanding the LRU algorithm

I have now tried to multi-thread this code (using pthread) and got some really unexpected results. Before even attempting to use locking, I created a system in which each thread accesses its own cache (see code). I run this code on a 4-core processor. I tried running it with 1 thread and with 4 threads. With 1 thread I do 1 million lookups in the cache; with 4 threads, each thread does 250K lookups. I was expecting a time reduction with 4 threads but got the opposite: 1 thread runs in 2.2 seconds, 4 threads run in more than 6 seconds?? I just can't make sense of this result.

Is something wrong with my code? Can this be explained somehow (e.g., thread management takes time)? It would be great to have feedback from experts. Thanks a lot.

I compile this code with: c++ -o cache cache.cpp -std=c++0x -O3 -lpthread

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>

#include <list>

#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map> 

#include <stdint.h>
#include <iostream>

typedef uint32_t data_key_t;

using namespace std;
//using namespace std::tr1;

class TileData
{
public:
    data_key_t theKey;
    float *data;
    static const uint32_t tileSize = 32;
    static const uint32_t tileDataBlockSize;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        float *data = new float [tileSize * tileSize * tileSize];
    }
    ~TileData()
    { 
        /* std::cerr << "delete " << theKey << std::endl; */
        if (data) delete [] data;   
    }
};

typedef shared_ptr<TileData> TileDataPtr;   // automatic memory management!

TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
    return shared_ptr<TileData>(new TileData(theKey));
}

class CacheLRU
{
public:
    list<TileDataPtr> linkedList;
    unordered_map<data_key_t, TileDataPtr> hashMap; 
    CacheLRU() : cacheHit(0), cacheMiss(0) {}
    TileDataPtr getData(data_key_t theKey)
    {
        unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
        if (iter != hashMap.end()) {
            TileDataPtr ret = iter->second;
            linkedList.remove(ret);
            linkedList.push_front(ret);
            ++cacheHit;
            return ret;
        }
        else {
            ++cacheMiss;
            TileDataPtr ret = loadDataFromDisk(theKey);
            linkedList.push_front(ret);
            hashMap.insert(make_pair<data_key_t, TileDataPtr>(theKey, ret));
            if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
                const TileDataPtr dropMe = linkedList.back();
                hashMap.erase(dropMe->theKey);
                linkedList.remove(dropMe);
            }
            return ret;
        }

    }
    static const uint32_t MAX_LRU_CACHE_SIZE = 100;
    uint32_t cacheMiss, cacheHit;
};

int numThreads = 1;

void *testCache(void *data)
{
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    double t = clock();
    printf("Starting thread, lookups %d\n", (int)(1000000.f / numThreads));
    CacheLRU *cache = new CacheLRU;
    for (uint32_t i = 0; i < (int)(1000000.f / numThreads); ++i) {
        int key = random() % 300;
        TileDataPtr tileDataPtr = cache->getData(key);
    }
    std::cerr << "Time (sec): " << (clock() - t) / CLOCKS_PER_SEC << std::endl;
    delete cache;
    return NULL;
}

int main()
{
    int i;
    pthread_t thr[numThreads];
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);
#if 0
    CacheLRU *c1 = new CacheLRU;
    (*testCache)(c1);
#else
    for (int i = 0; i < numThreads; ++i) {
        pthread_create(&thr[i], NULL, testCache, (void*)NULL);
        //pthread_detach(thr[i]);
    }

    for (int i = 0; i < numThreads; ++i) {
        pthread_join(thr[i], NULL);
        //pthread_detach(thr[i]);
    }
#endif  

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
           tv2.tv_usec - tv1.tv_usec);

    return 0;
}

Solution

A thousand apologies: while continuing to debug the code, I realised I had made a really bad beginner's mistake. If you look at this code:

TileData(const data_key_t &key) : theKey(key), data(NULL)
{
    float *data = new float [tileSize * tileSize * tileSize];
}

from the TileData class: data is supposed to be a member variable of the class, but the constructor declares a new local float* that shadows it (so the member is left NULL)... So the right code should be:

class TileData
{
public:
    float *data;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        data = new float [tileSize * tileSize * tileSize];
        numAlloc++;   // allocation counter declared elsewhere in my code (not shown in this snippet)
    }
};

I am so sorry about that! It's a mistake I have made in the past, and I guess prototyping is great, but it sometimes leads to such stupid mistakes. I ran the code with 1 and 4 threads and now see the speedup: 1 thread takes about 2.3 seconds, 4 threads take 0.92 seconds. Thanks all for your help, and sorry if I made you lose your time ;-)

Other tips

I don't have a concrete answer yet. I can think of several possibilities. One is that testCache() is using random(), which is almost certainly implemented with a single global mutex. Thus all of your threads are competing for the mutex, whose state is now ping-ponging between the cores' caches. (That's assuming that random() is actually thread-safe on your system.)
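One way to test that hypothesis is to give each thread its own generator state, for instance with the POSIX rand_r() function, so no lock is shared. A minimal sketch (lookupLoop is a hypothetical stand-in for testCache, not code from the question):

#include <pthread.h>
#include <cstdio>
#include <cstdlib>

static void *lookupLoop(void *arg)
{
    unsigned int seed = (unsigned int)(size_t)arg;  // per-thread seed: no global RNG state
    long sum = 0;
    for (int i = 0; i < 250000; ++i)
        sum += rand_r(&seed) % 300;                 // rand_r() only touches the local seed
    printf("thread %zu done (sum=%ld)\n", (size_t)arg, sum);
    return NULL;
}

int main()
{
    pthread_t thr[4];
    for (size_t i = 0; i < 4; ++i)
        pthread_create(&thr[i], NULL, lookupLoop, (void *)i);
    for (size_t i = 0; i < 4; ++i)
        pthread_join(thr[i], NULL);
    return 0;
}

If this scales with the number of threads while the original code does not, the shared random() state is a likely culprit.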

Next, testCache() is accessing a CacheLRU which is implemented with unordered_maps and shared_ptrs. The unordered_maps, in particular, might be implemented with some kind of global mutex underneath that is causing all of your threads to compete for access.

To really diagnose what is going on here you should do something much simpler inside of testCache(). (First try just taking the sqrt() of an input variable 250K times (vs. 1M times). Then try linearly accessing a C array of size 250K (or 1M). Slowly build up to the complex thing you are currently doing.)
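For example, a stripped-down worker along these lines (a sketch, not from the original answer) does nothing but arithmetic, so with 4 threads each doing 250K iterations it should run close to 4x faster than 1 thread doing 1M, if thread creation and scheduling are not the bottleneck:

#include <cmath>

static void *sqrtLoop(void *arg)
{
    double x = *(double *)arg;        // input value supplied by the caller
    volatile double acc = 0.0;        // volatile so the compiler cannot drop the loop
    for (int i = 0; i < 250000; ++i)
        acc += sqrt(x + i);
    return NULL;
}

Once that behaves as expected, reintroduce the array access, then the unordered_map, then the shared_ptr, and see which step stops scaling.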

Another possibility has to do with the pthread_join. pthread_join doesn't return until all the threads are done, so if one is taking longer than the others, you are measuring the slowest one. Your computation here seems balanced, but perhaps your OS is doing something unexpected? (Like mapping several threads to one core (perhaps because you have a hyper-threaded processor?), or one thread moving from one core to another in the middle of the run (perhaps because the OS thinks it is smart when it is not).)
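On Linux with glibc you can rule out thread migration by pinning each thread to its own core with pthread_setaffinity_np(), a non-portable GNU extension. A sketch, assuming a 4-core machine:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pinToCore(pthread_t thr, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);                                 // start with an empty CPU set
    CPU_SET(core, &set);                            // allow only the given core
    pthread_setaffinity_np(thr, sizeof(set), &set); // apply the mask to the thread
}

// usage: after pthread_create(&thr[i], NULL, testCache, NULL), call pinToCore(thr[i], i);

If the 4-thread time improves with pinning, the scheduler was moving threads around (or packing them onto hyper-threaded siblings).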

This will be a bit of a "build it up" answer. I'm running your code on a Fedora 16 Linux system with a 4-core AMD CPU and 16GB of RAM.

I can confirm that I'm seeing similar "slower with more threads" behaviour. I removed the random function, which doesn't improve things at all.

I'm going to make some other minor changes.
