Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?

https://stackoverflow.com/questions/7259363

17-01-2021
|

Question

I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The Experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times and read and write each byte in the array. Measure the elapsed time to do the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

The Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8Mb, so maybe 10Mb array x is not big enough to eliminate cache effecs. I tried 100Mb array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

Solution 2

Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.

OTHER TIPS

The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads due to HT. (unless you disabled it)

Another thing:

The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.

On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency. Try random accessing a block of 1 GB instead of streaming it sequentially.

Last thing:

Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:

c = ((char*)x)[j];
((char*)x)[j] = c;

Something like this will guarantee that it won't be eliminated by the compiler:

((char*)x)[j] += 1;

Thanks for this benchmark code. I've taken your 'fixed' version and changed it to pure C + OpenMP and added a few tests for how the memory system behaves under contention. You can find the new code here.

Here are some sample results from a Quad Opteron:

num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB

sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s

all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s

two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s

two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s

two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s

two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s

two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s

two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s

two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s

two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s

two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s

two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s

two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s

two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s

two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s

two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s

two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s

two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s

two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s

two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s

two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s

two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s

two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s

two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s

two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s

two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s

two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s

two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s

two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s

two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s

two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s

two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s

two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s

If someone has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (likely off by a--hopefully constant--integer factor).

a few comments:

to see the NUMA structure of your system (on linux), you can get a graphical overview using the lstopo utility from the hwloc library. In particular, you'll see which core numbers are members of which NUMA node (processor socket)
char is probably not the ideal data type to measure the maximum RAM throughput. I suspect that using a 32 bit or 64 bit data type, you can get more data through with the same number of cpu cycles.
More generally, you should also check that your measurement is not limited by the CPU speed but by the RAM speed. The ramspeed utility for example unrolls the inner loop explicitly to some extent in the source code:
```
for(i = 0; i < blk/sizeof(UTL); i += 32) {
    b[i] = a[i];        b[i+1] = a[i+1];
    ...
    b[i+30] = a[i+30];  b[i+31] = a[i+31];
}
```
EDIT: on supported architectures ramsmp actually even uses 'hand written' assembly code for these loops
L1/L2/L3 Cache effects: It is instructive to measure the bandwidth in GByte/s as function of the block size. You should see that roughly four different speeds when increasing the block size corresponding to where you are reading the data from (caches or main memory). Your processor seems to have 8 MByte of Level3 (?) cache, so your 10 Million bytes might just mostly stay in the L3 cache (which is shared among all cores of one processor).
Memory channels: Your processor has 3 memory channels. If your memory banks are installed such you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I saw effects that when reading with one thread only, the asymptotic bandwidth is close to the one of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads, the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.

You can also you use numactl to choose which node to run the process on and where to allocate memory from:

numactl --cpubind=0 --membind=1 <process>

I use this combined with LMbench to get memory latency numbers:

numactl --cpubind=0 --membind=0  ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1  ./lat_mem_rd -t 512

If anyone else wants to try this test, here is the modified, working program. I would love to see results from other hardware. This Works On My Machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(5);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow