Does movntdq work?

https://stackoverflow.com/questions/10872124

12-06-2021
Question

My task is to measure RAM read/write speed. I am using inline asm to avoid compiler optimizations. To measure time I use the TSC and the CPU frequency. To move data I use the MOVNTDQ instruction, which bypasses the cache hierarchy.

The problem is in the results. The data rate (per the datasheet) is 800 Mbps, but my test reports a write speed of more than 2000 Mbps.

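// Writes blocks_amount blocks of 128 bytes using non-temporal stores.
// Despite its name, src is the buffer being written; it must be 16-byte aligned.
void memory_notCache_write_128(void* src, long blocks_amount) 
{
    _asm    
    {
        mov     ecx, blocks_amount      // loop counter: number of 128-byte blocks
        mov     esi, src                // current write address
    a20:
        movntdq [esi], xmm0             // eight 16-byte stores = 128 bytes per iteration
        movntdq [esi + 16], xmm1
        movntdq [esi + 32], xmm2
        movntdq [esi + 48], xmm3
        movntdq [esi + 64], xmm4
        movntdq [esi + 80], xmm5
        movntdq [esi + 96], xmm6
        movntdq [esi + 112], xmm7
        add     esi, 128                // advance to the next block
        loop    a20                     // decrement ecx and repeat until it reaches zero
    }
}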

int main()
{ 
    unsigned __int64 tick1, tick2;
    const long nBytes = 32*KByte;   

    char* source = (char*)_mm_malloc(nBytes*sizeof(char),16);

    tick1 =  getTicks();
    memory_notCache_write_128(source, current_times.t128);
    tick2 =  getTicks();

    double time = (double)(tick2-tick1)/(ProcSpeedCalc());
    cout << "Time WRITE_128[seconds]:" << time << endl;
    cout << (double) nBytes / time / MByte << endl;

    return 0;
}

Datasheet of the RAM I used: http://www.alldatasheet.com/datasheet-pdf/pdf/308537/ELPIDA/EBE11UE6ACUA-8G-E.html

Source code (written for the Windows platform): https://bitbucket.org/closed_eyes/ram_speed_for_win/downloads/memory_test.cpp

Solution

You shouldn't use non-temporal operations for this sort of code. The usual way to build a memory performance tester is to choose an access pattern that never hits in the cache. Generally, this is done by looping over a chunk of memory that is much larger than the last-level cache in your system, with a stride equal to the cache line size. If you do this, every access will be a cache miss at all levels. Don't forget, though, that when you read just one byte from memory, the processor fetches a whole cache line, so if you do a 64-bit load on a machine with a 64-byte cache line (very common), you should count 64 bytes as being read from memory.
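The access pattern described above could be sketched roughly as follows. This is only an illustration, not the asker's code: the names `kLineSize`, `ReadResult` and `strided_read` are hypothetical, the 64-byte line size is an assumption, and `std::chrono::steady_clock` is used as a portable stand-in for whichever system timer you prefer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kLineSize = 64;

struct ReadResult {
    std::uint64_t bytes_counted;  // one full cache line per access
    double seconds;               // wall-clock time for the pass
    unsigned checksum;            // sum of the bytes read, keeps the loads observable
};

// Touch one byte in every cache line of buf and count a whole line per touch.
ReadResult strided_read(const std::vector<unsigned char>& buf) {
    assert(buf.size() >= kLineSize);
    // volatile so the compiler cannot drop or merge the loads
    const volatile unsigned char* p = buf.data();
    unsigned checksum = 0;
    std::uint64_t lines = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i + kLineSize <= buf.size(); i += kLineSize) {
        checksum += p[i];
        ++lines;
    }
    const auto stop = std::chrono::steady_clock::now();

    return {lines * kLineSize,
            std::chrono::duration<double>(stop - start).count(),
            checksum};
}
```

For a real measurement the buffer would have to be several times larger than the last-level cache (for example a few hundred megabytes, depending on the machine), and the bandwidth estimate would be `bytes_counted / seconds`.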

Other tips

Yury, your idea of using movntdq to measure 'physical' memory channel bandwidth is correct. I agree with Nathan Binkert on how to measure 'system-wide' memory performance; however, I'd like to elaborate on your original question about the applicability of movntdq in general and on the 800 Mbps confusion.

Short version:

1. movntdq works fine and is OK to use when you want to measure the bandwidth of the 'physical' memory channel.
2. 800 Mbps is a 'bit-lane' specification. Each (of possibly two) memory controller channels is 64 bits wide. Two memory controller channels will deliver close to 1600 MBytes/s raw write performance; however, this still does not match your actual measurements, so please take a look at the details below.
3. Really, stop using rdtsc. Use only QueryPerformanceFrequency and QueryPerformanceCounter for your profiling, and increase the test buffer size if you face problems with measurement precision.
4. Please specify the details of your hardware platform (CPU, number of SODIMMs, etc.) and make sure you don't have any memory overclocking in the BIOS setup.

Longer version.

1. As in the short version: movntdq is okay. The total length of a series of movntdq writes must be a multiple of the CPU cache line size (64 bytes), and the series must start on a 64-byte boundary. Unaligned access invalidates the non-temporal hint of the instruction, so the memory_notCache_write_32 and memory_notCache_write_16 functions are not the right way to use the movntdq instruction.
2. As in the short version: 800 Mbps is the speed of a single bit-lane. The SODIMM data path is 64 bits wide, the same as the CPU/northbridge memory channel. On a platform that supports movntdq there are most probably two memory channels, but they will operate in 'dual-channel' mode only if two matching SODIMMs are installed in the right memory slots on the mainboard. Two channels should effectively give you 1600 MBytes/s, while a single channel will give you 800 MBytes/s. Your actual figures are not that far from the 1600 MBytes/s estimate, but they are still not a close match. This might be due to an incorrect measurement method (see point 3) and/or overclocked memory (not likely, but just in case see point 4).
3. "QueryPerformanceFrequency and QueryPerformanceCounter should be enough for everyone" :) Seriously, just stop using rdtsc at this stage of your project. A timer resolution of 3+ MHz (QueryPerformanceFrequency) is fine when you measure memory write performance over a 10 MByte memory region. Given a theoretical memory bandwidth of 1600 MBytes/s, each tick of a 3 MHz timer corresponds to a 'measurement error' of 533 bytes, which is nothing when writing 10 MBytes. rdtsc is very tricky, mainly because it is not stable over time on CPUs with power management enabled (I am certain you are not on the 2nd/3rd generation of Intel Core CPUs, where rdtsc delivers stable counting). Please start with the system-provided timing functions to get your measurements right. It is also worth checking what value QueryPerformanceFrequency returns on your platform.
4. Since you are measuring physical memory channel bandwidth, it is worth specifying the hardware platform you are using. Please make sure you don't have any manual memory timing settings in the BIOS setup (a 533 MHz memory bus will deliver 2+ GBytes/s of memory bandwidth, given that the memory controller is in dual-channel mode). Given that you are using an embedded system (SODIMM), there's a chance that the memory controller settings are tweaked in the BIOS. Just double-check that there are no overclocking settings.
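The timer-resolution estimate in point 3 can be reproduced with a few lines of arithmetic. The sketch below only restates that calculation with the figures quoted above (1600 MBytes/s, a 3 MHz timer, a 10 MByte buffer); the function names `bytes_per_tick` and `relative_error` are made up for illustration.

```cpp
#include <cassert>

// Bytes transferred during a single timer tick at a given bandwidth.
double bytes_per_tick(double bandwidth_bytes_per_s, double timer_hz) {
    assert(timer_hz > 0.0);
    return bandwidth_bytes_per_s / timer_hz;
}

// Fraction of the whole transfer that one tick of timing uncertainty represents.
double relative_error(double bandwidth_bytes_per_s, double timer_hz,
                      double buffer_bytes) {
    assert(buffer_bytes > 0.0);
    return bytes_per_tick(bandwidth_bytes_per_s, timer_hz) / buffer_bytes;
}
```

With those figures, `bytes_per_tick(1600e6, 3e6)` comes to roughly 533 bytes, and `relative_error(1600e6, 3e6, 10e6)` is on the order of 0.005 %, which is why a larger buffer makes the timer granularity negligible.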

In conclusion: don't use rdtsc; use only QueryPerformanceFrequency and QueryPerformanceCounter, keep using the aligned versions of the memory writes with movntdq, and check the configuration of your embedded system. I would also strongly recommend avoiding inline assembly completely and switching to _mm_stream_si128 instead (http://msdn.microsoft.com/en-us/library/ba08y07y.aspx)
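As a rough illustration of that last recommendation, a write loop based on `_mm_stream_si128` might look like the sketch below. It assumes an x86 target with SSE2 and a 64-byte cache line; `stream_fill` and `kCacheLine` are hypothetical names, and the asserts simply encode the alignment rules from point 1 above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_stream_si128, _mm_set1_epi8

// Assumed cache-line size in bytes.
constexpr std::size_t kCacheLine = 64;

// Fill `bytes` bytes at `dst` with `value` using non-temporal 128-bit stores.
// Requires: dst is 64-byte aligned and bytes is a multiple of 64.
// Returns the number of bytes written.
std::size_t stream_fill(void* dst, std::size_t bytes, __m128i value) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % kCacheLine == 0);
    assert(bytes % kCacheLine == 0);

    __m128i* p = static_cast<__m128i*>(dst);
    __m128i* const end = p + bytes / sizeof(__m128i);
    for (; p != end; p += 4) {  // four 16-byte stores cover one 64-byte line
        _mm_stream_si128(p + 0, value);
        _mm_stream_si128(p + 1, value);
        _mm_stream_si128(p + 2, value);
        _mm_stream_si128(p + 3, value);
    }
    _mm_sfence();  // order the streaming stores before any later reads or timing
    return bytes;
}
```

A buffer obtained with `_mm_malloc(size, 64)`, as in the question but with 64-byte instead of 16-byte alignment, satisfies the alignment requirement; timing a call to `stream_fill` over a buffer of 10 MBytes or more and dividing the returned byte count by the elapsed time gives the write bandwidth.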

License: CC-BY-SA with attribution

Not affiliated with StackOverflow