Flushing the L2 cache on ARM (Tegra 3) unexpectedly increases performance

https://stackoverflow.com/questions/14822275

09-03-2022

Question

In this scenario:

1. I receive about 1 MB of data over TCP into a buffer, in a separate thread.
2. Decompressing this buffer yields about 2 MB of data.
3. I make two calls to the same dynamically linked library function, of which I own the source. It is basically a bunch of FFTs using FFTW 3.3.3 with NEON support. I consider the first set of FFTs cold and the second hot.

The cold run is about 200 ms slower than the hot run:

Cold: 570 ms
Hot: 260 ms

If 1.) and 2.) are replaced by a read from a file of the exact same data, the hot and cold runs are equally fast.

If I reduce the network data to about 200K, and thus the decompressed data to 400K, performance is the same between cold and hot.

If I perform an L2 flush* immediately after 2.) the cold performance increases to be the same as the hot performance. I don't understand this. I've tried changing many compiler options, and as long as the optimizer is used I see this behavior.

If I flush less of the cache, then the performance of the cold run worsens proportionally.

*Here is the code that I'm using to attempt to flush the 1MB L2 cache:

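const size_t cache_size = 1024 * 1024;  // 1 MB, the size of the L2 cache
char *cache = new char[cache_size];
cache[0] = 0;  // new char[] leaves the contents indeterminate; seed the chain
srand(1);
// Write every byte so the buffer's cache lines displace what is in L2.
for (size_t dc = 1; dc < cache_size; ++dc)
{
    // Multiply as unsigned: rand() * 255 can overflow a signed int.
    cache[dc] = static_cast<char>(cache[dc - 1] + static_cast<unsigned>(rand()) * 255u);
}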

Using a giant switch statement to flush the instruction cache does not have much of an effect. I used 32,000 cases, as shown in "How can I cause an instruction cache miss?"

If I copy the data that the FFTs operate on to a duplicate structure in a different part of memory, and then operate on the copy, it does not have the same effect as flushing L2.

I'd like to understand what's going on. I've forced process and thread affinity to a single core on the Tegra 3. What other easily accessible measurements can I make or view?

Answer

If the cache is configured for write-back (which typically performs better), then your flush causes all the output of the first FFTW to be written to memory up front, rather than waiting until each cache line is needed by the second FFTW. So the cache is empty rather than polluted with dirty lines.
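One way to probe this explanation is to time the same workload after dirtying a buffer, once with and once without an eviction pass in between. The following is a minimal sketch under stated assumptions, not the asker's code: `workload`, `dirty_buffer`, `evict_l2`, and `time_workload_us` are hypothetical names, `workload` is a simple stand-in for the FFT calls, and the eviction pass only reads a scratch buffer (a variation on the write loop in the question) so that the lines it leaves behind are clean. Whether any difference shows up depends on the hardware and cache policy.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real work (the FFT calls in the question).
// Summing the buffer gives a result the compiler cannot discard.
std::uint64_t workload(const std::vector<unsigned char>& data) {
    std::uint64_t sum = 0;
    for (unsigned char b : data) sum += b;
    return sum;
}

// Write every byte of buf so its cache lines become dirty, mimicking a
// step (such as decompression) that fills a buffer before the timed work.
void dirty_buffer(std::vector<unsigned char>& buf, unsigned char seed) {
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = static_cast<unsigned char>(seed + i);
}

// Read a scratch buffer that should be at least as large as L2 (assumed
// 1 MB on Tegra 3, per the question). Reading it tends to displace older
// lines; in a write-back cache, displaced dirty lines are written to memory.
// The returned sum keeps the optimizer from removing the loop.
std::uint64_t evict_l2(const std::vector<unsigned char>& scratch) {
    assert(!scratch.empty());
    std::uint64_t sum = 0;
    for (unsigned char b : scratch) sum += b;
    return sum;
}

// Time one call of workload() in microseconds; the result goes to out.
long long time_workload_us(const std::vector<unsigned char>& data,
                           std::uint64_t& out) {
    auto t0 = std::chrono::steady_clock::now();
    out = workload(data);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

To use it, call `dirty_buffer` on a 2 MB buffer, optionally call `evict_l2` on a separate 1 MB scratch buffer, then compare the values returned by `time_workload_us` for the two cases over several repetitions.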

License: CC-BY-SA with attribution

Not affiliated with StackOverflow