Fastest way to memset on modern amd64 CPUs

https://stackoverflow.com/questions/22363305

13-06-2023

Question

I'd like to fill an array of 4096 bytes (aligned to the 4096-byte boundary) with zeros in amd64 assembly. I'm looking for both portable and single-CPU-type-only solutions.

I know that rep stosq would do the trick, but is there anything faster? MMX? SSE? How much faster is it? How many bytes can be written to memory in a single instruction (without rep)? We can assume that the memory cache is empty. I don't need a fully working function implementation, I just need the basic idea with its crucial assembly instruction.

I've just seen the movdqa instruction which can write 16 bytes at a time. Is it twice as fast as 2 mov instructions of 8 bytes each?

Answer

The answer to your question can be found by looking at the source code in the file memset64.asm in Agner Fog's asmlib.

His code has versions for AVX and SSE. From what I can tell, for arrays smaller than MemsetCacheLimit the code uses the equivalent of _mm256_store_ps (vmovaps), i.e. regular aligned stores. For larger arrays it switches to non-temporal stores, the equivalent of _mm256_stream_ps (vmovntps). Several other factors can affect the results; see the code. You could probably get the same performance in most cases from C/C++ using intrinsic functions.
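As a rough illustration of the two store strategies described above, here is a minimal C sketch using SSE2 intrinsics. It is not Agner Fog's code: the function names zero_cached and zero_streaming are made up for this example, and SSE2 (16-byte movdqa / movntdq) is used instead of AVX only because it is baseline on amd64 and compiles without extra flags. The AVX variant would use _mm256_store_ps and _mm256_stream_ps on 32-byte chunks in the same pattern. Both functions assume the buffer is 16-byte aligned and its size is a multiple of 64 bytes, which holds for a 4096-byte page.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, available on every amd64 CPU */
#include <stddef.h>

/* Hypothetical helper: zero `bytes` bytes with regular aligned stores
   (movdqa). Assumes dst is 16-byte aligned and bytes is a multiple of 64. */
static void zero_cached(void *dst, size_t bytes)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    size_t n = bytes / sizeof(__m128i);

    /* Unrolled by four: one 64-byte cache line per iteration. */
    for (size_t i = 0; i < n; i += 4) {
        _mm_store_si128(p + i + 0, zero);
        _mm_store_si128(p + i + 1, zero);
        _mm_store_si128(p + i + 2, zero);
        _mm_store_si128(p + i + 3, zero);
    }
}

/* Hypothetical helper: same loop with non-temporal stores (movntdq),
   which write around the cache. Same alignment and size assumptions. */
static void zero_streaming(void *dst, size_t bytes)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    size_t n = bytes / sizeof(__m128i);

    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_si128(p + i + 0, zero);
        _mm_stream_si128(p + i + 1, zero);
        _mm_stream_si128(p + i + 2, zero);
        _mm_stream_si128(p + i + 3, zero);
    }
    /* Non-temporal stores are weakly ordered; fence before the data
       is handed to other code that relies on store ordering. */
    _mm_sfence();
}
```

Which of the two is faster for a single 4096-byte page depends on the CPU and on whether the page is read again soon afterwards, so it is worth measuring both against rep stosq on the target machine.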

Note that both the built-in memset function in GCC and the version in glibc were, last I checked, not optimized (which is one reason memset is in asmlib).

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow