Question

I'd like to fill an array of 4096 bytes (aligned to the 4096-byte boundary) with zeros in amd64 assembly. I'm looking for both portable and single-CPU-type-only solutions.

I know that rep stosq would do the trick, but is there anything faster? MMX? SSE? How much faster is it? How many bytes can be written to memory in a single instruction (without rep)? We can assume that the memory cache is empty. I don't need a fully working function implementation, I just need the basic idea with its crucial assembly instruction.

I've just seen the movdqa instruction which can write 16 bytes at a time. Is it twice as fast as 2 mov instructions of 8 bytes each?


Solution

The answer to your question can be found by looking at the source code in the file memset64.asm in Agner Fog's asmlib.

His code has versions for AVX and SSE. From what I can tell, the code uses aligned stores, _mm256_store_ps (vmovaps), for arrays smaller than MemsetCacheLimit. For larger arrays it uses non-temporal stores, _mm256_stream_ps (vmovntps). Several other factors can affect the results; see the code for details. You could probably get the same performance in most cases from C/C++ using intrinsic functions.
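To make the two store flavours concrete, here is a rough C sketch using SSE2 intrinsics, which are available on every x86-64 CPU. It is not taken from asmlib: the function names, the 4x unrolling, and the PAGE_SIZE constant are my own assumptions for illustration. _mm_store_si128 compiles to movdqa (a regular aligned 16-byte store), and _mm_stream_si128 compiles to movntdq (a non-temporal store that avoids pulling the page into the cache). The AVX versions would look the same with 32-byte stores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096 };  /* illustrative constant, matches the question */

/* Zero one page with regular aligned 16-byte stores (movdqa).
   The pointer must be 16-byte aligned; a page-aligned buffer always is. */
static void zero_page_sse2(void *page)
{
    assert(((uintptr_t)page & 15) == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)page;
    /* 4096 / 16 = 256 stores, unrolled by 4 (64 bytes per iteration). */
    for (size_t i = 0; i < PAGE_SIZE / sizeof(__m128i); i += 4) {
        _mm_store_si128(p + i + 0, zero);
        _mm_store_si128(p + i + 1, zero);
        _mm_store_si128(p + i + 2, zero);
        _mm_store_si128(p + i + 3, zero);
    }
}

/* Same loop with non-temporal stores (movntdq), which bypass the cache.
   This may help when the page will not be read again soon. */
static void zero_page_sse2_nt(void *page)
{
    assert(((uintptr_t)page & 15) == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)page;
    for (size_t i = 0; i < PAGE_SIZE / sizeof(__m128i); i += 4) {
        _mm_stream_si128(p + i + 0, zero);
        _mm_stream_si128(p + i + 1, zero);
        _mm_stream_si128(p + i + 2, zero);
        _mm_stream_si128(p + i + 3, zero);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}

/* Helper for checking the result: returns 1 if every byte is zero. */
static int page_is_zero(const void *page)
{
    const unsigned char *b = (const unsigned char *)page;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (b[i] != 0) return 0;
    }
    return 1;
}
```

Which variant wins depends on the CPU and on whether the page is used right after zeroing, so it is worth timing both against rep stosq on the target machine.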

Note that both the built-in memset function in GCC and the version in glibc were not optimized the last time I checked (which is one reason memset is in asmlib).

Licensed under: CC-BY-SA with attribution