The answer to your question can be found by looking at the source code in the file memset64.asm
in Agner Fog's asmlib.
His code has versions for AVX and SSE. From what I can tell, the code uses aligned stores, _mm256_store_ps (vmovaps), for arrays smaller than MemsetCacheLimit. For larger arrays it uses non-temporal stores, _mm256_stream_ps (vmovntps), which write to memory without going through the cache. Several other factors can affect the results; see the code. You could probably get the same performance in most cases from C/C++ using intrinsic functions.
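To make the store/stream split concrete, here is a rough sketch in C with intrinsics. It is not Agner Fog's code: the function name fill_f32_avx and the 1 MiB value of FILL_CACHE_LIMIT are placeholders I made up, and asmlib's real threshold and dispatch logic are more involved. It also assumes GCC or Clang (for the target attribute) and a CPU with AVX.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold in bytes. asmlib's actual MemsetCacheLimit
   value is not reproduced here; tune this for your own cache sizes. */
#define FILL_CACHE_LIMIT ((size_t)1 << 20)

/* Fill n floats starting at dst with value.
   The target attribute lets GCC/Clang compile this without -mavx. */
__attribute__((target("avx")))
void fill_f32_avx(float *dst, float value, size_t n)
{
    size_t i = 0;

    /* Scalar head: advance until dst + i is 32-byte aligned,
       because both vmovaps and vmovntps require aligned addresses. */
    while (i < n && ((uintptr_t)(dst + i) & 31) != 0)
        dst[i++] = value;

    __m256 v = _mm256_set1_ps(value);

    if ((n - i) * sizeof(float) < FILL_CACHE_LIMIT) {
        /* Smaller fills: ordinary aligned stores (vmovaps). */
        for (; i + 8 <= n; i += 8)
            _mm256_store_ps(dst + i, v);
    } else {
        /* Larger fills: non-temporal stores (vmovntps). */
        for (; i + 8 <= n; i += 8)
            _mm256_stream_ps(dst + i, v);
        /* Order the streaming stores before any later stores. */
        _mm_sfence();
    }

    /* Scalar tail: fewer than 8 floats remain. */
    for (; i < n; i++)
        dst[i] = value;
}
```

Whether the streaming path actually wins depends on the array size relative to your caches and on whether you read the data back soon afterwards, so benchmark before settling on a threshold.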
Note that both the built-in memset function in GCC and the version in glibc were not optimized the last time I checked (which is one reason memset is in asmlib).