Quoting the authoritative answer provided by Vladimir Kozlov on the hotspot-compiler-dev mailing list:
Hi Marko,
For primitive arrays we use handwritten assembler code which use XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not common case. We can improve it similar to what we did for arraycopy but we decided leave it for now.
Regards,
Vladimir
I also asked why the optimized code is not inlined, and got an answer to that as well:
The code is not small, so we decided to not inline it. Look on MacroAssembler::generate_fill() in macroAssembler_x86.cpp:
My original answer:
I missed an important bit in the machine code, apparently because I was looking at the On-Stack Replacement version of the compiled method instead of the one used for subsequent calls. It turns out that HotSpot was able to prove that my loop amounts to what a call to Arrays.fill would have done and replaced the entire loop with a call instruction to such code. I can't see that function's code, but it probably uses every possible trick, such as MMX instructions, to fill a block of memory with the same 32-bit value.
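For reference, here is a minimal sketch of the kind of loop in question next to the equivalent Arrays.fill call. The class name, method names, array size and fill value are illustrative assumptions, not the code from the original benchmark, and nothing in the sketch guarantees that a particular JVM will actually perform the replacement.

```java
import java.util.Arrays;

// Hypothetical names; not the original benchmark code.
final class FillLoopSketch {

    // A plain counted loop that stores the same value into every slot.
    // This is the loop shape the answer describes as being replaced
    // by a call to a fill routine.
    static int[] loopFill(int size, int value) {
        int[] a = new int[size];
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
        return a;
    }

    // The library call that the loop above is equivalent to.
    static int[] libraryFill(int size, int value) {
        int[] a = new int[size];
        Arrays.fill(a, value);
        return a;
    }

    public static void main(String[] args) {
        // Both variants produce the same array contents.
        System.out.println(Arrays.equals(loopFill(1024, 42), libraryFill(1024, 42)));
    }
}
```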
This gave me the idea to measure the actual Arrays.fill calls. I got another surprise:
Benchmark                Mode  Thr  Cnt  Sec     Mean  Mean error    Units
fillPrimitiveArray       avgt    1    5    2  155.343       1.318  nsec/op
fillReferenceArray       avgt    1    5    2  682.975      17.990  nsec/op
loopFillPrimitiveArray   avgt    1    5    2  156.114       0.523  nsec/op
loopFillReferenceArray   avgt    1    5    2  682.209       7.047  nsec/op
The results with a loop and with a call to fill are identical to within the measurement error. If anything, this is even more confusing than the results that motivated the question. I would have at least expected fill to benefit from the same optimization ideas regardless of value type.
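To make the four measured variants concrete, here is a rough sketch of what they might look like, with a crude System.nanoTime timer. The method names are taken from the benchmark rows above; everything else (the array length, the fill values, the iteration count and the timing helper) is an assumption. The numbers above come from a proper benchmark harness, so a hand-rolled timer like this one only gives a coarse impression and should not be expected to reproduce them.

```java
import java.util.Arrays;

// Hypothetical reconstruction; only the four method names come from the results table.
final class FillTimingSketch {
    static final int SIZE = 1024; // assumed array length; the source does not state it

    static void fillPrimitiveArray(int[] a) {
        Arrays.fill(a, 1);
    }

    static void fillReferenceArray(Object[] a, Object value) {
        Arrays.fill(a, value);
    }

    static void loopFillPrimitiveArray(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = 1;
        }
    }

    static void loopFillReferenceArray(Object[] a, Object value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    // Crude average time per call in nanoseconds: one warm-up pass so the JIT
    // has a chance to compile the body, then one timed pass.
    static double averageNanos(Runnable body, int iterations) {
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        int[] ints = new int[SIZE];
        Object[] refs = new Object[SIZE];
        Object value = new Object();
        int n = 200_000; // assumed iteration count

        System.out.printf("fillPrimitiveArray     %.1f ns/op%n",
                averageNanos(() -> fillPrimitiveArray(ints), n));
        System.out.printf("fillReferenceArray     %.1f ns/op%n",
                averageNanos(() -> fillReferenceArray(refs, value), n));
        System.out.printf("loopFillPrimitiveArray %.1f ns/op%n",
                averageNanos(() -> loopFillPrimitiveArray(ints), n));
        System.out.printf("loopFillReferenceArray %.1f ns/op%n",
                averageNanos(() -> loopFillReferenceArray(refs, value), n));
    }
}
```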