Ok, as usual, "the owls are not what they seem". Reasoning about code performance by inspecting the Java code quickly gets weird. Reasoning by looking into the bytecode feels the same. Generated code disassembly should shed more light on this, even though there are minor cases where the assembly is too high-level to explain the phenomenon.
That is because platforms heavily optimize the code at every level. Here is a hint about where to look. I ran your benchmark on an i5 2.0 GHz, Linux x86_64, JDK 7u40.
Baseline:
Benchmark                                  Mode   Thr  Count  Sec       Mean  Mean error   Units
j.b.s.StringBandBenchmark.stringBand2      thrpt    2     20    1  25800.465     297.737  ops/ms
j.b.s.StringBandBenchmark.stringBuilder2   thrpt    2     20    1  55552.936     876.021  ops/ms
Yeah, surprising. Now, watch this. Nothing up my sleeves, except for...
-XX:-OptimizeStringConcat:
Benchmark                                  Mode   Thr  Count  Sec       Mean  Mean error   Units
j.b.s.StringBandBenchmark.stringBand2      thrpt    2     20    1  25727.363     207.979  ops/ms
j.b.s.StringBandBenchmark.stringBuilder2   thrpt    2     20    1  17233.953     219.510  ops/ms
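When comparing runs like the two above, it can help to confirm that the flag actually took effect in the JVM under test. The following is a minimal sketch, assuming a HotSpot JVM that exposes the flag through its diagnostic MXBean. The class and method names are hypothetical and not part of the original benchmark.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the benchmark from the question.
final class StringConcatFlagCheck {

    // Returns "true" or "false" as reported by the running JVM, or
    // "unavailable" if this JVM does not expose the flag (assumption:
    // non-HotSpot VMs or VMs without the C2 compiler may not have it).
    static String optimizeStringConcatValue() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unavailable";
            }
            return bean.getVMOption("OptimizeStringConcat").getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the MXBean or the named VM option does not exist.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("OptimizeStringConcat = " + optimizeStringConcatValue());
    }
}
```

Running it once with no extra options and once with -XX:-OptimizeStringConcat should show whether the setting changed in that JVM.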
Forbidding the VM from applying string optimizations yields the "expected" result, as laid out in the original analysis. HotSpot is known to have optimizations around StringBuilders, effectively recognizing the usual idioms like new StringBuilder().append(...).append(...).toString() and producing more efficient code for the statement.
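For reference, here is a small sketch of the idiom in question next to plain concatenation. It is not the benchmark code from the question; the class and method names are hypothetical and the two-argument shape is an assumption made for illustration.

```java
// Hypothetical illustration, not the benchmark from the question.
final class ConcatIdioms {

    // Plain concatenation. On JDK 7 (as used for the runs above), javac
    // compiles this into a StringBuilder append chain in the bytecode.
    static String withPlus(String a, String b) {
        return a + b;
    }

    // The explicit single-expression chain: the shape of the idiom
    // new StringBuilder().append(...).append(...).toString().
    static String withBuilderChain(String a, String b) {
        return new StringBuilder().append(a).append(b).toString();
    }

    public static void main(String[] args) {
        // Both forms produce the same string; only the generated code
        // and its performance may differ.
        System.out.println(withPlus("foo", "bar"));
        System.out.println(withBuilderChain("foo", "bar"));
    }
}
```

Both methods return the same value, so any throughput difference between such forms comes from how the VM compiles them, not from what they compute.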
Disassembling and figuring out what exactly happened with the string optimization applied is left as an exercise for the interested reader :)