I noticed only one significant difference between the assembly code generated with and without PGO. Without PGO, the sum variable is spilled from a register to memory once per inner loop iteration. Writing the variable to memory and loading it back could in theory slow things down significantly. Fortunately, modern processors optimize this with store-to-load forwarding, so the slowdown is not that big. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations such as floating-point multiplication.
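To make the spill concrete, here is a minimal C sketch. It is not the code from the question: the function names and the dot-product kernel are hypothetical, and `volatile` is only a stand-in that forces the accumulator through memory on every iteration, which roughly imitates what a spilled `sum` does.

```c
#include <stddef.h>

/* Hypothetical kernel: the accumulator is an ordinary local, so the
 * compiler is free to keep it in a floating-point register for the
 * whole loop. */
double dot_in_register(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Same kernel, but `volatile` (an assumption used for illustration)
 * forces a load and a store of the accumulator on every iteration,
 * similar to a register spill: the multiply-add result is written to
 * the stack and read back before the next addition. */
double dot_spilled(const double *a, const double *b, size_t n)
{
    volatile double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum = sum + a[i] * b[i];
    return sum;
}
```

Both functions return the same value; only the number of memory accesses per iteration differs. Comparing the output of `gcc -O2 -S` for the two should show the extra store and reload in the second loop.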
What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information necessary for proper optimization from the single source file...
These unnecessary load/store operations explain not only why the PGO code is faster, but also why PGO increases the percentage of cache misses. Without PGO, the register is always spilled to the same location in memory, so the extra memory accesses increase both the total number of memory accesses and the number of cache hits, while the number of cache misses stays the same. With PGO we have fewer memory accesses but the same number of cache misses, so the miss percentage goes up.
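The arithmetic behind this can be sketched in a few lines of C. The helper name `miss_ratio` and all the counts below are made up for illustration; they are not measurements from the question.

```c
/* Hypothetical helper: fraction of memory accesses that miss the cache.
 * Returns 0.0 when there were no accesses at all. */
double miss_ratio(unsigned long long misses, unsigned long long accesses)
{
    return accesses ? (double)misses / (double)accesses : 0.0;
}
```

Suppose the loop makes 1000 "real" accesses, 100 of which miss. If the spill adds one store and one load per iteration that always hit, the non-PGO build sees 3000 accesses, so `miss_ratio(100, 3000)` is about 3.3%. The PGO build sees only the 1000 real accesses, so `miss_ratio(100, 1000)` is 10%, even though the absolute number of misses is identical.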