Looking at stores, the number you are getting is pretty close to N**3 / 4. We'd expect it to be O(N**3), obviously.
That suggests that 4 float writes are coalesced into one of whatever PAPI_SR_INS is measuring. Put another way, you are counting 16-byte writes (4 floats of 4 bytes each).
Similarly, the number of loads is roughly 3/4 N**3. The dominant term should be the loads from b and c inside the innermost loop, which would be 2 reads per iteration. To be honest, I can't make much sense of the 3/4 factor.
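To make the arithmetic concrete, here is a rough back-of-envelope model. It is only a sketch under assumptions that are not in the question: a naive triple loop that does one float store and two float loads (from b and c) per innermost iteration, 4-byte floats, and every access packed into 16-byte instructions. The function names are made up for illustration.

```python
# Back-of-envelope model of the expected counter values.
# Assumptions (hypothetical, not taken from the original code):
#   - naive N x N x N triple loop
#   - 1 float store and 2 float loads (b and c) per innermost iteration
#   - floats are 4 bytes wide

FLOAT_BYTES = 4


def scalar_counts(n):
    """Expected scalar store and load counts for the assumed triple loop."""
    iterations = n ** 3
    return {"stores": iterations, "loads": 2 * iterations}


def packed_counts(n, width_bytes=16):
    """Counts if every access were packed into width_bytes-wide instructions."""
    lanes = width_bytes // FLOAT_BYTES  # 4 floats fit in one 16-byte access
    return {name: count // lanes for name, count in scalar_counts(n).items()}


def observed_ratio(measured, n):
    """Express a measured counter value as a fraction of N**3."""
    return measured / n ** 3


if __name__ == "__main__":
    n = 512  # hypothetical problem size
    print("scalar:", scalar_counts(n))
    print("packed:", packed_counts(n))
```

Under these assumptions the packed store count comes out at N**3 / 4, which matches the measurement, but the packed load count would be N**3 / 2 rather than 3/4 N**3. So this simple model does not explain the load figure either; only the generated code will.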
If you don't know exactly what you're measuring, and you don't correlate it with the generated code, it's pretty hard to predict the measurement.
EDIT: the numbers appear to correlate with the load and store instructions executed, but not with the number of L1, L2, etc. transactions or misses, so they are unlikely to correlate with actual performance. Isn't the time taken a better number to worry about? Given the complexity of modern CPU architectures, I'd trust measurement over prediction any day.
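As a sketch of what "measure the time" could look like, here is a minimal timing harness. It is written in Python purely for illustration; `matmul` is a hypothetical stand-in for the kernel in the question, and `best_time` is a made-up helper. With the real code you would wrap the actual kernel with whatever monotonic clock your platform offers.

```python
import time


def matmul(a, b, c, n):
    """Naive triple loop, a += b * c (hypothetical stand-in for the real kernel)."""
    for i in range(n):
        for j in range(n):
            total = a[i][j]
            for k in range(n):
                total += b[i][k] * c[k][j]
            a[i][j] = total


def best_time(fn, repeats=5):
    """Call fn several times and return the shortest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    n = 64  # hypothetical problem size, kept small for the pure-Python loop
    a = [[0.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    c = [[2.0] * n for _ in range(n)]
    seconds = best_time(lambda: matmul(a, b, c, n))
    print(f"best of 5 runs: {seconds:.6f} s")
```

Taking the minimum over a few runs is one common way to reduce noise from other activity on the machine; comparing that number across loop orderings or compiler flags tells you directly which variant is faster.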