I can think of two possibilities, both relating to cache.
The first is that the counter increment "warms" some important cache lines. The second is that adding the data structures required by the instrumentation shifts the memory layout, so some heavily used arrays or variables end up in different cache lines than before.
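To make the layout effect concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: the 64-byte line size, the struct names, the `counter` field standing in for instrumentation data, and the premise that each struct starts on a cache-line boundary. It only shows how one extra field can change which line two hot variables land on; it is not what any particular profiler does.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, and each struct starts on a line boundary. */
#define CACHE_LINE 64

/* Layout without instrumentation: hot_a sits at offset 56 and hot_b at
 * offset 64, so the two hot variables straddle a cache-line boundary. */
struct plain {
    char    pad[56];   /* stand-in for other, colder data */
    int64_t hot_a;
    int64_t hot_b;
};

/* Same data with a hypothetical 8-byte profiling counter in front:
 * everything shifts by 8 bytes, so hot_a (offset 64) and hot_b (offset 72)
 * now share one cache line. */
struct instrumented {
    int64_t counter;   /* hypothetical instrumentation field */
    char    pad[56];
    int64_t hot_a;
    int64_t hot_b;
};

/* Index of the cache line that a given byte offset falls into. */
static size_t line_of(size_t offset) {
    return offset / CACHE_LINE;
}
```

Comparing `line_of(offsetof(struct plain, hot_a))` with `line_of(offsetof(struct plain, hot_b))`, and then doing the same for `struct instrumented`, shows the shift. In a real program the effect could just as well go the other way and make things slower.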
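Another factor is that the profiling counter doesn't have to be incremented on every iteration of a for loop -- if there's no 'break' or 'return' in the loop, the compiler is allowed to hoist the increment out of the loop and add the whole iteration count in one step, so the instrumentation can cost much less than you'd expect.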
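As a rough illustration of that transformation, the two functions below leave the counter in the same state. The function names and the explicit `counter` parameter are made up for this sketch; real instrumentation would typically update a compiler-generated counter, and whether a given compiler actually performs this rewrite depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive instrumentation: the (hypothetical) counter is bumped on every
 * iteration of the loop. */
long sum_counted_each_iteration(const int *data, size_t n, uint64_t *counter) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        (*counter)++;
        sum += data[i];
    }
    return sum;
}

/* What the compiler may turn it into: the loop has no early exit, so it
 * always runs exactly n times and the counter can be updated once. */
long sum_counted_once(const int *data, size_t n, uint64_t *counter) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    *counter += n;
    return sum;
}
```

With a 'break' or 'return' inside the loop the trip count is no longer simply n, which is why the rewrite is not as straightforward in that case.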