There could be several things happening here. To be sure, you will have to read the actual assembly code and figure out what it is doing. The compiler is VERY clever when you have it set to a high optimization level. For example, in your first code segment it is quite possible for the compiler to place assembly instructions outside of your
// instruction counting starts here
// instruction counting stops here
comments that do work which logically belongs between them. In your second example that optimization is not possible and all the work has to be done inside the function. Also, do not discount the amount of space the prolog and epilog of a function take. Depending on the instruction set of your processor and its stack and register usage, they can be quite large. For example, on PowerPC compilers often save and restore each register with its own instruction rather than with a single push-many-registers instruction, so every saved register is stored to the stack frame individually when entering a function and loaded back individually when leaving it. When you're dealing with 32 registers, that can be quite a bit of code.
You could try a trick when you have high optimization levels set for your compiler. The compiler is usually reluctant to optimize across "asm" statements, as it does not know what happens in them (this is not a hard guarantee, so still check the generated assembly). What you could do is put some dummy code in the "asm" statements. I personally like creating global symbols that end up in the object file. That way I can take the address of the starting symbol and the ending symbol and calculate the size of the code in between. It looks something like this:
asm(" .globl sizeCalc_start");
asm(" sizeCalc_start: ");
// some code
asm(" .globl sizeCalc_end");
asm(" sizeCalc_end:");
Then you can do something in a function like
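// Declare the labels as char arrays so the subtraction yields a byte count
// (with "extern int" the difference would be in units of sizeof(int)).
extern char sizeCalc_start[];
extern char sizeCalc_end[];
printf("Code Segment Size %ld bytes\r\n", (long)(sizeCalc_end - sizeCalc_start));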
I've done this in the past and it worked. I have not tried to compile this exact code, though, so you may need to mess around with it a bit to get what you want.
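Putting the pieces together, a self-contained version might look like the sketch below. It assumes GCC or Clang (for the `__asm__` and `__attribute__` syntax) and an assembler that accepts `.globl`. The function names `measured_work` and `measured_size` are made up for illustration, and the arithmetic between the labels is only a stand-in for "some code". The `__asm__("...")` suffix on the declarations pins the assembler-level symbol name, which matters on platforms that prefix C symbols with an underscore.

```c
#include <stdint.h>

/* Labels emitted by the asm statements below. Declaring them as char
   arrays makes the address difference come out in bytes. */
extern char sizeCalc_start[] __asm__("sizeCalc_start");
extern char sizeCalc_end[]   __asm__("sizeCalc_end");

/* noinline (a GCC/Clang attribute) keeps the body from being copied into
   callers, which would define the same global labels more than once. */
__attribute__((noinline)) int measured_work(int x)
{
    __asm__ volatile(".globl sizeCalc_start\nsizeCalc_start:");
    x = x * 3 + 1;   /* placeholder for the code being measured */
    __asm__ volatile(".globl sizeCalc_end\nsizeCalc_end:");
    return x;
}

/* Number of bytes the compiler emitted between the two labels. */
long measured_size(void)
{
    return (long)((uintptr_t)sizeCalc_end - (uintptr_t)sizeCalc_start);
}
```

Printing `measured_size()` with `%ld` from any function then gives the byte count. The number depends on the target and the optimization level, and at high optimization levels the compiler may still place some of the work outside the labels, so it is worth comparing the result against the disassembly.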