HLSL Pixel shader lighting performance (XNA)

Question 1

My guess is that changing the loop bound to a compile-time constant allows the HLSL compiler to unroll the loop. That is, instead of this:

for (int i = 0; i < 7; i++)
    doLoopyStuff();

It's getting turned into this:

doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();

Loops and conditional branches can be a significant performance hit in shader code, and are worth avoiding where you can.
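If you want to test the unrolling theory rather than take my word for it, HLSL has loop attributes that let you ask for one behaviour or the other. Here's a minimal sketch; MAX_LIGHTS, LightColors and PS are names I made up for illustration, not anything from your shader:

```hlsl
// Hypothetical names: MAX_LIGHTS, LightColors and PS are for illustration only.
#define MAX_LIGHTS 7

float3 LightColors[MAX_LIGHTS];

float4 PS() : COLOR0
{
    float3 color = 0;

    // The bound is a compile-time constant, so the compiler can flatten
    // the loop; [unroll] asks for that explicitly.
    [unroll]
    for (int i = 0; i < MAX_LIGHTS; i++)
    {
        color += LightColors[i];
    }

    return float4(color, 1);
}
```

Swapping [unroll] for [loop] asks the compiler to keep a real loop instead, so timing both variants should show whether unrolling is what makes the difference. Looking at the compiled disassembly is the way to confirm which form you actually got.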

EDIT

This is just off the top of my head, but maybe you could try something like this?

for (int i = 0; i < MAX_LIGHTS; i++)
{
    // step(y, x) returns 1 when x >= y, so the mask is 1 only while i < activeLights
    color += step(i + 1, activeLights) * lightingFunction();
}

This way you calculate all possible lights, but always get a value of 0 for inactive lights. The benefit would depend on the complexity of the lighting function, of course; you would need to do more profiling.
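To make that a bit more concrete, here's a rough sketch of the masked loop as a whole pixel shader. Everything in it is an assumption on my part: the constant names, the TEXCOORD inputs, and the plain Lambert term standing in for your real lighting function. Note that step(y, x) returns 1 when x >= y, so the mask uses i + 1 to let exactly ActiveLights lights contribute.

```hlsl
#define MAX_LIGHTS 8

// Hypothetical constants, set by the application.
float3 LightPositions[MAX_LIGHTS];
float3 LightColors[MAX_LIGHTS];
float  ActiveLights;   // number of enabled lights, 0..MAX_LIGHTS

// Stand-in lighting function: plain Lambert diffuse.
float3 Diffuse(int i, float3 worldPos, float3 normal)
{
    float3 toLight = normalize(LightPositions[i] - worldPos);
    return LightColors[i] * saturate(dot(normal, toLight));
}

float4 PS(float3 worldPos : TEXCOORD0, float3 normal : TEXCOORD1) : COLOR0
{
    float3 n = normalize(normal);
    float3 color = 0;

    // Constant bound: every light slot is evaluated for every pixel.
    for (int i = 0; i < MAX_LIGHTS; i++)
    {
        // 1 while i < ActiveLights, 0 for the unused slots.
        float mask = step(i + 1, ActiveLights);
        color += mask * Diffuse(i, worldPos, n);
    }

    return float4(color, 1);
}
```

One thing to watch: the unused slots still get evaluated, so they need to hold sane values. Multiplying by 0 won't save you if the lighting function produces a NaN.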

Question 2

Try using PIX to profile it. http://wtomandev.blogspot.com/2010/05/debugging-hlsl-shaders.html

Alternatively, read this rambling speculation:

Maybe it's because, with a constant bound, the compiler can unroll the loop and collapse its instructions. When you replace the constant with a variable, the compiler can no longer make the same assumptions.

Though, somewhat unrelated to your actual question, I would push a lot of those conditions/calculations out to the application (CPU) side.

if(IsDiffuseLightingEnabled || IsSpecularLightingEnabled)

^- Like that.

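Also, I think you could precompute a few things before you call the shader program as well. Take l = (lights[i].Position - IN.WorldPosition) / lights[i].Radius; the subtraction depends on the per-pixel IN.WorldPosition, so it has to stay in the shader, but the per-light parts (such as 1 / lights[i].Radius) could be computed once on the CPU. Pass a precomputed array of those rather than recalculating them for every pixel.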
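As a sketch of what that could look like on the shader side: the only piece of that expression that doesn't depend on the pixel is the radius, so that's what gets hoisted here. LightInvRadius is a hypothetical array the application would fill with 1 / radius, and the falloff formula is just a common placeholder, not necessarily what your shader does.

```hlsl
#define MAX_LIGHTS 8

float3 LightPositions[MAX_LIGHTS];
float  LightInvRadius[MAX_LIGHTS];  // hypothetical: 1 / radius, filled in by the application

// Placeholder falloff: 1 at the light, fading to 0 at its radius.
float Attenuation(int i, float3 worldPos)
{
    // Multiply by the precomputed reciprocal instead of dividing per pixel.
    float3 l = (LightPositions[i] - worldPos) * LightInvRadius[i];
    return saturate(1 - dot(l, l));
}
```

Whether this buys anything measurable is something you'd have to profile; it trades a division per light per pixel for a multiply.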

I might be misinformed about the optimizations the HLSL compiler does, but I think each calculation you do like that in the pixel shader gets executed once per covered pixel, so roughly screen w*h times per frame (though this is done insanely parallel). I also vaguely remember there being limits on the number of instructions you could have in a shader (64 arithmetic plus 32 texture instructions for ps_2_0, if I recall correctly), though I think that restriction was relaxed a lot in later shader models. Maybe your shader generates so many instructions that the program gets broken up and turned into a multi-pass pixel shader on compilation. If that's the case, that probably adds significant overhead.

Actually, here's another idea that might be stupid: passing a variable to a shader means transmitting the data to the GPU, and that transmission happens with limited bandwidth. Perhaps the compiler is smart enough that, when you're only statically indexing the first 7 elements of an array, it only transfers those 7 elements. When the compiler can't make that optimization (because you aren't iterating with constants), it pushes the WHOLE array every frame, and you're flooding the bus. If that's the case, then my earlier suggestion of pushing calculations out and passing more results in would only make the problem worse, heh.