Question

I have a simple enough shader that supports multiple point lights.
Lights are stored as an array of Light structs (up to a max size) and I pass in the number of active lights when it changes.
The problem is in the PixelShader function:
It's basic stuff, get the base color from the texture, loop through the lights array for 0 to numActiveLights and add the effect, and it works fine, but performance is terrible!
BUT if I replace the reference to the global var numActiveLights with a constant of the same value performance is fine.
I just can't fathom why referencing the variable makes a 30+ fps difference.

Can anyone please explain?

Full Shader code:

#define MAX_POINT_LIGHTS 16

struct PointLight
{
    float3      Position;
    float4      Color;
    float       Radius;
};

float4x4    World;
float4x4    View;
float4x4    Projection;
float3  CameraPosition;

float4  SpecularColor;
float   SpecularPower;
float   SpecularIntensity;
float4      AmbientColor;
float   AmbientIntensity;
float   DiffuseIntensity;   

int     activeLights;
PointLight  lights[MAX_POINT_LIGHTS];

bool    IsLightingEnabled;
bool    IsAmbientLightingEnabled;
bool    IsDiffuseLightingEnabled;
bool    IsSpecularLightingEnabled;


Texture Texture;
sampler TextureSampler = sampler_state
{
    Texture = <Texture>;

    Magfilter = POINT;
    Minfilter = POINT;
    Mipfilter = POINT;

    AddressU = WRAP;
    AddressV = WRAP;
};

struct VS_INPUT
{
    float4 Position : POSITION0;
    float2 TexCoord : TEXCOORD0;
    float3 Normal : NORMAL0;
};

struct VS_OUTPUT
{
    float3 WorldPosition : TEXCOORD0;
    float4 Position : POSITION0;
    float3 Normal : TEXCOORD1;
    float2 TexCoord : TEXCOORD2;
    float3 ViewDir : TEXCOORD3;

};

VS_OUTPUT VS_PointLighting(VS_INPUT input)
{
    VS_OUTPUT output;

    float4 worldPosition = mul(input.Position, World);
    output.WorldPosition = worldPosition;

    float4 viewPosition = mul(worldPosition, View);
    output.Position = mul(viewPosition, Projection);

    output.Normal = normalize(mul(input.Normal, World));
    output.TexCoord = input.TexCoord;
    output.ViewDir = normalize(CameraPosition -  worldPosition);

    return output;
}

float4 PS_PointLighting(VS_OUTPUT IN) : COLOR
{
    if(!IsLightingEnabled) return tex2D(TextureSampler,IN.TexCoord);

    float4 color = float4(0.0f, 0.0f, 0.0f, 0.0f);

    float3 n = normalize(IN.Normal);
    float3 v = normalize(IN.ViewDir);
    float3 l = float3(0.0f, 0.0f, 0.0f);
    float3 h = float3(0.0f, 0.0f, 0.0f);

    float atten = 0.0f;
    float nDotL = 0.0f;
    float power = 0.0f;

    if(IsAmbientLightingEnabled) color += (AmbientColor*AmbientIntensity);

    if(IsDiffuseLightingEnabled || IsSpecularLightingEnabled)
    {
        //for (int i = 0; i < activeLights; ++i)//works but perfoemnce is terrible
        for (int i = 0; i < 7; ++i)//performance is fine but obviously isn't dynamic
        {
            l = (lights[i].Position - IN.WorldPosition) / lights[i].Radius;
            atten = saturate(1.0f - dot(l, l));

            l = normalize(l);

            nDotL = saturate(dot(n, l));

            if(IsDiffuseLightingEnabled) color += (lights[i].Color * nDotL * atten);
            if(IsSpecularLightingEnabled) color += (SpecularColor * SpecularPower * atten);
        }
    }

    return color * tex2D(TextureSampler, IN.TexCoord);
}

technique PerPixelPointLighting
{
    pass
    {
        VertexShader = compile vs_3_0 VS_PointLighting();
        PixelShader = compile ps_3_0 PS_PointLighting();
    }
}
Was it helpful?

Solution

My guess is that changing the loop constraint to be a compile-time constant is allowing the HLSL compiler to unroll the loop. That is, instead of this:

for (int i = 0; i < 7; i++)
    doLoopyStuff();

It's getting turned into this:

doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();
doLoopyStuff();

Loops and conditional branches can be a significant performance hit inside of shader code, and should be avoided wherever possible.

EDIT

This is just off the top of my head, but maybe you could try something like this?

for (int i = 0; i < MAX_LIGHTS; i++)
{
    color += step(i, activeLights) * lightingFunction();
}

This way you calculate all possible lights, but always get a value of 0 for inactive lights. The benefit would depend on the complexity of the lighting function, of course; you would need to do more profiling.

OTHER TIPS

Try using PIX to profile it. http://wtomandev.blogspot.com/2010/05/debugging-hlsl-shaders.html

Alternatively, read this rambling speculation:

Maybe because with a constant, the compiler can unravel and collapse your loop's instructions. When you replace it with a variable, the compiler becomes unable to make the same assumptions.

Though, somewhat unrelated to your actual question, I would push a lot of those conditions /calculations to the software level.

if(IsDiffuseLightingEnabled || IsSpecularLightingEnabled)

^- Like that.

Also, I think you could precompute a few thing before you call the shader program as well. Like l = (lights[i].Position - IN.WorldPosition) / lights[i].Radius; Pass a precomputed array of those rather than calculating each time over every pixel.

I might be misinformed of the optimizations that the HLSL compiler does, but I think each calculation you do like that on the pixel shader gets executed screen w*h times (though this is done insanely parallel), and I vaguely remember there being some limits to the number of instructions you could have in a shader (like 72?). (though I think that restriction was liberalized a lot in higher versions of HLSL). Maybe the fact that your shader generates so many instructions -- maybe it breaks your program up and turns it into a multi-pass pixel shader on compilation. If that's the case, that probably adds significant overhead.

Actually, here's another idea that might be stupid: Passing a variable to a shader has it transmit the data to the GPU. That transmission happens with limited bandwidth. Perhaps the compiler is smart enough such that when you're only staticly indexing the first 7 elements in an array, only transfer 7 elements. When the compiler doesn't make that optimization (because you aren't iterating with constants), it pushes the WHOLE array every frame, and you're flooding the bus. If that's the case, then my earlier suggestion of pushing calculations out, and passing more results in, would only make the problem worse, heh.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top