Question

I'm working on an OpenGL implementation of the Oculus Rift distortion shader. The shader works by taking the input texture coordinate (of a texture containing a previously rendered scene), transforming it using distortion coefficients, and then using the transformed coordinate to determine the fragment color.

I'd hoped to improve performance by pre-computing the distortion and storing it in a second texture, but the result is actually slower than the direct computation.

The direct calculation version looks basically like this:

float distortionFactor(vec2 point) {
    // lengthSquared() is a helper equivalent to dot(point, point)
    float rSq = lengthSquared(point);
    // Polynomial in the squared radius, with coefficients from the uniform K
    float factor = K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq;
    return factor;
}

void main()
{
    // Scale the lens-centered coordinate by the distortion factor,
    // then map it back into texture space
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 screenCentered = lensToScreen(distorted);
    vec2 texCoord = screenToTexture(screenCentered);
    // Paint anything that falls outside the source texture dark red
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        vFragColor = vec4(0.5, 0.0, 0.0, 1.0);
        return;
    }
    vFragColor = texture(Scene, texCoord);
}

where K is a vec4 that's passed in as a uniform.
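For reference, a minimal sketch of the declarations the shader above assumes (the names are the ones used in the code; the qualifiers and helper bodies are my guesses, and the two mapping helpers are only prototyped here):

in vec2 vRiftTexCoord;        // lens-centered input coordinate
out vec4 vFragColor;

uniform sampler2D Scene;      // the previously rendered scene
uniform vec4 K;               // distortion polynomial coefficients

const vec2 ZERO = vec2(0.0);
const vec2 ONE = vec2(1.0);

// Squared length, avoiding the sqrt inside length()
float lengthSquared(vec2 v) {
    return dot(v, v);
}

// Aspect-ratio and lens-offset math, defined elsewhere
vec2 lensToScreen(vec2 v);
vec2 screenToTexture(vec2 v);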

On the other hand, the displacement map lookup looks like this:

void main() {
    vec2 texCoord = vTexCoord;
    // The same offset map serves both eyes; mirror the lookup for one of them
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    // Dependent lookup: the scene coordinate comes from a texture fetch
    texCoord = texture(OffsetMap, texCoord).rg;
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        discard;
    }
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    FragColor = texture(Scene, texCoord);
}

There are a couple of other operations for correcting the aspect ratio and accounting for the lens offset, but they're pretty simple. Is it really reasonable to expect the direct computation to outperform a simple texture lookup?
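For completeness, the offset map can be baked in a one-time pass along these lines (a sketch, not the exact code: it reuses the helpers from the direct version and assumes an RG float render target):

void main() {
    // Run the direct computation once per texel and store the resulting
    // scene coordinate in the red/green channels of the offset map
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 texCoord = screenToTexture(lensToScreen(distorted));
    vFragColor = vec4(texCoord, 0.0, 1.0);
}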


Solution

GDDR memory has pretty high latency, and modern GPU architectures have plenty of number-crunching capability. It used to be the other way around: GPUs were so ill-equipped to do calculations that normalizing a vector was cheaper to do by fetching from a cube map.

Throw in the fact that you are not doing a regular texture lookup here, but rather a dependent lookup, and it comes as no surprise. Since the location you are fetching from depends on the result of another fetch, it is impossible to pre-fetch or efficiently cache the memory needed by your shader (an effective latency-hiding strategy). That is no "simple texture lookup."
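The contrast, in shader terms (illustrative only; the variable names are mine):

// Regular lookup: the coordinate is an interpolated input known before
// the shader runs, so the fetch can be issued early and its latency hidden
vec4 direct = texture(Scene, vTexCoord);

// Dependent lookup: the coordinate is itself the result of a fetch, so the
// second fetch cannot even be issued until the first one returns
vec2 fetched = texture(OffsetMap, vTexCoord).rg;
vec4 dependent = texture(Scene, fetched);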

What is more, in addition to doing a dependent texture lookup, your second shader also uses the discard keyword. On a lot of hardware this effectively eliminates the possibility of early depth testing.
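If discard is only there to mask out-of-range coordinates, writing a border color instead (as your direct version already does with its red debug color) sidesteps that; a sketch, assuming a solid border is acceptable:

if (!all(equal(texCoord, clamped))) {
    FragColor = vec4(0.0, 0.0, 0.0, 1.0);  // solid border instead of discard
    return;
}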

Honestly, I do not see why you would want to "optimize" the distortionFactor(...) function into a lookup. It works on the squared length, so you are not even paying for a sqrt, just a handful of multiplications and additions.

OTHER TIPS

Andon M. Coleman already explained what's going on. Essentially, memory bandwidth and, more importantly, memory latency are the main bottlenecks of modern GPUs; hence, on everything built from about 2007 to today, simple calculations are often way faster than a texture lookup.

In fact, memory access patterns have such a large impact on efficiency that slightly rearranging the access pattern and ensuring proper alignment can easily give performance boosts of a factor of 1000 (been there, done that; although that was CUDA programming). A dependent lookup is not necessarily a performance killer, though: if the dependent texture coordinate is monotonic with respect to the controlling texture coordinate, it is usually not so bad.


That being said, have you never heard of Horner's method? You can rewrite

float factor = K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq;

trivially to

float factor = K[0] + rSq * (K[1] + rSq * (K[2] + rSq * K[3]));

This saves you a couple of operations: three multiplications instead of six, and the nested form maps straight onto fused multiply-add instructions.
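On GLSL 4.00 and later you can even spell the fused multiply-adds out explicitly, although most compilers will generate them from the nested form anyway:

float factor = fma(rSq, fma(rSq, fma(rSq, K[3], K[2]), K[1]), K[0]);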

The GPU is massively parallel and can compute thousands of results per clock cycle, while memory reads are sequential. If the multiplications take, for example, 5 clocks, the GPU can calculate 1000 results in 5 clock cycles. If the data instead has to be read sequentially at, say, 10 values per clock cycle, it would take 100 clock cycles instead of 5 just to acquire the data. The numbers are made up, just to illustrate the point :)
