Question

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.

In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this...)

I ran gprof on my executable, and I end up with the following interesting lines:

  %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
43.51      9.50     9.50                             vSwap
34.86     17.11     7.61   179944     0.04     0.04  grInterpolateHLine
13.99     20.17     3.06                             grClearDepthBuffer
<snip>
0.76      21.78     0.17      624     0.27    12.46  grScanlineFill

The function vSwap is my double-buffer swapping function, and it also performs vsyncing, so it makes sense to me that the test program spends much of its time waiting in there. grScanlineFill is my triangle-drawing function, which builds an edge list and then calls grInterpolateHLine to actually fill in the triangle.

My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsync overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    /* Walk the scanline, interpolating depth by a constant step per pixel. */
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {    /* depth test */
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;     /* record new nearest depth */
        }
    }
}

I really don't see how I can improve that, especially considering that vSetPixel is a macro.

My entire stock of ideas for optimization has been whittled down to precisely one:

  1. Use an integer/fixed-point depth buffer.

The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
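To make the idea concrete, I imagine a 16.16 version would look roughly like this -- everything below is hypothetical, since this is exactly the code I don't have yet:

#include <stdint.h>

typedef int32_t fixed;                        /* 16.16 fixed point */
#define FLOAT_TO_FIXED(f) ((fixed)((f) * 65536.0f))

/* Hypothetical fixed-point depth buffer, same layout as grDepthBuffer
   (assumes a VIDEO_HEIGHT to match VIDEO_WIDTH). */
fixed grFixedDepthBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];

void grInterpolateHLineFixed(int x1, int x2, int y, fixed z, fixed zstep, int colour) {
    /* The per-pixel interpolation itself is just an integer add... */
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grFixedDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grFixedDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}

/* ...the fiddly part is computing zstep once per span without losing
   precision, e.g. FLOAT_TO_FIXED((z2 - z1) / (float)(x2 - x1)). */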

Solution

You should have a look at the source code to something like Quake, considering what it achieved on a Pentium 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.

OTHER TIPS

It's hard to tell what higher-order optimizations can be done without seeing the rest of the code, but I have a couple of minor observations.

There's no need to calculate x1 + y * VIDEO_WIDTH on every iteration of grInterpolateHLine; compute it once and step it along with x1:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);   /* computed once, stepped per pixel */
    for(; x1 <= x2; x1 ++, z += zstep, offset++) {
        if(z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}

Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to reuse the same offset there as well, and then you only need to increment offset rather than x1 on each loop iteration. Chances are this extends back to the function that calls grInterpolateHLine, so you would only need to do the multiplication once per triangle.
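For example, if vSetPixel ultimately writes into a flat framebuffer with the same row layout -- a guess on my part; call it grFrameBuffer -- the whole loop can run off a single index:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    int end    = x2 + (y * VIDEO_WIDTH);
    for(; offset <= end; offset++, z += zstep) {
        if(z < grDepthBuffer[offset]) {
            grFrameBuffer[offset] = colour;   /* hypothetical flat framebuffer */
            grDepthBuffer[offset] = z;
        }
    }
}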

There are some other things you can do with the depth buffer. Most of the time, if the first pixel of a line passes or fails the depth test, the rest of the line will have the same result. So after the first test you can use a tighter block of code (even assembly) to test an entire run in one shot, and when it passes you can use an efficient block memory setter to fill the pixel and depth values instead of writing them one at a time. You only need to test and set per pixel when a line is partially occluded.
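A sketch of that idea in plain C, where vFillPixels stands in for a hypothetical block setter that fills a horizontal run of pixels in one call:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    while(x1 <= x2) {
        if(z < grDepthBuffer[offset]) {
            /* Visible run: write depths in a tight loop, batch the colour. */
            int start = x1;
            do {
                grDepthBuffer[offset++] = z;
                z += zstep;
                x1++;
            } while(x1 <= x2 && z < grDepthBuffer[offset]);
            vFillPixels(start, x1 - 1, y, colour);   /* hypothetical block setter */
        } else {
            /* Occluded run: just advance until the test flips. */
            do {
                offset++;
                z += zstep;
                x1++;
            } while(x1 <= x2 && z >= grDepthBuffer[offset]);
        }
    }
}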

Also, I'm not sure what you mean by an older computer, but if your target machine is multi-core, you can split the rasterization work across cores, with each core handling its own range of scanlines. The same applies to the buffer-clearing function. It can help quite a bit.
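For instance, the depth-buffer clear parallelizes trivially. A sketch using OpenMP (compile with -fopenmp; VIDEO_HEIGHT is assumed to exist alongside VIDEO_WIDTH):

#include <float.h>

void grClearDepthBuffer(void) {
    /* Each thread clears its own contiguous slice of the buffer. */
    #pragma omp parallel for
    for(int i = 0; i < VIDEO_WIDTH * VIDEO_HEIGHT; i++)
        grDepthBuffer[i] = FLT_MAX;
}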

I ended up solving this by replacing the Z-buffer with the Painter's Algorithm. Before switching, I used SSE to write a Z-buffer implementation that created a bitmask with the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.
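At its core, the Painter's Algorithm just replaces the per-pixel depth test with a back-to-front sort before drawing. A sketch of the idea, with the Triangle struct and the grScanlineFill signature as simplified stand-ins for my real ones:

#include <stdlib.h>

typedef struct {
    float avg_z;    /* average depth of the three vertices */
    /* ... vertex and colour data ... */
} Triangle;

static int cmpFarthestFirst(const void *a, const void *b) {
    float za = ((const Triangle *)a)->avg_z;
    float zb = ((const Triangle *)b)->avg_z;
    return (za < zb) - (za > zb);   /* larger z (farther) sorts first */
}

void grDrawScene(Triangle *tris, int count) {
    qsort(tris, count, sizeof(Triangle), cmpFarthestFirst);
    for(int i = 0; i < count; i++)
        grScanlineFill(&tris[i]);   /* nearer triangles simply overdraw */
}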

Thank you, everyone, for your input.

Licensed under: CC-BY-SA with attribution