How to best display 30+ bpp graphics on a 24 bpp display?

Question

The best solution I can think of is random dithering whose pattern changes every frame. This combines the advantage of dithering with the absence of a fixed dithering pattern. Since a given pixel changes value many times a second, what you perceive is closer to the average of those values, and that average is closer to the original "deep color" value than any single 24 bpp value.

How it looks

A gradient of green, undithered, dithered (10 frames are shown), then both enhanced in the same way for visibility:

Banded gradient

Dithered gradient

Enhanced banded gradient

Enhanced dithered gradient

The dithering

The dithering is achieved by adding a random value to the gamma-compressed deep color value of each channel, then rounding to the nearest 8-bit value. It would seem natural to use random numbers uniformly distributed between -0.5 and 0.5 (I'm using units where 1 is one step in 8-bit gamma-compressed values, like the difference between 0 and 1 or between 254 and 255). However, this would produce a sort of banding artifact in the noise itself: with uniform noise, a value that falls exactly on an 8-bit level always rounds to that level, while a value halfway between two levels flips between them half the time. So parts of a gradient close to an 8-bit value would have little noise, whereas parts furthest from any 8-bit value would show a lot more. Gaussian noise is much more suitable because it gives a much more even noise level. I chose a sigma of 1.0, but for less noise a sigma of 0.8 might do.

You can create a Gaussian PRNG (this is the Marsaglia polar method) by taking two random numbers, n1 and n2, each scaled to the [-1, 1] range. If they represent a point strictly inside the unit circle, that is, if the sum of their squares, s, is greater than 0 and less than 1 (otherwise start again), return sqrt(-2. * log(s) / s) * n1.

Practical implementation

I chose to implement this by converting a 15-bit-per-channel linear RGB framebuffer into an 8-bit-per-channel sRGB framebuffer. The linear-to-sRGB part is just a detail: I use a lookup table to transform the linear values into gamma-compressed values. I chose to give those intermediate values 13 bits, which you can see as an 8.5 fixed-point notation for sRGB values (8 integer bits, 5 fractional bits).

It should go without saying that you're not going to generate a new random Gaussian number for each pixel; you'll want to precalculate a bunch of them and put them in a circular buffer. I chose to make 16384 of them. Yes, only 16384: I avoid repeating patterns by choosing a random entry point in this buffer and a random length to go through (between 100 and 1123, which is pretty arbitrary), and when I reach the end of that length I choose a new random starting point and a new random length. This way I get fairly random, non-repeating patterns out of a relatively small buffer of numbers. The numbers in the buffer are stored in 2.5 fixed-point format (signed, in units of 1/32), so they all lie between -4.0 and 4.0, which covers the range of Gaussian random numbers I want. Just make sure to add 0.5 to your random numbers: the later right shift truncates, so adding 0.5 beforehand turns that truncation into rounding to the nearest integer.

Here's basically how it works for each pixel and each channel:

Look up the 15-bit linear value in the LUT to get a 13-bit (8.5 fixed-point) gamma-compressed value, ADD the 2.5 fixed-point random number, then SHIFT the sum 5 bits to the right.

You now get an integer value between -4 and 260. You can use if()s to clamp it, but it's much faster to use a 265-element LUT (one entry for each value from -4 to 260) that returns 0 for negative numbers and 255 for numbers above 255. You can index it with negative numbers by allocating the buffer and then doing buffer = &buffer[4], which saves you an addition, I guess. Also, I use the same random number for all three color channels; this avoids chromatic noise, though arguably the result might look somewhat less noisy if the three channels used independent numbers.
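Here is a sketch of the clamping table with the offset-pointer trick, plus the per-channel step. The names are hypothetical. Right-shifting a negative int is implementation-defined in C; mainstream compilers perform an arithmetic shift, which is what maps a sum like -112 (that is, -3.5) to -4 here.

```c
#include <stdint.h>

#define CLAMP_MIN (-4)
#define CLAMP_MAX 260
#define CLAMP_SIZE (CLAMP_MAX - CLAMP_MIN + 1)   /* 265 entries */

static uint8_t clamp_storage[CLAMP_SIZE];
/* Points at the entry for index 0, so indices -4..260 are all valid. */
static const uint8_t *clamp_lut = &clamp_storage[-CLAMP_MIN];

static void build_clamp_lut(void) {
    for (int i = CLAMP_MIN; i <= CLAMP_MAX; i++)
        clamp_storage[i - CLAMP_MIN] = (uint8_t)(i < 0 ? 0 : i > 255 ? 255 : i);
}

/* One channel: 8.5 fixed-point sRGB value plus 2.5 fixed-point dither
   (with the +0.5 offset already folded in), shifted down and clamped. */
static uint8_t dither_channel(int srgb_8_5, int dither_2_5) {
    return clamp_lut[(srgb_8_5 + dither_2_5) >> 5];
}
```

For example, 100.3125 in 8.5 fixed point is 3210; with zero noise the stored dither value is just the 0.5 offset (16), and (3210 + 16) >> 5 gives 100.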

For a single pixel's red channel, my code looks like this:

sfb[i].r = bytecheck_l.lutb[(lsrgb_l.lutint[fb[i].r] + dither_l.lutint[id]) >> 5];

Here sfb is the 24 bpp sRGB buffer, fb is the 45 bpp linear RGB buffer, lsrgb_l.lutint[] is the linear-to-gamma-compressed LUT, dither_l.lutint[] is the LUT containing the random Gaussian numbers in 2.5 fixed-point format, and bytecheck_l.lutb[] returns values clamped to [0, 255].
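Putting the pieces together, a whole frame might be converted as in the sketch below. The pixel structs, the index callback and the sequential stand-in walker (seq_walk, seq_next) are assumptions for illustration; in a real renderer you would probably inline the random-run walk rather than call through a function pointer for every pixel. As in the text, one dither value is shared by the three channels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: 15 bits per channel stored in uint16_t, 8-bit sRGB. */
typedef struct { uint16_t r, g, b; } lin_px;
typedef struct { uint8_t r, g, b; } srgb_px;

/* Stand-in index source: walks the dither table circularly. */
typedef struct { size_t pos, n; } seq_walk;
static int seq_next(void *ctx) {
    seq_walk *w = ctx;
    int idx = (int)w->pos;
    w->pos = (w->pos + 1) % w->n;
    return idx;
}

/* lsrgb:  15-bit linear -> 8.5 fixed-point sRGB
   dither: 2.5 fixed-point Gaussian numbers with +0.5 folded in
   clamp:  offset pointer valid for indices -4..260 */
static void convert_frame(srgb_px *sfb, const lin_px *fb, size_t n,
                          const uint16_t *lsrgb, const int32_t *dither,
                          const uint8_t *clamp,
                          int (*next_index)(void *), void *ctx) {
    for (size_t i = 0; i < n; i++) {
        int d = dither[next_index(ctx)];   /* same number for all channels */
        sfb[i].r = clamp[(lsrgb[fb[i].r] + d) >> 5];
        sfb[i].g = clamp[(lsrgb[fb[i].g] + d) >> 5];
        sfb[i].b = clamp[(lsrgb[fb[i].b] + d) >> 5];
    }
}
```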

Performance

I get over 50 FPS in a 1400x820 SDL window with my test gradient, using just one core of a 2.4 GHz Core 2 Quad Q6600 with dual-channel 800 MHz DDR2 memory. That is a somewhat mediocre machine by current standards, so this approach should be fast enough on modern computers.

Please let me know if any of my explanations need clarification.