Question

I have some code that performs the IDCT on the GPU. I've noticed that it seems to be faster to generate the IDCT matrix on the GPU rather than pre-computing the transformation matrix and placing it in constant memory.

The problem is that the code generating the IDCT matrix contains a branch, which does not map well to the GPU.

Are there any alternative ways to generate the IDCT matrix that are faster on the GPU?

// Old way
// local_idct[x][y] = idct[x][y]; // read from precalculated matrix in constant memory
// New way
local_idct[x][y] = cos((x+x+1)*y * (PI/16.0f)) * 0.5f * (y == 0 ? rsqrt(2.0f) : 1);

Solution

Assuming your transform size is small and fixed, you could just use a lookup table for this term, e.g.:

const float y_term[8] = { 1.0f/sqrtf(2.0f), 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f };

local_idct[x][y] = cos((x+x+1)*y * (PI/16.0f)) * 0.5f * y_term[y];

You could also fold in the 0.5 term:

const float y_term[8] = { 0.5f/sqrtf(2.0f), 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f };

local_idct[x][y] = cos((x+x+1)*y * (PI/16.0f)) * y_term[y];
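
For completeness, here is a minimal sketch of how the folded lookup table might be wired into a kernel. This is an assumption on my part: it uses CUDA, an 8x8 transform with one 8x8 thread block, and puts y_term in __constant__ memory; the kernel and buffer names are illustrative, not taken from the original code. The same idea applies in OpenCL with the __constant address space.

// Minimal CUDA sketch (assumptions: 8x8 IDCT, one 8x8 thread block,
// names are illustrative).

#define PI 3.14159265358979323846f

// Scale factors with the 0.5f folded in; element 0 is 0.5f / sqrtf(2.0f).
__constant__ float y_term[8] = {
    0.35355339f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f
};

__global__ void build_idct_matrix(float *out /* 8x8, row-major */)
{
    __shared__ float local_idct[8][8];

    int x = threadIdx.x;  // row index, 0..7
    int y = threadIdx.y;  // column index, 0..7

    // Branch-free: the y == 0 special case lives in the lookup table.
    local_idct[x][y] = cosf((2 * x + 1) * y * (PI / 16.0f)) * y_term[y];

    __syncthreads();
    out[x * 8 + y] = local_idct[x][y];
}

In practice the local_idct matrix would be consumed by the rest of the IDCT kernel; the copy to out is only there to keep the sketch self-contained.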