Question

I’m trying to pass an array of ints to the fragment shader via a uniform block (everything follows GLSL “#version 330”) on a GeForce 8600 GT.

On the side of the app I have:

int MyArray[7102];
…
//filling, binding, etc
…
glBufferData(GL_UNIFORM_BUFFER, sizeof(MyArray), MyArray, GL_DYNAMIC_DRAW);

In my fragment shader I declare the corresponding block as follows:

layout (std140) uniform myblock
{
    int myarray[7102];
};

The problem is that after a successful glCompileShader, glLinkProgram returns an error saying that it can’t bind an appropriate storage resource.

A few additional facts:

1) GL_MAX_UNIFORM_BLOCK_SIZE returned value 65536

2) If I lower the number of elements to 4096, it works fine, and it makes no difference whether I use “int” or “ivec4” as the array type. Anything above 4096 gives me the same “storage error”.

3) If I use “shared” or “packed”, everything works as expected.

After consulting the GLSL 3.3 specification for std140, I assume there is a problem with alignment/padding according to:

“1) If the member is a scalar consuming N basic machine units, the base alignment is N.

...

4) If the member is an array of scalars or vectors, the base alignment and array stride are set to match the base alignment of a single array element, according to rules (1), (2), and (3), and rounded up to the base alignment of a vec4. The array may have padding at the end; the base offset of the member following the array is rounded up to the next multiple of the base alignment.”

My questions:

1) Is it true that “myblock” occupies 4 times more memory than just 7102*4 = 28408 bytes? I.e., std140 expands each member of myarray to a vec4, and the real memory usage is 7102*4*4 = 113632 bytes, which is the cause of the problem?

2) Is the reason it works with “shared” or “packed” that these gaps are eliminated through optimization?

3) Maybe it’s a driver bug? All facts point to “…and rounded up to the base alignment of a vec4” being the reason, but it’s quite hard to accept that something as simple as an array of ints ends up being 4 times less efficient in terms of memory constraints.

4) If it’s not a bug, then how should I organize and access an array in the case of std140? I can use “ivec4” for optimal data distribution, but then instead of a simple x = myarray[i] I have to sacrifice performance doing something like x = myarray[i/4][i%4] to refer to the individual elements of each ivec4. Or am I missing something, and there is an obvious solution?


Solution

1) (…) rounded up to the base alignment of a vec4? (…)

Yes. Under std140 the array stride of a scalar array is rounded up to the base alignment of a vec4, so each int occupies 16 bytes and the block requires 7102 * 16 = 113632 bytes, well over your GL_MAX_UNIFORM_BLOCK_SIZE of 65536. That also explains the boundary you observed: 4096 * 16 = 65536 exactly, so 4096 elements is the largest array that still fits.
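Note that this also means the glBufferData call from the question under-allocates even when the block is small enough to link: sizeof(MyArray) covers a tight 4-byte stride, but std140 expects 16 bytes per element. A minimal sketch of a std140-conformant upload, reusing MyArray from the question (the Std140Int helper type is mine, not part of any API):

/* 16 bytes per element, matching the std140 array stride */
typedef struct { GLint value; GLint pad[3]; } Std140Int;
static Std140Int padded[7102];

for (int i = 0; i < 7102; ++i)
    padded[i].value = MyArray[i];   /* only .value is read by the shader */

glBufferData(GL_UNIFORM_BUFFER, sizeof(padded), padded, GL_DYNAMIC_DRAW);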

2) Is the reason it works with “shared” or “packed” that these gaps are eliminated through optimization?

Yes, though this is not a performance optimization. With “shared” or “packed” the layout is implementation-defined, and the implementation is free to pack the ints tightly; at a 4-byte stride the whole array needs only 7102 * 4 = 28408 bytes, which is evidently what happens here, since the link succeeds.
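The flip side is that with “shared” or “packed” you cannot assume any particular layout when filling the buffer; you have to query it. A minimal sketch, assuming prog is the linked program object:

const GLchar *name = "myarray";
GLuint index;
GLint offset = 0, stride = 0;

glGetUniformIndices(prog, 1, &name, &index);
glGetActiveUniformsiv(prog, 1, &index, GL_UNIFORM_OFFSET, &offset);
glGetActiveUniformsiv(prog, 1, &index, GL_UNIFORM_ARRAY_STRIDE, &stride);

/* element i of the array lives at byte offset: offset + i * stride */

(The same queries against a std140 program that links, e.g. the 4096-element version, report the 16-byte stride discussed above.)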

3) Maybe it’s a driver bug?

EDIT: No. GPUs naturally work with vectorized types; packing the types would require additional instructions to de-/multiplex the vectors. In the time since this answer was written, significant changes to GPU architectures have happened: GPUs made these days are all scalar architectures, with the design emphasis on strong superscalar vectorization.

4) If it’s not a bug, then how should I organize and access an array in the case of std140?

Don't use uniform buffer objects for such large data. Put the data into a 1D texture and use texelFetch to index into it.
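A minimal sketch of that approach, reusing MyArray from the question (the texture object and sampler names are illustrative, not prescribed). Host side:

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_1D, tex);

/* integer textures require nearest filtering */
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

/* one 32-bit int per texel, 7102 texels, no padding anywhere */
glTexImage1D(GL_TEXTURE_1D, 0, GL_R32I, 7102, 0, GL_RED_INTEGER, GL_INT, MyArray);

And in the fragment shader:

uniform isampler1D myarray_tex;
…
int x = texelFetch(myarray_tex, i, 0).r;

texelFetch addresses texels directly by integer coordinate, so access stays as simple as the original myarray[i], and a 1D texture is limited by GL_MAX_TEXTURE_SIZE rather than GL_MAX_UNIFORM_BLOCK_SIZE.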

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow