OpenGL - One Step Further On The Way To Faster QUADS Rendering

https://stackoverflow.com/questions/7825710

27-10-2019

Question

I have been experimenting a bit and can now render around 3 million GL_QUADS onto the screen using

glDrawArrays(GL_QUADS, 0, nVertexCount);

I also use multiple buffering, cycling through 18 vertex buffer objects of 1 million vertices each. Every vertex position is computed using compressed data stored on the heap and a simple computation. I use

ptr = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);

and

glUnmapBuffer(GL_ARRAY_BUFFER);

to write every single vertex to the buffer objects each frame. When a buffer object is full, I unmap it, call glDrawArrays, and bind and map the next VBO to stream further vertex data. When all 18 have been used, I logically bind the first one and start over.

In my experience, mapping VBOs is almost twice as fast as using heap arrays for vertex data. How do I know? Since I render 3 million GL_QUADS, the frame rate is significantly lower than 30 fps, so I can simply see with my own eyes that the frame rate doubles with VBOs.

I also observed that calling glDrawArrays twice in succession on each filled vertex buffer object (rendering twice as many quads for the same effort of streaming the vertex data) is only insignificantly slower than rendering once. Therefore, I assume the major bottleneck is the streaming of vertex data into the vertex buffer objects (a 2 GHz dual-core is 60% busy with it!).

Right now each vertex takes 3 floats plus 2 floats for the texture coordinate (20 bytes in total). I guess I could shrink that to 3 GL_SHORT plus 2 GL_SHORT for the texture coordinate using translation matrices (10 bytes in total), but that would only cut the data in half. (And somehow sizeof(GL_SHORT) gives 4 on my system, so I'm not sure about that either.)
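
A minimal sketch of the size arithmetic in this paragraph. GL_SHORT is an enum token (an integer constant) that is passed to functions such as glVertexPointer, while the 16-bit data type is GLshort, so sizeof(GL_SHORT) measures an int. The typedefs and the #define below are stand-ins that mirror the usual GL header definitions so the example compiles without OpenGL; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Stand-ins for the OpenGL header definitions (assumption: real code
   would include the GL headers instead of redefining these). */
typedef float GLfloat;
typedef short GLshort;
#define GL_SHORT 0x1402   /* type token, not a data type */

/* Current layout: 3 floats for position + 2 floats for texture coords. */
typedef struct {
    GLfloat pos[3];
    GLfloat uv[2];
} VertexFloat;

/* Proposed layout: 3 shorts for position + 2 shorts for texture coords. */
typedef struct {
    GLshort pos[3];
    GLshort uv[2];
} VertexShort;

size_t float_vertex_bytes(void) { return sizeof(VertexFloat); }
size_t short_vertex_bytes(void) { return sizeof(VertexShort); }

/* sizeof on the token measures an int constant, not a GLshort. */
size_t token_bytes(void) { return sizeof(GL_SHORT); }
```

On a typical platform with 4-byte float and 2-byte short, the two layouts come out at 20 and 10 bytes per vertex, so the short layout halves the streamed data rather than quartering it.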

At any rate, there are games out there, some of them pretty old already, that render far more than 3 million primitives onto the screen each frame afaik (and they inevitably have to stream these vertices, because no GPU could hold so much data) and still get decent frame rates of over 100 fps!

I am sure, I'm still missing some important point in the process, but I just can't figure out what it is. Any suggestions?

EDIT: These are loose quads, like in a particle system (or rather, they are loose because each might end up having a different texture on it; the textures are subtextures of a single one, so no extensive binding ;) ).

Solution

I am sure, I'm still missing some important point in the process

The question should be "Do I need to draw 3 million quads?" rather than "How can I break the hardware limit?"

The limit you have to acknowledge is the hardware. Transfer rates, GPU clock and memory clock are characteristics that cannot be overridden without newer hardware. What you should do instead is make efficient use of the current hardware.

As far as I understand, you need to update the vertex buffers while rendering, so you map the buffer object, update the data, unmap, and render, and I suppose you do this repeatedly. In this case you have to consider the transfer rate from CPU to GPU: can you reduce the data required for rendering the scene? Maybe by interpolating key vertex positions?

For instance, if I need to render a terrain, I could send billions of triangles to render a perfect terrain. But surely I can reach the same result by using only the most important ones. Using fewer triangles without degrading the result lets me do more and more.

At 1920x1080 there are about 2 million pixels... do I need 2 million triangles to draw them? Maybe a fragment shader would perform better.
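
The comparison between primitive count and pixel count can be made concrete with a small calculation. This is only an illustrative sketch; the function names are hypothetical, and the 3 million figure is the quad count from the question.

```c
#include <assert.h>

/* Number of pixels on a screen of the given resolution. */
long pixel_count(int width, int height)
{
    assert(width > 0 && height > 0);
    return (long)width * (long)height;
}

/* Average number of primitives per screen pixel. A value above 1 means
   that, on average, each primitive covers less than one pixel. */
double primitives_per_pixel(long primitives, int width, int height)
{
    return (double)primitives / (double)pixel_count(width, height);
}
```

For 3,000,000 quads at 1920x1080 this gives 2,073,600 pixels and roughly 1.45 quads per pixel, which is the situation the paragraph above warns about.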

There are many techniques for reducing processing load (on both CPU and GPU) and transfer rates:

culling
level of detail
instanced rendering
key-frame animation
skeletal animation
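
As an illustration of the first item, culling, here is a minimal CPU-side sketch that tests a bounding sphere against a set of view-volume planes, so that only potentially visible objects are written into the vertex buffer at all. The Plane and Sphere types and both function names are hypothetical, and the convention that a point is inside when nx*x + ny*y + nz*z + d >= 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Plane in the form nx*x + ny*y + nz*z + d = 0, with a unit-length
   normal pointing towards the inside of the view volume. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Bounding sphere around one object (for example, one quad). */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns 0 if the sphere lies completely outside at least one plane,
   1 otherwise (inside or intersecting the volume). */
int sphere_visible(const Sphere *s, const Plane *planes, int plane_count)
{
    assert(s != NULL && planes != NULL);
    for (int i = 0; i < plane_count; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0;   /* entirely on the outside of this plane */
    }
    return 1;
}

/* Counts how many objects survive culling and would need streaming. */
size_t count_visible(const Sphere *spheres, size_t n,
                     const Plane *planes, int plane_count)
{
    size_t visible = 0;
    for (size_t i = 0; i < n; ++i)
        visible += (size_t)sphere_visible(&spheres[i], planes, plane_count);
    return visible;
}
```

Every object rejected here saves both the CPU work of computing its vertices and the transfer of those vertices to the GPU.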

OTHER TIPS

There are actually quite a few things you can do (and which are done in practice) to get more throughput. I am only skimming a few here, as this topic can fill (and does fill) one or more books.

Draw triangles, not quads. Ultimately, the quads will be split into triangles anyway (graphics hardware is optimized for triangle processing).
When you have big objects consisting of many connected triangles, use strips and fans wherever you can (this reduces the amount of vertex data to be sent from 3N to N+2 vertices for N triangles).
Clever caching of the data (especially when rendering large scenes) is vital. As you have observed, data transfer is the bottleneck in the system, so basically every engine is heavily optimized to avoid unnecessary data transfers. This is application dependent, though. It is also a topic about which many books can be (and have been) written.
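
The first two tips can be sketched in plain C. The first function builds an index list that draws each loose quad as two triangles while still storing only four vertices per quad; since the pattern never changes, such an index buffer could be filled once and reused with glDrawElements(GL_TRIANGLES, ...). The other two functions only express the 3N versus N+2 vertex counts. All function names are hypothetical, and the split into triangles (0,1,2) and (0,2,3) is one common choice, not the only one.

```c
#include <assert.h>
#include <stddef.h>

/* Fills `indices` (6 entries per quad) so that quad q, whose vertices
   are stored at positions 4q .. 4q+3 in GL_QUADS order, is drawn as
   the two triangles (0,1,2) and (0,2,3). */
void build_quad_indices(unsigned int *indices, size_t quad_count)
{
    assert(indices != NULL || quad_count == 0);
    for (size_t q = 0; q < quad_count; ++q) {
        unsigned int base = (unsigned int)(4 * q);
        unsigned int *out = indices + 6 * q;
        out[0] = base;     out[1] = base + 1; out[2] = base + 2;
        out[3] = base;     out[4] = base + 2; out[5] = base + 3;
    }
}

/* Vertices sent for N independent triangles. */
size_t list_vertex_count(size_t triangles)
{
    return 3 * triangles;
}

/* Vertices sent for N triangles in one strip or fan. */
size_t strip_vertex_count(size_t triangles)
{
    return triangles == 0 ? 0 : triangles + 2;
}
```

Note that strips and fans only pay off for connected geometry; for loose quads like those described in the question, the indexed approach is the one that applies.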

I can recommend these books as an entry to the topic:

http://www.realtimerendering.com/

http://www.gameenginebook.com/

Every vertex position is computed using compressed data stored on the heap and a simple computation.

Maybe a vertex or geometry shader can do it instead?

At any rate, there are games out there, some of them pretty old already, that render far more than 3 million primitives onto the screen each frame afaik (and they inevitably have to stream these vertices, because no GPU could hold so much data)

3,000,000 * 20 bytes = 60 megabytes, which is easily within reach of even older GPUs.
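
As a rough cross-check of this estimate, the footprint can be computed for different assumptions. The figure above counts 20 bytes per item; if each loose quad stores four separate 20-byte vertices, as the layout in the question suggests, the total is four times larger. The function name is hypothetical.

```c
#include <assert.h>

/* Total bytes for `count` primitives, each made of
   `vertices_per_primitive` vertices of `bytes_per_vertex` bytes. */
unsigned long long buffer_bytes(unsigned long long count,
                                unsigned long long vertices_per_primitive,
                                unsigned long long bytes_per_vertex)
{
    assert(vertices_per_primitive > 0 && bytes_per_vertex > 0);
    return count * vertices_per_primitive * bytes_per_vertex;
}
```

With these inputs, buffer_bytes(3000000, 1, 20) is 60,000,000 bytes, while buffer_bytes(3000000, 4, 20) is 240,000,000 bytes, and a 10-byte vertex made of shorts would bring the latter down to 120,000,000 bytes.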

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow