Non-interleaved vertex buffers DirectX11

Question 1

How to

There are several ways. I'll describe the simplest one.

Just create separate vertex buffers:

ID3D11Buffer* positions;
ID3D11Buffer* texcoords;
ID3D11Buffer* normals;

Create input layout elements, incrementing InputSlot member for each component:

{ "POSITION",  0,  DXGI_FORMAT_R32G32B32_FLOAT,  0, 0,                            D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "TEXCOORD",  0,  DXGI_FORMAT_R32G32_FLOAT,     1, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "NORMAL",    0,  DXGI_FORMAT_R32G32B32_FLOAT,  2, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
                                             //  ^
                                             // InputSlot

Bind buffers to their slots (better all in one shot):

ID3D11Buffer** vbs = {positions, texcoords, normals};
unsigned int strides[] = { /*strides go here*/ };
unsigned int offsets [] = { /*offsets go here*/ };
m_Context->IASetVertexBuffers(0, 3, vbs, strides, offsets );

Draw as usual. You don't need to change HLSL code (HLSL will think as it have single buffer).

Note, that code snippets was written on-the-fly and can contain mistakes.

Edit: you can improve this approach, combining buffers by update rate: if texcoords and normals never changed, merge them.

As of performance

It is all about locality of references: the closer data, the faster access.

Interleaved buffer, in most cases, gives (by far) more performance for GPU side (i.e. rendering): for each vertex each attribute near each other. But separate buffers gives faster CPU access: arrays are contiguous, each next data is near previous.

So, overall, performance concerns depends on how often you writing to buffers. If your limiting factor is CPU writes, stick to separate buffers. If not, go for single one.

How will you know? Only one way - profile. Both, CPU side, and GPU side (via Graphics debugger/profiler from your GPU's vendor).

Another factors

The best practice is to limit CPU writes, so, if you will find that you are limited by buffer updating, you probably need to re-view your approach. Do we need to update buffer each frame if we have 500 fps? User won't see difference if you reduce buffer update rate to 30-60 times per second (unbind buffer update from frame update). So, if your updating strategy is reasonable, you will likely never be CPU-limited and best approach is classic interleaving.

You can also consider re-designing your data pipeline, or even somehow prepare data offline (we call it "baking"), so you will not need to cope with non-interleaved buffers. That will be quite reasonable too.

Reduce memory footprint or increase performance?

Memory-to-performance tradeoff. This is the eternal question. Duplicate memory to take advantages of interleaving? Or not?

Answer is... "that depends". You are programming new CryEngine, targeting top GPUs with gigabytes of memory? Or you're programming for embedded systems of mobile platform, where memory resources slow and limited? Does 1 megabyte memory worth hassle at all? Or you have huge models, 100 MB each? We don't know.

It's all up to you to decide. But remember: there are no free candies. If you'll find memory economy worth performance loss, do it. Profile and compare to be sure.

Hope it helps somehow. Happy coding! =)

Question 2

Interleaved/Separate will mostly affect your Input Assembler stage (GPU side).

A perfect scenario for Interleaved is when your Buffer memory arrangement perfectly fits your vertex shader input. So your Input assembler can simply fetch the data.

In that case you'll be totally fine with interleaved, even tho testing with a large model (two versions of the same data, one interleaved, one separate), TimeStamp query didn't reported any major difference (some pretty minimal vertex processing and basic pixel shader).

Now having separate buffers makes it much easier to fine tune in case you use your geometry in different contexts.

Let's say you have Position/Normals/UV (like in your case).

Now you also have a shader in your pipeline that only requires Position (Shadow Map would be a pretty good example).

With separate buffers, you can simply create a new input layout which contains position only, And bind that buffer instead. Your IA stage has only to load that buffer. Best of all you can even do that dynamically using shader reflection.

If you bind Interleaved data, you will have some overhead due to the fact it has to load with a stride.

When I tested that one I had about 20% gain using Separate instead of Interleaved, which can be quite decent, but since this type of processing can be largely architecture dependent, don't take it for granted (NVidia 740M for testing).

So simply put, profile (a lot), and check which gives you the best balance between your GPU and CPU loads.

Please also note that the overhead from Input Assembler will decrease from the complexity of your shader, if you apply some heavy calculations + add some tessellation + some decent shading, the time difference between interleaved/non interleaved will become progressively meaningless.

Question 3

You should stick with interleaved buffers. Any other technique will require some form of indirection to your non-duplicated position buffer, which will cost you performance and cache efficiency.