Question

I am currently in the process of tiling my C++ AMP code. For each tile, I have 4096 bytes of data which are read from frequently, so I would like to declare this as tile_static. It is not practical to divide this data across multiple tiles, as each thread requires access to all of it. My tiles consist of 128 threads, so they should take up 2-4 warps on Nvidia/AMD GPUs.
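To make the setup concrete, here is a minimal sketch of what I am trying to do (the summing in the kernel body is just a placeholder for my real computation, and the names are made up):

#include <amp.h>
#include <vector>
using namespace concurrency;

void ProcessTiled(const std::vector<float>& vData,   // 1024 floats = 4096 bytes
                  std::vector<float>& vOut, int N)   // N assumed to be a multiple of 128
{
    array_view<const float, 1> data(1024, vData);
    array_view<float, 1> out(N, vOut);
    out.discard_data();

    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static float shared[1024];              // the 4096 bytes in question

        // Cooperative load: each of the 128 threads copies 8 of the 1024 values.
        for (int i = t_idx.local[0]; i < 1024; i += 128)
            shared[i] = data[i];
        t_idx.barrier.wait();                        // make the loads visible tile-wide

        // Placeholder work: every thread reads the entire buffer.
        float sum = 0.0f;
        for (int i = 0; i < 1024; i++)
            sum += shared[i];
        out[t_idx.global] = sum;
    });
}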

I just read the following article, which seems to suggest that I can only use 1024 bits in tile_static per warp: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/14/avoid-bank-conflicts-on-tile-static-memory-with-c-amp.aspx

On some modern GPUs, tile_static memory consists of “n” equally sized memory banks which can be accessed simultaneously, and successive “m”-bit words are mapped to successive memory banks. The exact organization of tile_static memory (i.e. n and m) is hardware dependent. For example, on an Nvidia GTX 580 card or an ATI HD 5870 card, tile_static memory has 32 banks (n = 32) that are organized such that successive 32-bit words (m = 32) map to successive memory banks. Note that n might be different from hardware to hardware, m is usually 32. In the rest of the post, I will assume m is 32.

Does this mean that I can declare up to 1024 bits per warp, or per thread? Are all tile_static variables shared between warps, or does each warp have its own copy?

How much of this is hardware-dependent, and if it is, how can I find out the limitations at runtime?

I have read a C++ AMP book cover to cover, and while I am thankful to the authors for introducing me to the subject, it did not seem to address this question (or if it did, I didn't understand it).

There is a wealth of info online about how to use tile_static memory (this one is a good start: http://www.danielmoth.com/Blog/tilestatic-Tilebarrier-And-Tiled-Matrix-Multiplication-With-C-AMP.aspx) but no one seems to talk about how much we can declare, making it impossible to actually implement any of this stuff! That last link gives the following example:

01: void MatrixMultiplyTiled(vector<float>& vC, 
         const vector<float>& vA, 
         const vector<float>& vB, int M, int N, int W)
02: {
03:   static const int TS = 16;

04:   array_view<const float,2> a(M, W, vA);
05:   array_view<const float,2> b(W, N, vB);
06:   array_view<float,2> c(M,N,vC); c.discard_data();

07:   parallel_for_each(c.extent.tile< TS, TS >(),
08:   [=] (tiled_index< TS, TS> t_idx) restrict(amp) 
09:   {
10:     int row = t_idx.local[0]; int col = t_idx.local[1];
11:     float sum = 0.0f;

12:     for (int i = 0; i < W; i += TS) {
13:        tile_static float locA[TS][TS], locB[TS][TS];
14:        locA[row][col] = a(t_idx.global[0], col + i);
15:        locB[row][col] = b(row + i, t_idx.global[1]);
16:        t_idx.barrier.wait();

17:        for (int k = 0; k < TS; k++)
18:          sum += locA[row][k] * locB[k][col];

19:        t_idx.barrier.wait();
20:     }

21:     c[t_idx.global] = sum;
22:   });
23: }

Note that line 13 declares 2x 1024 bits, which makes me hopeful that my 4096 bits isn't too much to ask for... If anyone with experience in C++ AMP or GPU programming in general could help me out, that would be great - I imagine these questions depend more on the hardware/implementation than on the AMP language extension itself...

Solution

Firstly, I think you mean 2 x 1024 bytes, not bits. tile_static memory is declared per tile, not per thread or per warp. A warp is really just a scheduling construct for organizing groups of threads which execute together, typically in groups of 32 or 64 threads depending on the architecture. To make life easier for the scheduler, you should use tiles that contain a number of threads that is an exact multiple of the warp size.
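To make that concrete, here is a small sketch of my own (not from the original question) showing that a tile_static variable has one instance per tile, visible to every thread in the tile regardless of which warp that thread belongs to:

#include <amp.h>
using namespace concurrency;

void Demo(const array_view<int, 1>& out)   // extent assumed to be a multiple of 128
{
    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static int flag;              // ONE copy per tile, not per thread or warp
        if (t_idx.local[0] == 0)
            flag = 42;                     // written by a single thread...
        t_idx.barrier.wait();              // ...and made visible to the whole tile
        out[t_idx.global] = flag;          // every thread, in every warp, reads 42
    });
}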

Before discussing this in detail, it’s helpful to revisit how GPUs execute the threads that make up your kernel. GPUs consist of several processors; AMD refers to them as Compute Units (CUs), whereas NVIDIA calls them Streaming Multiprocessors. Each CU schedules work in chunks or bundles of threads referred to as warps. When a warp is blocked, the CU's scheduler can hide latency by switching to another warp rather than waiting for the current one. CUs use this approach to hide the latencies associated with memory accesses, provided that sufficient warps are available.

One of the reasons this is not covered in great detail in the book is that C++ AMP is designed to be hardware-agnostic and runs on top of DirectX, so if you design your application with specific GPU details in mind, it may become less portable. In addition, because C++ AMP is implemented on top of DX11, in some cases there is simply no way to get at hardware-specific information. The warp size, tile_static memory size, and cache sizes are all examples of this. As with any book, we also had space constraints and publication deadlines.

However, you can make reasonable assumptions about the warp size and tile memory: on a modern GPU, a warp size of 32 or 64 and tile_static memory on the order of tens of KB. If you really want to tune your code for a specific processor, then you can use the manufacturer's specs and/or a tool that displays the appropriate details.
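The C++ AMP runtime itself will also tell you some device properties at runtime, though notably not the warp size or the tile_static size. A sketch of what is queryable:

#include <amp.h>
#include <iostream>
using namespace concurrency;

void ListAccelerators()
{
    // Enumerate every accelerator the runtime can see and print its properties.
    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L"\n  dedicated memory:  " << acc.dedicated_memory << L" KB"
                   << L"\n  emulated:          " << acc.is_emulated
                   << L"\n  double precision:  " << acc.supports_double_precision
                   << L"\n";
    }
}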

On Tesla, the shared memory is 16 KB. On Fermi, the shared memory is actually 64 KB, and can be configured as a 48 KB software-managed data cache with a 16 KB hardware data cache, or the other way around (16 KB software-managed, 48 KB hardware cache).

Tens of KB may not seem like a lot of memory for a tile_static array, but in reality there are other pressures that will also dictate tile size; register pressure, for one. You should also remember that a few very large tiles usually result in low occupancy and thus inefficient code.

I agree that the whole memory bank terminology is confusing. The 32 bits refers to the width of each memory bank's words, not to the size of the total memory. You can think of a bank as the access mechanism rather than the total storage. As the reference above notes, successive 32-bit words map to successive banks. Because you get one access per bank per cycle, the most efficient pattern is for each thread in a warp to read from a different bank, or for all threads to read the same item (a broadcast). The book contains some discussion of this in the performance/optimization chapter.
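To illustrate with a generic GPU idiom (my own sketch, not code from the question): with 32 banks of 32-bit words, element buf[r][c] of a tile_static float buf[32][32] lands in bank c, so a warp reading down one column hits a single bank 32 times. Padding each row to 33 floats makes the bank depend on the row as well, spreading a column read across all 32 banks:

#include <amp.h>
using namespace concurrency;

void TransposeTiled(const array_view<const float, 2>& in,
                    const array_view<float, 2>& out)   // extents assumed divisible by 32
{
    parallel_for_each(in.extent.tile<32, 32>(),
        [=](tiled_index<32, 32> t_idx) restrict(amp)
    {
        // tile_static float buf[32][32]; // bank = col: column reads conflict 32 ways
        tile_static float buf[32][33];    // bank = (row + col) % 32: columns spread out

        int r = t_idx.local[0], c = t_idx.local[1];
        buf[r][c] = in[t_idx.global];
        t_idx.barrier.wait();

        // Column-wise read; with the padded array each thread of a warp
        // touches a different bank.
        out(t_idx.tile_origin[1] + r, t_idx.tile_origin[0] + c) = buf[c][r];
    });
}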

OTHER TIPS

It's about 32 KB. If you hit the limit, you'll get an error when you try to compile.

If you're not getting an error, you're okay. You can test it yourself by declaring a massive tile_static array; you should get an angry message telling you what the tile_static limit is.
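For example, a sketch like this (the size is my own guess at something comfortably over the limit) should be refused by the compiler, and the error text states the actual cap:

#include <amp.h>
using namespace concurrency;

void Probe(const array_view<float, 1>& out)
{
    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static float too_big[10000];   // 40000 bytes, past the ~32 KB limit
        too_big[t_idx.local[0]] = 1.0f;     // touch it so it isn't optimized away
        out[t_idx.global] = too_big[t_idx.local[0]];
    });
}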

Licensed under: CC-BY-SA with attribution