Question

I am currently in the process of tiling my C++ AMP code. For each tile, I have 4096 bytes of data which are read from frequently, so I would like to declare this as tile_static. It is not practical to divide this data across multiple tiles, as each thread requires access to all of it. My tiles consist of 128 threads, so they should take up 2-4 warps on Nvidia/AMD GPUs.
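To make the setup concrete, here is a minimal sketch of what I am trying to do (the summing in the kernel body is just a placeholder for my real computation, and the names are made up):

#include <amp.h>
#include <vector>
using namespace concurrency;

void ProcessTiled(const std::vector<float>& vData,   // 1024 floats = 4096 bytes
                  std::vector<float>& vOut, int N)   // N assumed to be a multiple of 128
{
    array_view<const float, 1> data(1024, vData);
    array_view<float, 1> out(N, vOut);
    out.discard_data();

    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static float shared[1024];              // the 4096 bytes in question

        // Cooperative load: each of the 128 threads copies 8 of the 1024 values.
        for (int i = t_idx.local[0]; i < 1024; i += 128)
            shared[i] = data[i];
        t_idx.barrier.wait();                        // make the loads visible tile-wide

        // Placeholder work: every thread reads the entire buffer.
        float sum = 0.0f;
        for (int i = 0; i < 1024; i++)
            sum += shared[i];
        out[t_idx.global] = sum;
    });
}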

I just read the following article, which seems to suggest that I can only use 1024 bits in tile_static per warp: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/14/avoid-bank-conflicts-on-tile-static-memory-with-c-amp.aspx

On some modern GPUs, tile_static memory consists of “n” equally sized memory banks which can be accessed simultaneously, and successive “m”-bit words are mapped to successive memory banks. The exact organization of tile_static memory (i.e. n and m) is hardware dependent. For example, on an Nvidia GTX 580 card or an ATI HD 5870 card, tile_static memory has 32 banks (n = 32) that are organized such that successive 32-bit words (m = 32) map to successive memory banks. Note that n might be different from hardware to hardware, m is usually 32. In the rest of the post, I will assume m is 32.

Does this mean that I can declare up to 1024 bits per warp, or per thread? Are all tile_static variables shared between warps, or does each warp have its own copy?

How much of this is hardware-dependent, and if it is, how can I find out the limitations at runtime?

I have read a C++ AMP book cover to cover, and while I am thankful to the authors for introducing me to the subject, it did not seem to address this question (or if it did, I didn't understand it).

There is a wealth of info online about how to use tile_static memory (this one is a good start: http://www.danielmoth.com/Blog/tilestatic-Tilebarrier-And-Tiled-Matrix-Multiplication-With-C-AMP.aspx) but no one seems to talk about how much we can declare, making it impossible to actually implement any of this stuff! That last link gives the following example:

01: void MatrixMultiplyTiled(vector<float>& vC, 
         const vector<float>& vA, 
         const vector<float>& vB, int M, int N, int W)
02: {
03:   static const int TS = 16;

04:   array_view<const float,2> a(M, W, vA);
05:   array_view<const float,2> b(W, N, vB);
06:   array_view<float,2> c(M,N,vC); c.discard_data();

07:   parallel_for_each(c.extent.tile< TS, TS >(),
08:   [=] (tiled_index< TS, TS> t_idx) restrict(amp) 
09:   {
10:     int row = t_idx.local[0]; int col = t_idx.local[1];
11:     float sum = 0.0f;

12:     for (int i = 0; i < W; i += TS) {
13:        tile_static float locA[TS][TS], locB[TS][TS];
14:        locA[row][col] = a(t_idx.global[0], col + i);
15:        locB[row][col] = b(row + i, t_idx.global[1]);
16:        t_idx.barrier.wait();

17:        for (int k = 0; k < TS; k++)
18:          sum += locA[row][k] * locB[k][col];

19:        t_idx.barrier.wait();
20:     }

21:     c[t_idx.global] = sum;
22:   });
23: }

Note that line 13 declares 2x 1024 bits, which makes me hopeful that my 4096 bits isn't too much to ask for... If anyone with experience in C++ AMP or GPU programming in general could help me out, that would be great - I imagine these questions depend more on the hardware/implementation than on the AMP language extension itself...

Solution

Firstly, I think you mean 2 x 1024 bytes, not bits. tile_static memory is declared per tile, not per thread or per warp. A warp is really just a scheduling construct for organizing groups of threads which execute together, typically in groups of 32 or 64 threads depending on the architecture. To make life easier for the scheduler, you should use tiles that contain a number of threads that is an exact multiple of the warp size.
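To make that concrete, here is a small sketch of my own (not from the original question) showing that a tile_static variable has one instance per tile, visible to every thread in the tile regardless of which warp that thread belongs to:

#include <amp.h>
using namespace concurrency;

void Demo(const array_view<int, 1>& out)   // extent assumed to be a multiple of 128
{
    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static int flag;              // ONE copy per tile, not per thread or warp
        if (t_idx.local[0] == 0)
            flag = 42;                     // written by a single thread...
        t_idx.barrier.wait();              // ...and made visible to the whole tile
        out[t_idx.global] = flag;          // every thread, in every warp, reads 42
    });
}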

Before discussing this in detail, it’s helpful to revisit how GPUs execute the threads that make up your kernel. GPUs consist of several processors; AMD refers to them as Compute Units (CUs), whereas NVIDIA calls them Streaming Multiprocessors. Each CU schedules work in chunks or bundles of threads referred to as warps. When a warp is blocked, the CU's scheduler can hide latency by switching to another warp rather than waiting for the current one. CUs use this approach to hide the latencies associated with memory accesses, provided that sufficient warps are available.

One of the reasons this is not covered in great detail in the book is that C++ AMP is designed to be hardware-agnostic and runs on top of DirectX, so if you design your application with specific GPU details in mind, it may become less portable. In addition, because C++ AMP is implemented on top of DX11, in some cases there is simply no way to get at hardware-specific information. The warp size, tile_static memory size, and cache sizes are all examples of this. As with any book, we also had space constraints and publication deadlines.

However, you can make reasonable assumptions about the warp size and tile memory: on a modern GPU, a warp size of 32 or 64 and tile_static memory on the order of tens of KB. If you really want to tune your code for a specific processor, then you can use the manufacturer's specs and/or a tool that displays the appropriate details.
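The C++ AMP runtime itself will also tell you some device properties at runtime, though notably not the warp size or the tile_static size. A sketch of what is queryable:

#include <amp.h>
#include <iostream>
using namespace concurrency;

void ListAccelerators()
{
    // Enumerate every accelerator the runtime can see and print its properties.
    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L"\n  dedicated memory:  " << acc.dedicated_memory << L" KB"
                   << L"\n  emulated:          " << acc.is_emulated
                   << L"\n  double precision:  " << acc.supports_double_precision
                   << L"\n";
    }
}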

On Tesla, the shared memory is 16 KB. On Fermi, the shared memory is actually 64 KB, and can be configured as a 48 KB software-managed data cache with a 16 KB hardware data cache, or the other way around (16 KB software-managed, 48 KB hardware cache).

Tens of KB may not seem like a lot of memory for a tile_static array, but in reality there are other pressures that will also dictate tile size; register pressure, for one. You should also remember that a few very large tiles usually result in low occupancy and thus inefficient code.

I agree that the whole memory bank terminology is confusing. The 32 bits refers to the width of each memory bank's words, not to the size of the total memory. You can think of a bank as the access mechanism rather than the total storage. As the reference above notes, successive 32-bit words map to successive banks. Because you get one access per bank per cycle, the most efficient pattern is for each thread in a warp to read from a different bank, or for all threads to read the same item (a broadcast). The book contains some discussion of this in the performance/optimization chapter.
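To illustrate with a generic GPU idiom (my own sketch, not code from the question): with 32 banks of 32-bit words, element buf[r][c] of a tile_static float buf[32][32] lands in bank c, so a warp reading down one column hits a single bank 32 times. Padding each row to 33 floats makes the bank depend on the row as well, spreading a column read across all 32 banks:

#include <amp.h>
using namespace concurrency;

void TransposeTiled(const array_view<const float, 2>& in,
                    const array_view<float, 2>& out)   // extents assumed divisible by 32
{
    parallel_for_each(in.extent.tile<32, 32>(),
        [=](tiled_index<32, 32> t_idx) restrict(amp)
    {
        // tile_static float buf[32][32]; // bank = col: column reads conflict 32 ways
        tile_static float buf[32][33];    // bank = (row + col) % 32: columns spread out

        int r = t_idx.local[0], c = t_idx.local[1];
        buf[r][c] = in[t_idx.global];
        t_idx.barrier.wait();

        // Column-wise read; with the padded array each thread of a warp
        // touches a different bank.
        out(t_idx.tile_origin[1] + r, t_idx.tile_origin[0] + c) = buf[c][r];
    });
}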

OTHER TIPS

It's about 32 KB. If you hit the limit, you'll get an error when you try to compile.

If you're not getting an error, you're okay. You can test it yourself by declaring a massive tile_static array; you should get an angry message telling you what the tile_static limit is.
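For example, a sketch like this (the size is my own guess at something comfortably over the limit) should be refused by the compiler, and the error text states the actual cap:

#include <amp.h>
using namespace concurrency;

void Probe(const array_view<float, 1>& out)
{
    parallel_for_each(out.extent.tile<128>(),
        [=](tiled_index<128> t_idx) restrict(amp)
    {
        tile_static float too_big[10000];   // 40000 bytes, past the ~32 KB limit
        too_big[t_idx.local[0]] = 1.0f;     // touch it so it isn't optimized away
        out[t_idx.global] = too_big[t_idx.local[0]];
    });
}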

Licensed under: CC-BY-SA with attribution