Kepler: blocks per MP?

Question

As you said correctly, on Kepler a streaming multiprocessor (SM) can run up to 16 thread blocks at a time.
In your example, if a thread block consists of 1024 threads, then only two blocks can be resident on one SM at the same time, because in this case you are limited by the maximum number of threads per multiprocessor: 2048 / 1024 = 2 blocks.
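To make that arithmetic concrete, here is a minimal sketch in Python. The function name and constants are made up for illustration; the two limits are the Kepler numbers quoted above, and the sketch ignores every other resource.

```python
# Limits quoted in the text above for Kepler (compute capability 3.0).
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16


def blocks_limited_by_threads(threads_per_block):
    """Resident blocks per SM when only the thread and block limits apply.

    Hypothetical helper: it ignores registers and shared memory.
    """
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)


if __name__ == "__main__":
    for size in (64, 128, 256, 1024):
        print(size, "threads per block ->", blocks_limited_by_threads(size), "blocks per SM")
```

With small blocks (128 threads or fewer) the 16-block limit is reached first; with 1024 threads per block the thread limit leaves room for only 2 blocks.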

Several factors influence how many blocks can run concurrently on a streaming multiprocessor. An SM also has a limited number of registers and a limited amount of shared memory. If your kernel uses too many registers or too much shared memory, then these resources become the limiting factor.

A good tool for this is the CUDA Occupancy Calculator. With the Excel sheet you can easily enter a kernel configuration for any CUDA architecture and see which resource limits the kernel.
The CUDA Programming Guide also provides all the required information.

Maybe a simple example helps. I worked it out with the occupancy calculator for compute capability 3.0:

If your thread block consists of 512 threads and you don't use any registers or shared memory, then the number of concurrent blocks is determined only by the block size. For cc 3.0, up to 2048 threads can be resident per SM, so 2048 / 512 = 4. Only 4 thread blocks can run at the same time.

In the second step, assume the kernel additionally uses 48 registers per thread. Each thread block then needs 512 * 48 = 24576 registers. But an SM has only 65536 registers, and 65536 / 24576 is about 2.67, so now only two blocks can run instead of four.

In the last step, let's assume each block also uses 32000 bytes of shared memory. Because an SM has only 49152 bytes of shared memory available, and 49152 / 32000 is about 1.5, only 1 thread block can run at a time.
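The three steps above can be reproduced with a few lines of Python. This is only a rough sketch of the reasoning, not the occupancy calculator itself: the function name is hypothetical, the limits are the cc 3.0 values used in this answer, and it ignores the allocation granularity that the real calculator applies to registers and shared memory.

```python
# cc 3.0 per-SM limits used in the example above.
MAX_BLOCKS_PER_SM = 16
MAX_THREADS_PER_SM = 2048
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 49152  # bytes


def resident_blocks(threads_per_block, regs_per_thread=0, smem_per_block=0):
    """Return (blocks per SM, name of the limiting resource).

    Simplified model: plain integer division, no allocation granularity.
    """
    limits = {
        "blocks": MAX_BLOCKS_PER_SM,
        "threads": MAX_THREADS_PER_SM // threads_per_block,
    }
    if regs_per_thread > 0:
        limits["registers"] = REGISTERS_PER_SM // (threads_per_block * regs_per_thread)
    if smem_per_block > 0:
        limits["shared memory"] = SHARED_MEM_PER_SM // smem_per_block
    # The tightest limit decides how many blocks fit on one SM.
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


if __name__ == "__main__":
    print(resident_blocks(512))             # step 1: block size only
    print(resident_blocks(512, 48))         # step 2: plus 48 registers per thread
    print(resident_blocks(512, 48, 32000))  # step 3: plus 32000 bytes shared memory
```

For the three configurations this gives 4 blocks (limited by threads), 2 blocks (limited by registers), and 1 block (limited by shared memory), matching the steps above.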