How are threads/blocks mapped on the GPU when calling the cublasSgemm/clAmdBlasSgemm routines?

Question 1

For the host-side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answers to your questions are as follows:

1. Modern CUBLAS is closed source. There are code bases like MAGMA which you could look at to at least get a feel for how CUBLAS might be implemented. You can also run CUBLAS code in one of the NVIDIA-supplied profilers to see what it does on the GPU. But the point is that you don't need to know how it works. There is an API and some very thorough documentation. That is all you need to know.
2. Your example problem requires roughly 1.2 GB of memory. If you have a GPU with that much memory, and either enough computational capacity to avoid the display driver watchdog timer or a compute-dedicated GPU, it will work. Memory and the display driver time limit (where applicable) are the only limitations.
3. No.

Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.

Question 2

Before going any further, you must read the papers of Volkov and Demmel. Have a look at http://www.cs.berkeley.edu/~volkov/ and see Volkov's article regarding SGEMM. The answers have been there since 2008.