Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick web search brings up some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096 elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) is very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all four cases may be worthwhile.
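A minimal sketch of such a benchmark could look like the following. (This is not from the linked post - the matrix size n=4096 is only chosen to roughly match the numbers quoted above, the inputs are left uninitialized because only the access pattern matters for the timing, and all error checks are omitted. It uses the cuBLAS v2 API and CUDA events.)

// bench_transpose.cu - time cublasSgemm for all four transpose combinations
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4096;
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&dB, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&dC, (size_t)n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    cublasOperation_t ops[2] = { CUBLAS_OP_N, CUBLAS_OP_T };
    const char *names[2] = { "N", "T" };

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int ta = 0; ta < 2; ta++)
    {
        for (int tb = 0; tb < 2; tb++)
        {
            // Warm-up call so the timed call excludes one-time initialization
            cublasSgemm(handle, ops[ta], ops[tb], n, n, n,
                        &alpha, dA, n, dB, n, &beta, dC, n);
            cudaDeviceSynchronize();

            cudaEventRecord(start);
            cublasSgemm(handle, ops[ta], ops[tb], n, n, n,
                        &alpha, dA, n, dB, n, &beta, dC, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);

            // 2*n^3 floating point operations, elapsed time in milliseconds
            double gflops = 2.0 * n * (double)n * n / (ms * 1e6);
            printf("ta=%s tb=%s : %8.3f ms, %8.2f GFLOPS\n",
                   names[ta], names[tb], ms, gflops);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}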
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases, assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicate that this is indeed the case:
sgemm_sm35_ldg_nt_64x16x128x8x32

- sm35: Runs on a device with compute capability 3.5
- ldg: ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
- nt: First matrix is Not transposed, second one is Transposed
- 64x16x128x8x32: Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and the "border tiles" with one that is capable of doing range checks and/or coping with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core routine for matrices of size 256x256, and two further calls for the remaining 32x288 and 288x32 entries.)
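Purely to make that decomposition concrete (a hypothetical example of the idea, not the actual CUBLAS dispatch logic), the split for this 288x288 case could look like this:

#include <cstdio>

int main()
{
    const int n = 288;      // matrix size from the example above
    const int tile = 256;   // size the hypothetical "fast" kernel can handle

    int core   = (n / tile) * tile;  // 256: largest tile-aligned part of C
    int border = n - core;           // 32: rest, needs range checks

    printf("fast kernel   : %d x %d core block of C\n", core, core);
    printf("border kernel : %d x %d strip (rightmost columns)\n", n, border);
    printf("border kernel : %d x %d strip (bottom rows)\n", border, core);
    return 0;
}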
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
- the hardware version (compute capability) of the target system
- the transpose flags
- the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transposed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.
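Just to make that access pattern explicit, here is the plain 3-nested-loops reference version mentioned above (row-major storage assumed; this is emphatically not how CUBLAS implements sgemm):

#include <cstdio>

// Naive reference implementation of c = a * b for n x n matrices in
// row-major storage. In the inner loop, a is read with stride 1 (consecutive
// values of one row), while b is read with stride n ("the first value in
// each row") - exactly the pattern discussed above.
static void naive_gemm(int n, const float *a, const float *b, float *c)
{
    for (int row = 0; row < n; row++)
    {
        for (int col = 0; col < n; col++)
        {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
            {
                sum += a[row * n + k]   // consecutive reads from a
                     * b[k * n + col];  // strided reads from b (stride n)
            }
            c[row * n + col] = sum;
        }
    }
}

int main()
{
    const int n = 3;
    float a[9], b[9], c[9];
    for (int i = 0; i < 9; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    naive_gemm(n, a, b, c);
    printf("c[0][0] = %f (expected %f)\n", c[0], n * 2.0f);
    return 0;
}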