Question

I am using the texture cache to speed up scientific computation, and I am trying to understand why texture memory can make code faster.

One possibility is that the texture cache absorbs traffic that would otherwise go to the L2 cache. That would make sense if texture memory access is faster than the L2 cache, but I have not found any benchmarks that show this.

Does anyone know more about this?

Solution

All texture transactions also flow through the L2 cache, so generally speaking it's rare for texture to ever be slower than L2. You can think of the texture cache on Kepler as an alternate L1 cache for read-only data.

The specifics of the texture cache are complex and aren't really well documented, but it is a very high performance cache, especially for streaming access patterns with some amount of locality and reuse between different threads in the same warp or thread block.

One important point about the texture cache is that, unlike a CPU L1 cache, it's not really designed to decrease latency. Rather, it is designed to be a "bandwidth aggregator" which aggregates simultaneous loads from many threads and tries to stream the results back to the processing units as efficiently as possible. This means that the memory system can fetch the same amount of data in fewer total transactions.

Without more info, it's difficult to say whether use of the texture cache (for instance via the __ldg() intrinsic on Kepler) will improve performance for any particular access pattern, but if your kernel is bandwidth bound it is usually worth a try.
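For example, on Kepler (compute capability 3.5 and later) a read-only global load can be routed through the texture cache with the __ldg() intrinsic, or by qualifying the pointer arguments const __restrict__ so the compiler can choose that path itself. A minimal sketch, with an illustrative kernel of my own rather than anything from the question:

    // Minimal sketch: read-only loads through the texture (read-only data) cache.
    // Needs compute capability 3.5+ for __ldg().
    __global__ void scale(const float* __restrict__ in,
                          float*       __restrict__ out,
                          float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // __ldg() forces this load to use the read-only (texture) cache path;
            // with const __restrict__ the compiler will often do this on its own.
            out[i] = __ldg(&in[i]) * factor;
        }
    }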

Regarding your specific point: yes, data which hits in the texture cache does not have to go out to the L2. However, again due to the specifics of the texture cache, this is usually a smaller effect than the bandwidth aggregation described above, which results in fewer total memory transactions being sent to the L2.

OTHER TIPS

I have some benchmarks that might explain some things. Two of them carry out the same calculations on a PC, on the CPU and on the graphics card using CUDA; there is also a version using OpenMP. Note that the GPU shown below has 128 cores, each with numerous registers, and it is these, rather than memory speed, that provide the outstanding performance.

http://www.roylongbottom.org.uk/linux_cuda_mflops.htm

http://www.roylongbottom.org.uk/linux%20multithreading%20benchmarks.htm#anchor5

The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word (the expression as written is the 8-operation case). Array sizes used are 0.1, 1 or 10 million 4-byte single precision floating point words. A kernel sketch of this calculation and a summary of some of the results follow.
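This sketch is my own illustration of the 8-operation case, not the actual benchmark source; the kernel name and parameters are assumptions:

    // Sketch: 8 floating point operations per data word
    // (3 adds, 3 multiplies, 1 subtract, 1 add).
    __global__ void calc(float* x, int n, float a, float b, float c,
                         float d, float e, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }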

For calculations that generate output to be used by the CPU, a major hurdle is the relatively slow graphics bus. The first result involves 400 KB transferred to the graphics card and back, 2500 times. Speed doubles when the same data is reused and only the output is transferred back, and it is fastest when the data is not returned at all (a few sums could be). Here, the CPU-based calculations from the L2 cache are much faster.
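These three modes correspond roughly to the following host-side pattern, using the calc kernel sketched above; the variable names (d_x, h_x, bytes, passes) are assumptions, and error checking and timer code are omitted:

    // "Data in & out": input copied to the card and the result copied back each pass.
    for (int pass = 0; pass < passes; ++pass) {
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        calc<<<blocks, threads>>>(d_x, n, a, b, c, d, e, f);
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    }

    // "Data out only": the data stays resident on the card; only results come back.
    for (int pass = 0; pass < passes; ++pass) {
        calc<<<blocks, threads>>>(d_x, n, a, b, c, d, e, f);
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    }

    // "Calculate only": no transfers inside the timed loop.
    for (int pass = 0; pass < passes; ++pass)
        calc<<<blocks, threads>>>(d_x, n, a, b, c, d, e, f);
    cudaDeviceSynchronize();   // finish all kernels before reading the timer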

The second set uses data from RAM on the CPU side, which can be slower than CUDA calculations where no data transfer is involved. The next two sets have 32 operations per data word, where the number of GPU cores can lead to better performance than using the CPU.

The final calculations run the loop control on the GPU, including a variant that uses the GPU shared memory (cache). For best performance there should be minimal data in and out and lots of calculations on each data word; a sketch of the shared-memory variant follows the results table.

  AMD Phenom(tm) II X4 945 Processor  3000 MHz 4 Cores 
  GeForce GTS 250  with 16 Processors 128 cores 
  Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

                  GTS 250 ------------------------------   Phenom
  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS   MFLOPS
                   Words  /Wd  Passes                     4 Threads

 Data in & out    100000    2    2500  1.035893      483    22321
 Data out only    100000    2    2500  0.514445      972
 Calculate only   100000    2    2500  0.082464     6063

 Data in & out  10000000    2      25  0.639933      781     3240
 Data out only  10000000    2      25  0.339051     1475
 Calculate only 10000000    2      25  0.041672    11999

 Data in & out    100000   32    2500  1.057142     7568    58670
 Data out only    100000   32    2500  0.531691    15046
 Calculate only   100000   32    2500  0.128706    62157

 Data in & out  10000000   32      25  0.644074    12421    45377
 Data out only  10000000   32      25  0.357000    22409
 Calculate only 10000000   32      25  0.062001   129029

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.050288     9943
 Shared Memory  10000000    2      25  0.009206    54313

 Calculate      10000000   32      25  0.050531   158320
 Shared Memory  10000000   32      25  0.046626   171580
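The two "Extra tests" rows move the repeat loop into the kernel itself, so each word is fetched from global memory once and then reused on-chip. Here is a sketch of the shared-memory variant; it is my own illustration rather than the benchmark source, and it assumes 256 threads per block and the repeat count passed as a parameter:

    // Sketch: repeat loop on the GPU with the working value held in shared memory.
    __global__ void calc_shared(float* x, int n, int repeats, float a, float b,
                                float c, float d, float e, float f)
    {
        __shared__ float tile[256];                 // one element per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        tile[threadIdx.x] = x[i];                   // load from global memory once
        for (int r = 0; r < repeats; ++r)           // all further work is on-chip
            tile[threadIdx.x] = (tile[threadIdx.x] + a) * b
                              - (tile[threadIdx.x] + c) * d
                              + (tile[threadIdx.x] + e) * f;
        x[i] = tile[threadIdx.x];                   // store the result once
    }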