Tile static memory, as the name implies, is memory that is available per tile. Its primary use is sharing data between the threads in a tile, which is why you see the common pattern: a group of threads cooperatively loads data into tile_static memory (in parallel), carries out reads and writes on that memory, often with barriers to prevent race conditions, and finally writes a result back to global memory. Reading tile_static memory is much more efficient than reading global memory.
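To make the pattern concrete, here is a serial, plain C++ sketch of that load/barrier/compute/store shape (the function name and the 4-element tile size are made up for illustration). In a real C++ AMP kernel the `tile` buffer would be a `tile_static` array shared by the tile's threads, and the points marked "barrier" would be `t_idx.barrier.wait()` calls:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Serial emulation of the tiled pattern with hypothetical 4-element tiles.
std::vector<int> tile_sums(const std::vector<int>& global_mem, std::size_t tile_size) {
    assert(tile_size <= 4 && global_mem.size() % tile_size == 0);
    std::vector<int> result;
    for (std::size_t base = 0; base < global_mem.size(); base += tile_size) {
        std::array<int, 4> tile{};                 // stands in for: tile_static int tile[4];
        for (std::size_t i = 0; i < tile_size; ++i)
            tile[i] = global_mem[base + i];        // 1) cooperative load from global memory
        // barrier: all loads must complete before any thread reads tile[]
        int sum = std::accumulate(tile.begin(),
                                  tile.begin() + tile_size, 0); // 2) compute on tile data
        // barrier: all reads must complete before tile[] is reused
        result.push_back(sum);                     // 3) store result back to global memory
    }
    return result;
}
```

In the real kernel every load and every partial computation runs on a different thread of the tile, which is exactly why the barriers are needed.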
However, in your example you are not taking advantage of these properties of tile_static memory. Since you are not sharing this data between threads, you would be better off storing it in local (per-thread) memory. The same goes for the mtSet array. You should declare those arrays local to the kernel and initialize them there. If either array is constant, declare it as such so the compiler can place it in constant memory rather than local memory.
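A minimal sketch of that suggestion, in plain C++ (the names `kernel_body`, `mtTemp`, and `kCoeff` are illustrative, not from your code). In C++ AMP the same declarations would sit inside the `parallel_for_each` lambda:

```cpp
#include <cassert>

int kernel_body(int idx) {
    // Per-thread scratch: an array declared inside the kernel lambda lives
    // in per-thread (local) memory, so no tile_static or barriers are needed.
    int mtTemp[4] = {0, 0, 0, 0};
    // Constant lookup data: declaring it const lets the compiler place it
    // in constant/read-only memory on a GPU target instead of local memory.
    static const int kCoeff[4] = {1, 2, 3, 4};
    for (int i = 0; i < 4; ++i)
        mtTemp[i] = kCoeff[i] * idx;   // each thread works on its own copy
    return mtTemp[3];
}
```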
Depending on how large this data is, you may run into occupancy issues. Local memory is very limited, typically tens of KB per multiprocessor. If each thread uses too much of it, the GPU cannot keep as many warps resident, which limits its ability to hide latency by switching to other warps while the current ones are blocked on memory. If this turns out to be an issue, consider re-partitioning the work done by each thread.
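The back-of-the-envelope arithmetic looks like this (the register-file size and maximum resident thread count below are example values only; real limits vary by GPU generation):

```cpp
#include <algorithm>

// Illustrative occupancy estimate: if per-thread local data spills into
// registers/local memory, per-thread usage caps how many threads can be
// resident on one multiprocessor at once.
double occupancy(int regs_per_thread,
                 int regs_per_sm = 65536,   // example register file size
                 int max_threads = 2048) {  // example max resident threads
    int resident = std::min(regs_per_sm / regs_per_thread, max_threads);
    return static_cast<double>(resident) / max_threads;
}
```

So doubling each thread's register/local footprint can halve the number of warps available to hide latency, even though the kernel is otherwise unchanged.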
Most of this is covered in the chapters on Optimization and Performance in my C++ AMP book. The following also contains a good overview of the different types of GPU memory, although it is written in terms of CUDA rather than C++ AMP.