Question

I am trying to copy data to tile_static for a long-running process. In all of the examples I have seen, an array is declared and data is filled in piece by piece for each thread in the tile, and those threads then share that data. What I want to do is just copy some data into tile_static for use by a single thread. I don't need to share it, but since it is heavily used by a long-running thread, my understanding is that keeping it in tile_static would improve performance. I am not sure this is the right way to go about it, though. The tile_static declaration I am trying to make is near the bottom, in the parallel_for_each loop, and looks like this:

tile_static vector<int_2> route = av_RouteSet[t_idx.global[0]];

I've included additional code for clarity.

vector<float> tiledTSPCompute(accelerator_view accl, city_set CityLocations, int NumberOfTiles,
float StartTemp, float EndTemp, float CoolingCoefficient, unsigned int MovesPerTemp){
    // Setting tile size
    static const int TS = 16;
    // Setting number of runs in terms of number of tiles
    int NumberOfRuns = NumberOfTiles * TS * TS;
    // Get results vector ready
    vector<float> Results(NumberOfRuns);
    array_view<float> av_Results(Results);
    // Get routes ready
    vector<int_2> RouteSet(sizeof(CityLocations.Cities) * NumberOfRuns);
    array_view<int_2, 2> av_RouteSet(NumberOfRuns, sizeof(CityLocations.Cities), RouteSet);
    // Prepare extent
    concurrency::extent<1> e(NumberOfRuns);
    // Create RNG
    tinymt_collection<1> mtSet(e, 500);

    concurrency::parallel_for_each(accl, av_Results.extent.tile<TS, TS>(), [=](tiled_index<TS, TS> t_idx) restrict(amp){
        auto& mt = mtSet[t_idx.global];
        //What I would like to do
        tile_static vector<int_2> route = av_RouteSet[t_idx.global[0]];

        Tiled_InitializeRoute(route);
        Tiled_RandomizeRoute(route, mt);
        Tiled_HeuristicRun(StartTemp, EndTemp, CoolingCoefficient, CityLocations, route, MovesPerTemp, mt);
        av_Results[t_idx.global] = Tiled_TotalRouteDistance(route, CityLocations);
    });
};

Solution

Tile static memory, as the name implies, is memory that is available per tile. Its primary use is sharing memory between the threads in a tile, which is why you see the common pattern: a group of threads loads tile_static memory in parallel, carries out reads and writes to that memory (often with barriers to prevent race conditions), and finally writes a result back to global memory. Tile static memory is much faster to access than global memory.
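For illustration, here is a minimal sketch of that cooperative pattern. The names (tiledSum, data, result) are invented for this example; it assumes data's extent is a multiple of TS and that result holds one element per tile.

#include <amp.h>
using namespace concurrency;

void tiledSum(array_view<const int> data, array_view<int> result){
    static const int TS = 16;
    parallel_for_each(data.extent.tile<TS>(), [=](tiled_index<TS> t_idx) restrict(amp){
        // 1. Each thread loads one element into tile_static memory.
        tile_static int cache[TS];
        cache[t_idx.local[0]] = data[t_idx.global];
        // 2. Barrier: no thread reads until every thread has written.
        t_idx.barrier.wait();
        // 3. One thread per tile combines the shared data...
        if (t_idx.local[0] == 0){
            int sum = 0;
            for (int i = 0; i < TS; i++)
                sum += cache[i];
            // 4. ...and writes a single result back to global memory.
            result[t_idx.tile] = sum;
        }
    });
}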

However, in your example you are not taking advantage of these properties of tile memory. You would be better off using thread-local memory for this data, since you are not sharing it between threads. The same goes for the mtSet array. Declare those arrays local to the kernel and initialize them there. If either of them is constant, declare it as such so that it can be placed in constant memory rather than local memory.
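As a sketch of that alternative: a fixed-size array declared inside the kernel gives each thread its own private copy. Note that restrict(amp) code cannot use std::vector or allocate dynamically, so MaxCities below is a hypothetical compile-time bound on the route length, and the kernel body is a placeholder rather than your heuristic.

#include <amp.h>
#include <amp_graphics.h> // int_2 lives in concurrency::graphics
using namespace concurrency;
using concurrency::graphics::int_2;

void perThreadRoute(array_view<const int_2, 2> av_RouteSet, array_view<float> av_Results){
    static const int MaxCities = 64; // hypothetical fixed route length
    parallel_for_each(av_Results.extent, [=](index<1> idx) restrict(amp){
        // Each thread works on its own private copy; nothing is shared
        // between threads, so no tile_static and no barriers are needed.
        int_2 route[MaxCities];
        for (int i = 0; i < MaxCities; i++)
            route[i] = av_RouteSet(idx[0], i);

        // ... initialize, randomize, and run the heuristic on route here ...

        av_Results[idx] = static_cast<float>(route[0].x); // placeholder result
    });
}

Because nothing is shared, the kernel does not even need a tiled extent; a plain parallel_for_each over the number of runs is enough.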

Depending on how large this data is, you may run into occupancy issues. Local memory is very limited, typically tens of KB. If you use too much per thread, the GPU cannot schedule additional warps, which limits its ability to hide latency by switching to other warps when the current ones are blocked. If this becomes an issue, you may want to re-partition the work done by each thread.
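To put illustrative numbers on this: an int_2 is 8 bytes, so a hypothetical 512-city route costs 512 × 8 = 4 KB per thread. A 16 × 16 tile of 256 threads would then need 1 MB if each thread kept its copy in tile_static, while DirectX 11 hardware exposes at most 32 KB of tile_static memory per tile. Even as per-thread local storage, 4 KB per thread sharply limits how many threads the scheduler can keep in flight.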

Most of this is covered in the chapters on optimization and performance in my C++ AMP book. The following also contains a good overview of the different types of GPU memory, although it is written in terms of CUDA rather than C++ AMP.
