It seems your question is about implementing some sort of memoisation facility inside GPU code. I don't think this is worth pursuing. On a GPU, arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary or hash-table look-up in GPU memory to retrieve a cached arithmetic result is almost guaranteed to be slower than simply recalculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.
In a relatively slow interpreted language like Python, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of complete kernel calls on the host side can also yield useful performance benefits for expensive kernels. But memoisation inside CUDA device code doesn't seem all that useful.