It seems your question is about implementing some sort of memoisation facility inside GPU code. I don't think this is worth pursuing. On a GPU, arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary or hash-table look-up in GPU memory to retrieve a cached arithmetic result is almost guaranteed to be slower than simply recalculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.
In a relatively slow interpreted language like Python, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of complete kernel calls on the host side can also yield useful performance benefits for expensive kernels. But memoisation inside CUDA device code doesn't seem all that useful.