On a recent SO question, I explained how calling a RenderScript kernel multiple times will effectively force all threads to be globally synchronized between calls.
I am currently working with multiple convolutions applied in sequence to image data. Since the convolution algorithm requires reading surrounding pixel data of the input image, I have implemented a workflow where my own custom kernel is called multiple times -- to make sure that at every step, all data from the previous convolution is ready and available at the correct coordinates. This technique has worked great for me so far.
However, in my constant quest for optimization, I have noticed that there is much performance to be obtained by keeping intermediate values in local registers for a thread, instead of writing them back to the global memory allocation in between kernel calls. If I were able to chain these convolutions in such a way, things would run much quicker. The problem is obviously that accessing the registers of surrounding threads is not really possible. Furthermore, this would require threads to run in synch to make sure these intermediate values in between stages get calculated in the expected order.
In CUDA and OpenCL, these issues are very common, and are addressed by well-known barrier synchronization + shared memory tiling techniques, which in turn depend on the concept of CUDA thread blocks or OpenCL work groups. I believe these concepts are non-existent in RenderScript, as this issue is very much tied to the wildly different architectures between desktop-class GPU's and mobile SoC's.
So my obvious question here is, are such things possible in RenderScript? That is, better management of threads and possibly thread groups for quicker data sharing among them.
On the Google I/O 2013 RenderScript talk by Jason Sams and Tim Murray, it is discussed how Script Groups might be able to do some behind the scenes optimizations, such as cross-device parallelization, memory tiling, and kernel fusion; all this by analyzing at runtime the dependency DAG in the group, and either automatically creating allocations where needed or possibly optimizing them away. I'm assuming this last bit referes to fusing kernels so that they work off their own local data, kind of how I mentioned above keeping data in local registers and combining separate steps inside a single kernel.
All this seems very much in line with what I'm looking for, especially since my application is indeed a well-defined DAG of inter-dependent operations (for a Convolutional Neural Network). So if Script Groups are indeed a plausible mobile-centric alternative to these mechanisms, I'm wondering if there is any way of influencing how and where these optimizations happen. Or if not, how much can the runtime be trusted to make the correct inference from my data dependencies given the hardware its running on -- in the specific case of "surrounding" pixel data access of the convolutional algorithm.
I realize this might all still be work in pogress, and methods would be highly hardware dependent at this point. So if there is no straight solution for such matters at the present time -- I'd be very much willing to accept a speculative answer on how this kind of workflow might potentially be approached by RenderScript in future releases.
I'd be immensely grateful on some insight about this, as it would greatly affect the development direction of my own project going forward, not to mention there are surely many other people out there wondering how such general parallel computing tasks can be handled in RS.
Thank you very much!