Pregunta

On a recent SO question, I explained how calling a RenderScript kernel multiple times will effectively force all threads to be globally synchronized between calls.

I am currently working with multiple convolutions applied in sequence to image data. Since the convolution algorithm requires reading surrounding pixel data of the input image, I have implemented a workflow where my own custom kernel is called multiple times -- to make sure that at every step, all data from the previous convolution is ready and available at the correct coordinates. This technique has worked great for me so far.

However, in my constant quest for optimization, I have noticed that there is much performance to be obtained by keeping intermediate values in local registers for a thread, instead of writing them back to the global memory allocation in between kernel calls. If I were able to chain these convolutions in such a way, things would run much quicker. The problem is obviously that accessing the registers of surrounding threads is not really possible. Furthermore, this would require threads to run in synch to make sure these intermediate values in between stages get calculated in the expected order.

In CUDA and OpenCL, these issues are very common, and are addressed by well-known barrier synchronization + shared memory tiling techniques, which in turn depend on the concept of CUDA thread blocks or OpenCL work groups. I believe these concepts are non-existent in RenderScript, as this issue is very much tied to the wildly different architectures between desktop-class GPU's and mobile SoC's.

So my obvious question here is, are such things possible in RenderScript? That is, better management of threads and possibly thread groups for quicker data sharing among them.


On the Google I/O 2013 RenderScript talk by Jason Sams and Tim Murray, it is discussed how Script Groups might be able to do some behind the scenes optimizations, such as cross-device parallelization, memory tiling, and kernel fusion; all this by analyzing at runtime the dependency DAG in the group, and either automatically creating allocations where needed or possibly optimizing them away. I'm assuming this last bit referes to fusing kernels so that they work off their own local data, kind of how I mentioned above keeping data in local registers and combining separate steps inside a single kernel.

All this seems very much in line with what I'm looking for, especially since my application is indeed a well-defined DAG of inter-dependent operations (for a Convolutional Neural Network). So if Script Groups are indeed a plausible mobile-centric alternative to these mechanisms, I'm wondering if there is any way of influencing how and where these optimizations happen. Or if not, how much can the runtime be trusted to make the correct inference from my data dependencies given the hardware its running on -- in the specific case of "surrounding" pixel data access of the convolutional algorithm.

I realize this might all still be work in pogress, and methods would be highly hardware dependent at this point. So if there is no straight solution for such matters at the present time -- I'd be very much willing to accept a speculative answer on how this kind of workflow might potentially be approached by RenderScript in future releases.

I'd be immensely grateful on some insight about this, as it would greatly affect the development direction of my own project going forward, not to mention there are surely many other people out there wondering how such general parallel computing tasks can be handled in RS.

Thank you very much!

¿Fue útil?

Solución

As you've discovered, there's no way in RS to directly share data across threads. However, what you are describing can be done using a ScriptGroup. The catch is that each script in the group has to be unique, so you cannot feed your same script over and over. At least, not as it is written now. You could certainly put the "core" of your script in a RS header and include it from multiple kernels. The ScriptGroup allows you to have the output from one script become the input of another, or the output of one script becomes a global field in another. The documentation states that the kernel to kernel (output to input) is the more efficient use case. Using this approach, your synchronization issue would be resolved as the engine will execute the first script against the entire input data set before starting the second script, etc. The scripts themselves will be parallelized appropriately for the hardware (using either CPU or GPU/DSP). The engine will not have to pop back out to Java between scripts and can also manage the data allocations behind the scenes, if needed.

Something you may notice is the ScriptGroup utilizes Script.KernelID or Script.FieldID in order to identify the exact script or field in which to connect two kernels. Your custom scripts have these things auto-generated as long as you explicitly call out your kernel function using the RS compiler attribute pragma. Then you can call getKernelID_<name> (where 'name' is the kernel function name from your script) to get the kernel ID.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top