You need a reduction that partially sums m elements at a time in a @Local annotated array (this is how Aparapi exposes local memory). If there are N elements in total, you get N/m partial sums. Computing them needs high bandwidth, which is why they belong in local memory.
Also use localBarrier(); to synchronize the work items of a local work group (the cores of one compute unit) before the partial results are read. Then write the group's result to main memory.
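To show why the barrier matters, here is a minimal CPU-only sketch in plain Java. It is not Aparapi code: threads stand in for the cores of one compute unit, a plain array stands in for the @Local array, and java.util.concurrent.CyclicBarrier plays the role of localBarrier(). The class and method names are made up for illustration.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class BarrierReductionSketch {
    // Sums data with `workers` threads that share one "local" array.
    static long reduceGroup(int[] data, int workers) {
        long[] local = new long[workers];   // stands in for the @Local array
        long[] result = new long[1];        // stands in for main memory
        CyclicBarrier barrier = new CyclicBarrier(workers);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                // Phase 1: every worker adds its own share of the elements.
                long sum = 0;
                for (int i = id; i < data.length; i += workers) {
                    sum += data[i];
                }
                local[id] = sum;
                // Wait until all workers have written their slot
                // (this is the job localBarrier() does on the GPU).
                try {
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                // Phase 2: one worker adds the slots and stores the result.
                if (id == 0) {
                    long total = 0;
                    for (long v : local) {
                        total += v;
                    }
                    result[0] = total;
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(reduceGroup(new int[] {1, 2, 3, 4, 5, 6}, 3));
    }
}
```

Without the barrier, worker 0 could read slots that the other workers have not written yet, which is the same race you would get on the GPU if localBarrier() were left out.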
Very short example:
You need 1+2+3+4+5+6+... and the GPU has 3 cores per compute unit:
1+2 is done in core number 1
3+4 is done in core number 2
5+6 is done in core number 3
Then you add the cores' outputs in local memory, which is shared by all neighbouring cores of that compute unit.
In the end you have 3, 7, 11 in an array, and these are summed to 21 in local memory.
Upload these results from all compute units to main memory (21, 57, ...), then simply add them all on the CPU.
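The numbers in this example can be checked with a small sequential simulation in plain Java. This is only a sketch of the two-level scheme, not GPU code: the group size of 3 cores and 2 elements per core come from the example above, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class WorkGroupReductionSketch {
    static final int CORES_PER_GROUP = 3; // from the example above
    static final int ITEMS_PER_CORE = 2;  // each core adds two numbers

    // Returns one partial sum per simulated compute unit.
    static long[] groupSums(int[] data) {
        int itemsPerGroup = CORES_PER_GROUP * ITEMS_PER_CORE;
        int groups = (data.length + itemsPerGroup - 1) / itemsPerGroup;
        long[] result = new long[groups]; // what ends up in main memory
        for (int g = 0; g < groups; g++) {
            long[] local = new long[CORES_PER_GROUP]; // the group's local memory
            // Step 1: each core adds its own elements (1+2, 3+4, 5+6).
            for (int core = 0; core < CORES_PER_GROUP; core++) {
                int start = g * itemsPerGroup + core * ITEMS_PER_CORE;
                int end = Math.min(start + ITEMS_PER_CORE, data.length);
                for (int i = start; i < end; i++) {
                    local[core] += data[i];
                }
            }
            // On a GPU the barrier would go here, before local[] is read.
            // Step 2: the per-core outputs (3, 7, 11) are summed to one value.
            long sum = 0;
            for (long v : local) {
                sum += v;
            }
            result[g] = sum;
        }
        return result;
    }

    // Final step on the CPU: add the per-group results.
    static long total(long[] sums) {
        long t = 0;
        for (long v : sums) {
            t += v;
        }
        return t;
    }

    public static void main(String[] args) {
        int[] data = new int[12];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        long[] sums = groupSums(data);
        // prints "[21, 57] -> 78"
        System.out.println(Arrays.toString(sums) + " -> " + total(sums));
    }
}
```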
Of course, the actual terms will be (+/-)(1/(2*n+1)) instead of 1, 2, 3, 4, 5.
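As a sketch of what each core would compute per element, here is that term in plain Java. It assumes the terms are those of the Leibniz series (sign + for even n, - for odd n, starting at n = 0), whose sum converges to pi/4. The names are made up for illustration, and the loop is sequential, so it only shows the arithmetic, not the GPU layout.

```java
class LeibnizTermSketch {
    // n-th term: (+/-) 1/(2*n+1), assuming + for even n and - for odd n.
    static double term(int n) {
        double sign = (n % 2 == 0) ? 1.0 : -1.0;
        return sign / (2.0 * n + 1.0);
    }

    // Adds the first `terms` terms and scales by 4 to estimate pi.
    static double piEstimate(int terms) {
        double sum = 0.0;
        for (int n = 0; n < terms; n++) {
            sum += term(n);
        }
        return 4.0 * sum;
    }

    public static void main(String[] args) {
        System.out.println(piEstimate(1_000_000));
    }
}
```

In the reduction, each core would evaluate term(n) for its own indices and add the values into its local slot, exactly where the example above adds 1+2, 3+4 and 5+6.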