It's conceivable to use CUDA to reduce multiple arrays "in parallel". Reduction (summation) isn't a terribly compute-intensive operation, so if the data is not already resident on the GPU, the cost to transfer the data to the GPU is likely to be a significant part (probably the majority) of the overall execution time. From your description, it's not clear whether you're already doing this in some fashion on the GPU, or on the CPU. But if the data is on the GPU, then summing via parallel reduction will be fastest.
Unless a single array is larger than ~2GB, the number of threads is unlikely to be an issue.
You could craft a kernel which simply reduces the arrays one after the other, in sequence. It seems you are saying there are N arrays, where N is around 9000. How big is each array? If the arrays are large enough, roughly all of the power of the GPU can be brought to bear on each individual operation, and in that case there's no significant penalty to reducing the arrays one after the other. The kernel could then be a basic parallel reduction that loops over the arrays. It should be pretty straightforward.
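A minimal sketch of that first approach might look like the following (untested; the kernel name, the flat `d_in` layout where array `a` occupies elements `a*len` through `a*len + len - 1`, and the use of `atomicAdd` to combine per-block partial sums are all my assumptions, not anything from your code):

```cuda
// Hypothetical kernel: a standard grid-stride + shared-memory parallel
// reduction, looped over the N arrays in sequence. d_out must be zeroed
// before launch, since per-block partial sums are accumulated atomically.
__global__ void reduce_arrays_seq(const float *d_in, float *d_out,
                                  int num_arrays, int len)
{
    extern __shared__ float sdata[];
    for (int a = 0; a < num_arrays; ++a) {
        const float *arr = d_in + (size_t)a * len;
        // grid-stride loop: each thread accumulates a private partial sum
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < len;
             i += gridDim.x * blockDim.x)
            sum += arr[i];
        sdata[threadIdx.x] = sum;
        __syncthreads();
        // shared-memory tree reduction within the block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(&d_out[a], sdata[0]);  // combine partial sums
    }
}
```

You'd launch it with a power-of-two block size and `blockDim.x * sizeof(float)` bytes of dynamic shared memory, with enough blocks to saturate the GPU for one array.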
If you have roughly 9000 arrays to crunch, and it's not difficult to order your data in an interleaved fashion, then you might also consider launching roughly 9000 threads, where each thread sums the elements of a single array in a serial loop, pretty much the way you'd do it naively in CPU code. Data organization would be critical here, because the goal of all of this is to maximize utilization of available memory bandwidth. As the loop in each thread picks up its next data element to be summed, you would want to ensure contiguous data reads amongst threads in a warp (coalesced access), which implies an interleaved data storage arrangement amongst your N arrays. If that were the case, this approach would run quite fast as well.
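A sketch of that second approach, assuming the interleaved ("transposed") layout where element `j` of array `i` lives at `d_in[j*num_arrays + i]` (the kernel name and layout are illustrative assumptions):

```cuda
// Hypothetical kernel: one thread per array. With the interleaved layout,
// on each loop iteration adjacent threads i, i+1, ... read adjacent
// addresses, so the reads coalesce into full-width memory transactions.
__global__ void reduce_arrays_interleaved(const float *d_in, float *d_out,
                                          int num_arrays, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which array this thread sums
    if (i >= num_arrays) return;
    float sum = 0.0f;
    for (int j = 0; j < len; ++j)              // serial loop over one array
        sum += d_in[(size_t)j * num_arrays + i];
    d_out[i] = sum;
}
```

With ~9000 threads this is enough to keep a GPU reasonably busy, and the kernel is memory-bandwidth bound, which is the best you can hope for with a reduction.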
By the way, you might take a look at Thrust, which is relatively easy to use and provides simple operations to do sum-reductions on arrays. As a prototype, it would be relatively easy to write a loop in Thrust code that iteratively summed a sequence of arrays on the GPU.
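For instance, a prototype along these lines (the function name and the assumption that the N arrays are stored concatenated in one `device_vector` are mine):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical helper: sums each of num_arrays concatenated arrays of
// length len, one thrust::reduce call per array, entirely on the GPU.
std::vector<float> sum_arrays(const thrust::device_vector<float> &d_data,
                              int num_arrays, int len)
{
    std::vector<float> sums(num_arrays);
    for (int a = 0; a < num_arrays; ++a)
        sums[a] = thrust::reduce(d_data.begin() + (size_t)a * len,
                                 d_data.begin() + (size_t)(a + 1) * len,
                                 0.0f);
    return sums;
}
```

Each `thrust::reduce` call launches its own kernel(s), so for ~9000 small arrays the launch overhead may dominate; it's a prototype to validate results, not the fastest option.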