Question

I'm new in OpenCL and I'm trying to implement power iteration method (described over here) matrix sizes over 100000x100000!

Actually I have no idea how to implement this. It's because workgroup have restriction CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one workgoup with 1000000 work-items) But on each step of iterating I need to synchronize and normalize vector.

1) So is it possible to make all calculations inside one kernel? (I think that answer is no if matrix sizes is more than CL_DEVICE_MAX_WORK_GROUP_SIZE)

2) Can I make "while" loop in the host code? and is it still profitable to use GPU in this case? something like: while (condition)

{

kernel calling

synchronization

}

Was it helpful?

Solution

  • 2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining to kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (Matrix Multiplation and Vector Normalization) I'd personally start with two different kernels that are invoked from the host in a while-loop.

  • 1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on Matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ) because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: Good luck ;-) and ... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually infinitely large. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256, and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL care about how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): The size of the work groups thus implicitly defines how large your chunks of local memory may be.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top