If you are using a single stream, it makes no difference whether you synchronize that one stream or call cudaDeviceSynchronize(). In terms of performance and effect, the two should be exactly the same. Note that when you use events to time part of your code (e.g., a cuBLAS call), it is good practice to call cudaDeviceSynchronize() before reading the elapsed time, so that the measurements are meaningful. In my experience, it doesn't impose any significant overhead and, besides, it's safer to time your kernels with it.
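To make the timing advice concrete, here is a minimal sketch of event-based timing with a device-wide synchronization before the elapsed time is read. The kernel scale and the helper time_scale_ms are hypothetical names made up for this illustration, and the sketch assumes all work is issued to the default stream.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel, used only so that there is something to time.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical helper: returns the GPU time of one kernel launch in
// milliseconds, or -1.0f if any CUDA call reports an error.
float time_scale_ms(int n) {
    assert(n > 0);

    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record the events around the kernel, all in the default stream (0).
    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaError_t err = cudaGetLastError();  // catches launch-configuration errors
    cudaEventRecord(stop, 0);

    // Block the host until all queued work, including the stop event,
    // has completed. Only then is the elapsed time well defined.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    float ms = -1.0f;
    if (err == cudaSuccess &&
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return ms;
}
```

Calling time_scale_ms(1 << 20) from your main function should return a non-negative number of milliseconds on a working device. If you only want to wait for the stop event rather than the whole device, cudaEventSynchronize(stop) is a narrower alternative.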
If your application uses multiple streams, then it makes sense to synchronize only on the stream you actually need to wait for, so that work in the other streams can keep running. I believe that this question will be helpful to you. You can also read the CUDA C Programming Guide, Section 3.2.5.5.
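As a rough illustration of the multi-stream case, the sketch below launches independent work in two streams and waits only for the one whose result the host needs. The names fill, first_result_from_stream_a, streamA and streamB are hypothetical and exist only for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: writes the same value into every element.
__global__ void fill(int *x, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = value;
}

// Hypothetical helper: runs independent work in two streams, waits only
// for stream A, and returns the first element written there (-1 on error).
int first_result_from_stream_a(int n) {
    assert(n > 0);

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));

    int blocks = (n + 255) / 256;
    fill<<<blocks, 256, 0, streamA>>>(d_a, 7, n);
    fill<<<blocks, 256, 0, streamB>>>(d_b, 9, n);

    // The copy is issued to stream A, so it runs after the kernel in A.
    // h_a is pageable host memory; that is fine for correctness here.
    int h_a = -1;
    cudaError_t err = cudaMemcpyAsync(&h_a, d_a, sizeof(int),
                                      cudaMemcpyDeviceToHost, streamA);

    // Wait only for stream A. Work in stream B may still be in flight.
    if (err == cudaSuccess) err = cudaStreamSynchronize(streamA);
    int result = (err == cudaSuccess) ? h_a : -1;

    // Before releasing resources, let stream B finish as well.
    cudaStreamSynchronize(streamB);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    return result;
}
```

Replacing cudaStreamSynchronize(streamA) with cudaDeviceSynchronize() would give the same result, but the host would then also wait for the work in streamB before it could use the value.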