Question

I'm interested to know if any common algorithms (sorting, searching, graphs, etc.) have been ported to OpenCL (or any GPU language), and how the performance compares to the same algorithm executed by the CPU. I'm specifically interested in the results (numbers).

Thanks!


Solution

There are quite a few samples of this sort of thing on Nvidia's website. Bear in mind that some things, such as sorting, need special algorithms for efficient parallelism and may not be quite as efficient as a non-threaded algorithm on a single core.

OTHER TIPS

GPUs are highly specialized hardware designed to do a small set of tasks very well and in a highly parallel fashion. That set is basically arithmetic (particularly single-precision floating-point math, although newer GPUs do quite well with double precision). As such, they're only suited to particular algorithms. I'm not sure if sorting fits that category (in the general case at least).

More common examples are pricing of financial instruments, large amounts of matrix maths and even defeating encryption (by brute force). That being said, I did find Fast parallel GPU-sorting using a hybrid algorithm.

Another commonly quoted example is running SETI@HOME on an Nvidia GPU but it's comparing apples to oranges. The units of work for GPUs are different (and highly limited) compared to what CPUs ordinarily do.

Have a look at thrust:

Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity.
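As a rough illustration (not from the original answer), sorting on the GPU with Thrust reads almost exactly like STL code; the data size and use of rand() here are arbitrary:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        // Fill a host vector with ~1M random integers.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        thrust::device_vector<int> d = h;             // copy host -> device
        thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
        thrust::copy(d.begin(), d.end(), h.begin());  // copy device -> host
        return 0;
    }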

BE WARY, VERY WARY of any performance numbers quoted for GPGPU. Lots of people like to post really impressive numbers that don't take into account the transfer time needed to get the input data from the CPU to the GPU and the output data back, both of which go over the PCIe bus and can be the real bottleneck.
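One way to keep yourself honest is to time the kernel alone and the full round trip including transfers, for example with CUDA events. A minimal sketch, with an arbitrary placeholder kernel and array size:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 24;                 // ~16M floats (~64 MB)
        std::vector<float> h(n, 1.0f);
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t t0, t1, t2, t3;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventCreate(&t2); cudaEventCreate(&t3);

        cudaEventRecord(t0);                   // start: includes transfers
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(t1);                   // start: kernel only
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(t2);                   // end: kernel only
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(t3);                   // end: includes transfers
        cudaEventSynchronize(t3);

        float kernelMs = 0, totalMs = 0;
        cudaEventElapsedTime(&kernelMs, t1, t2);
        cudaEventElapsedTime(&totalMs, t0, t3);
        printf("kernel only: %.3f ms, with PCIe transfers: %.3f ms\n", kernelMs, totalMs);

        cudaFree(d);
        return 0;
    }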

Image resizing must be common on many websites that accept image uploads.

Resizing a 2600-ish x 2000-ish, 2 MB JPEG image to 512x512 took 23.5 milliseconds in C# with the absolute lowest quality options and nearest-neighbour sampling. The function used was based on graphics.DrawImage(). CPU usage was 21.5%.

Getting "rgba byte array" extraction on C# side and sending it to GPU and resizing in GPU and getting results back into an image took 6.3 milliseconds and CPU usage was %12.7. This was done with a %55 cheaper gpu with just 320 cores.

That is only a 3.73x speedup.

The limiting factor here was sending the extracted 20 MB of raw pixel data (the JPEG is only 2 MB!) to the GPU. That time-consuming part was nearly 90% of the total time, including the C#-side byte array extraction! So I guess there would be at least about a 30x speedup if the extraction part could be done on the GPU too.

30X is not bad.

Then you could pipeline the extraction stage with the resizing stage to hide the memory-copy latency and gain even more speed. That could reach 40x-50x.
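A sketch of that kind of pipelining with two CUDA streams (the buffer names and the processImage kernel are placeholders, and the host buffers would need to be pinned with cudaMallocHost for the async copies to actually overlap):

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (int i = 0; i < numImages; ++i) {
        int s = i % 2;  // alternate streams so the copy for image i+1 overlaps the work on image i
        cudaMemcpyAsync(d_in[s], h_in[i], inBytes, cudaMemcpyHostToDevice, streams[s]);
        processImage<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);
        cudaMemcpyAsync(h_out[i], d_out[s], outBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for both streams to drain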

Then, if you increase the sampling quality (such as bicubic instead of nearest neighbour), the GPU side gains an even bigger advantage. Adding a 5x5 Gaussian filter added only 0.77 milliseconds. The CPU would add rather more time on top of that, especially if the Gaussian parameters needed differ from the C#/.NET implementation.
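For reference, a naive 5x5 Gaussian convolution maps cleanly onto one thread per output pixel. A sketch for a single 8-bit channel (the weight storage and image layout are assumptions):

    // Normalised 5x5 Gaussian weights, uploaded once, e.g. with cudaMemcpyToSymbol.
    __constant__ float gauss5[25];

    __global__ void blur5x5(const unsigned char* src, unsigned char* dst,
                            int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int dy = -2; dy <= 2; ++dy)
            for (int dx = -2; dx <= 2; ++dx) {
                // Clamp reads to the image border.
                int sx = min(max(x + dx, 0), width - 1);
                int sy = min(max(y + dy, 0), height - 1);
                sum += gauss5[(dy + 2) * 5 + (dx + 2)] * src[sy * width + sx];
            }
        dst[y * width + x] = (unsigned char)(sum + 0.5f);
    }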


Even if you are not satisfied with the speedup ratio, offloading to the GPU and keeping a "free core" on the CPU is still advantageous for pushing more work onto that server.

Now add in GPU power consumption (30 W vs 125 W in this example), and it is much more advantageous.


The CPU could hardly win in

 C[i]=A[i]+B[i]

benchmarks when both sides run optimized code, and you can still offload half of the arrays to the GPU and finish quicker by using the CPU and GPU at the same time.
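A sketch of that kind of CPU+GPU split (names are illustrative; the device buffers are assumed to be pre-allocated, and the host arrays would need to be pinned with cudaMallocHost for the copies to truly overlap the CPU loop):

    __global__ void addKernel(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    void addSplit(const float* a, const float* b, float* c, int n,
                  float* d_a, float* d_b, float* d_c /* device buffers of size n/2 */) {
        int half = n / 2;

        // Queue the GPU half asynchronously...
        cudaMemcpyAsync(d_a, a, half * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(d_b, b, half * sizeof(float), cudaMemcpyHostToDevice);
        addKernel<<<(half + 255) / 256, 256>>>(d_a, d_b, d_c, half);
        cudaMemcpyAsync(c, d_c, half * sizeof(float), cudaMemcpyDeviceToHost);

        // ...while the CPU works on the other half at the same time.
        for (int i = half; i < n; ++i)
            c[i] = a[i] + b[i];

        cudaDeviceSynchronize();   // wait for the GPU half to land in c[0..half)
    }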


GPUs are not built for non-uniform workloads. GPUs have deep pipelines, so recovering from a stall caused by branching takes too long. Also, the SIMD-style hardware forces all work-items in a group to do the same thing. When a work-item takes a different path than the rest of its group, it inserts bubbles into the whole SIMD pipeline, or the others simply wait at a sync point. So branching hurts both the deep and the wide parts of the pipeline and can make the GPU even slower than the CPU under perfectly chaotic conditions.
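A contrived illustration of that divergence effect (assumed kernels, not benchmarked here): in the first kernel, neighbouring threads in a warp take different branches, so the hardware has to execute both paths one after the other; in the second, the branch is uniform across a block, so each warp takes only one path:

    // Divergent: even and odd lanes of the same warp disagree on the branch.
    __global__ void divergent(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] = sqrtf(data[i]);
        else
            data[i] = data[i] * data[i];
    }

    // Uniform: the whole block (and therefore each warp) agrees on the branch.
    __global__ void uniform(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0)
            data[i] = sqrtf(data[i]);
        else
            data[i] = data[i] * data[i];
    }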

Licensed under: CC-BY-SA with attribution