Question

Your CPU may be a quad-core, but did you know that some graphics cards today have over 200 cores? We've already seen what the GPUs in today's graphics cards can do for graphics. Now they can be used for non-graphical tasks as well, and in my opinion the results are nothing short of amazing. An algorithm that lends itself well to parallelism has the potential to be much, much faster on a GPU than it could ever be on a CPU.

There are a few technologies that make all of this possible:

1.) CUDA by NVidia. It seems to be the most well-known and well-documented. Unfortunately, it'll only work on NVidia video cards. I've downloaded the SDK, tried out some of the samples, and there's some awesome stuff that's being done in CUDA. But the fact that it's limited to NVidia cards makes me question its future.

2.) Stream by ATI. ATI's equivalent to CUDA. As you might expect, it will only work on ATI cards.

3.) OpenCL - The Khronos Group has put together this standard, but it's still in its infancy. I like the idea of OpenCL, though. The hope is that it will be supported by most video card manufacturers and will make cross-vendor development that much easier.

But what other technologies for non-graphical GPU programming are coming and what shows the most promise? And do you see or would you like to see these technologies being built into some of the mainstream development frameworks like .NET to make it that much easier?

Solution

I foresee that this technology will become popular and mainstream, but it will take some time to do so. My guess is about 5 to 10 years.

As you correctly noted, one major obstacle to the adoption of the technology is the lack of a common library that runs on most adapters - both ATI and nVidia. Until this is solved to an acceptable degree, the technology will not enter the mainstream and will stay in the niche of custom-made applications that run on specific hardware.

As for integrating it with C# and other high-level managed languages - this will take a bit longer, but XNA already demonstrates that custom shaders and a managed environment can mix together, to a certain degree. Of course, the shader code is still not in C#, and there are several major obstacles to doing so.

One of the main reasons for the fast execution of GPU code is that it has severe limitations on what the code can and cannot do, and it uses VRAM instead of the usual RAM. This makes it difficult to bring CPU code and GPU code together. While workarounds are possible, they would practically negate the performance gain.

One possible solution that I see is to make a sub-language for C# that has its own limitations, is compiled to GPU code, and has a strictly defined way of communicating with the usual C# code. However, this would not be much different from what we have already - just more comfortable to write because of some syntactic sugar and standard library functions. Still, this too is ages away for now.
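
To make that boundary concrete, here is a minimal CUDA sketch (plain C/C++ with CUDA, not C#, and purely illustrative) of the split described above: the GPU code is a separate function with its own restrictions, and all communication with the CPU goes through explicit copies between system RAM and VRAM plus a kernel launch.

    // scale.cu - minimal illustration of the CPU/GPU split (build with: nvcc scale.cu -o scale)
    #include <cstdio>
    #include <cuda_runtime.h>

    // GPU code: runs on the device and can only touch device (VRAM) pointers.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *host = new float[n];
        for (int i = 0; i < n; ++i) host[i] = 1.0f;

        // CPU and GPU memories are separate: allocate VRAM and copy the input over.
        float *dev = 0;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // The only way to "call" GPU code is a kernel launch with a fixed interface.
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

        // Results must be copied back before the CPU can see them.
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("host[0] = %f\n", host[0]);

        cudaFree(dev);
        delete[] host;
        return 0;
    }

Any C# sub-language of the kind described would essentially have to generate and hide exactly this kind of plumbing.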

OTHER TIPS

I think you can count the next DirectX as another way to use the GPU.

From my experience, GPUs are extremely fast for algorithms that are easy to parallelize. I recently optimized a special image resizing algorithm in CUDA to be more than 100 times faster on the GPU (not even a high-end one) than on a quad-core Intel processor. The problem was getting the data to the GPU and then fetching the result back to main memory; both directions were limited by the memcpy() speed on that machine, which was less than 2 GB/s. As a result, the overall algorithm ended up only slightly faster than the CPU version...
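
To see where the time goes, you can time the transfers separately from the kernel. A rough sketch with CUDA events (illustrative only, not the resizing code above):

    // transfer_vs_compute.cu - separate the copy time from the kernel time with CUDA events.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 0.5f + 1.0f;
    }

    int main()
    {
        const int n = 64 << 20;                            // 256 MB of floats
        float *host = 0, *dev = 0;
        cudaMallocHost((void **)&host, n * sizeof(float)); // pinned memory copies faster than pageable
        cudaMalloc(&dev, n * sizeof(float));               // (contents don't matter for a bandwidth test)

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        dummy_kernel<<<(n + 255) / 256, 256>>>(dev, n);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float copy_ms = 0, kernel_ms = 0;
        cudaEventElapsedTime(&copy_ms, t0, t1);
        cudaEventElapsedTime(&kernel_ms, t1, t2);
        printf("copy: %.1f ms (%.2f GB/s), kernel: %.1f ms\n",
               copy_ms, (n * sizeof(float) / 1e9) / (copy_ms / 1e3), kernel_ms);
        return 0;
    }

On a machine with roughly 2 GB/s of host-to-device bandwidth, the copies alone can easily dominate a fast kernel; pinned (page-locked) host memory, as used above, usually helps, but it doesn't remove the bottleneck.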

So it really depends. If you have a scientific application where you can keep most of the data on the GPU, and all algorithms map to a GPU implementation, then fine. Else I would wait until there's a faster pipe between CPU and GPU, or let's see what ATI has up their sleeves with a combined chip...

About which technology to use: I think once you have your stuff running in CUDA, the additional step of porting it to OpenCL (or another language) is not so large. You did all the heavy work by parallelizing your algorithms; the rest is just a different 'flavor'.

Monte Carlo is embarrassingly parallel, but it is a core technique in financial and scientific computing.

One of the respondents is slightly incorrect to say that most real world challenges are not decomposable easily into these types of tasks.

Much tractable scientific investigation is done by leveraging what can be expressed in an embarrassingly parallel manner.

Just because it is named "embarrassingly" parallel does not mean it is not an extremely important field.
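
For a concrete picture of why it maps so well onto a GPU, here is a toy CUDA sketch (nothing like production pricing code) that estimates pi by Monte Carlo: every thread runs its own independent stream of trials, and the results are only combined at the very end.

    // pi_montecarlo.cu - toy Monte Carlo estimate of pi, one independent stream per thread.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    __global__ void mc_pi(unsigned long long *hits, int trials_per_thread, unsigned long long seed)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, tid, 0, &state);          // independent random stream per thread

        unsigned long long local = 0;
        for (int i = 0; i < trials_per_thread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++local;
        }
        atomicAdd(hits, local);                     // combine the independent results at the end
    }

    int main()
    {
        const int blocks = 256, threads = 256, trials = 4096;
        unsigned long long *d_hits, h_hits = 0;
        cudaMalloc(&d_hits, sizeof(unsigned long long));
        cudaMemcpy(d_hits, &h_hits, sizeof(h_hits), cudaMemcpyHostToDevice);

        mc_pi<<<blocks, threads>>>(d_hits, trials, 1234ULL);
        cudaMemcpy(&h_hits, d_hits, sizeof(h_hits), cudaMemcpyDeviceToHost);

        double total = double(blocks) * threads * trials;
        printf("pi ~= %f\n", 4.0 * h_hits / total);
        cudaFree(d_hits);
        return 0;
    }

Real financial Monte Carlo does far heavier work per path, but the structure - many independent trials, one cheap reduction at the end - is exactly the same.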

I've worked in several financial houses, and we foresee that we can replace farms of 1000+ Monte Carlo engines (many stacks of blades lined up together) with several large NVidia CUDA installations - massively decreasing power and heat costs in the data centre.

One significant architectural benefit is that there is also a lot less network load, as there are far fewer machines that need to be fed data and report their results.

Fundamentally, however, such technologies sit at a level of abstraction lower than a managed runtime language such as C#; we are talking about hardware devices that run their own code on their own processors.

Integration should first be done with Matlab and Mathematica, I'd expect, along with the C APIs of course...

Another technology that's coming for GPU-based processing is GPU versions of existing high-level computational libraries. Not very flashy, I know, but it has significant advantages for portable code and ease of programming.

For example, AMD's Stream 2.0 SDK includes a version of their BLAS (linear algebra) library with some of the computations implemented on the GPU. The API is exactly the same as their CPU-only version of the library that they've shipped for years and years; all that's needed is relinking the application, and it uses the GPU and runs faster.
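
The same drop-in idea exists on the CUDA side with cuBLAS (shown here only as an analogous sketch, since I don't have the Stream BLAS headers to hand): the call is an ordinary GEMM, and the library takes care of the GPU.

    // gemm_gpu.cu - the "drop-in library" idea, sketched with cuBLAS.
    // Build with: nvcc gemm_gpu.cu -lcublas
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 512;
        std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);

        float *da, *db, *dc;
        cudaMalloc(&da, n * n * sizeof(float));
        cudaMalloc(&db, n * n * sizeof(float));
        cudaMalloc(&dc, n * n * sizeof(float));
        cudaMemcpy(da, a.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // C = alpha * A * B + beta * C -- the same math as a CPU BLAS sgemm call.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, da, n, db, n, &beta, dc, n);

        cudaMemcpy(c.data(), dc, n * n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("c[0] = %f (expected %f)\n", c[0], 2.0f * n);

        cublasDestroy(handle);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }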

Similarly, Dan Campbell at GTRI has been working on a CUDA implementation of the VSIPL standard for signal processing. (In particular, the sort of signal and image processing that's common in radar systems and related things like medical imaging.) Again, that's a standard interface, and applications that have been written for VSIPL implementations on other processors can simply be recompiled with this one and use the GPU's capability where appropriate.

In practice, quite a lot of high-performance numerical programs these days do not do their own low-level programming, but rely on libraries. On Intel hardware, if you're doing number-crunching, it's generally hard to beat the Intel math libraries (MKL) for most things they implement -- and using them means that you get the advantages of all of the vector instructions and clever tricks in newer x86 processors without having to specialize your code for them. With things like GPUs, I suspect this will become even more prevalent.

So I think a technology to watch is the development of general-purpose libraries that form core building blocks for applications in specific domains, in ways that capture parts of those algorithms that can be efficiently sent off to the GPU while minimizing the amount of nonportable GPU-specific cleverness required from the programmer.

(Bias disclaimer: My company has also been working on a CUDA port of our VSIPL++ library, so I'm inclined to think this is a good idea!)

Also, in an entirely different direction, you might want to check out some of the things that RapidMind is doing. Their platform was initially intended for multicore CPU-type systems, but they've been doing a good bit of work extending it to GPU computations as well.

Pretty much anything that can be parallelized may be able to benefit. More specific examples would be SETI@home, Folding@home, and other distributed projects, as well as scientific computing.

Especially things that rely heavily on floating point arithmetic. This is because GPUs have specialized circuitry which is VERY fast at floating point operations. This means it's not as versatile, but it's VERY good at what it does do.

If you want to look at more dedicated GPU processing, check out Nvidia's Tesla GPU. It's a GPU, but it doesn't actually have a monitor output!

I doubt we will see too much GPU processing on the common desktop, or at least not for a while, because not everyone has a CUDA-capable (or similar) graphics card, if they even have a graphics card at all. It's also very difficult to make programs more parallel. Games could possibly utilize this extra power, but it will be very difficult and probably won't be too useful, since graphics calculations are mostly already on the GPU and the other work is on the CPU - and has to be on the CPU because of the instruction sets.

GPU processing, at least for a while, will be for very specific niche markets that need a lot of floating point computation.

It's important to keep in mind that even tasks that are inherently serial can benefit from parallelization if they must be performed many times independently.
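
As a purely illustrative sketch of that point: each instance below is strictly serial - every step depends on the previous one - yet thousands of independent instances can run at once, one per thread.

    // batch_serial.cu - thousands of independent, internally serial computations in parallel.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void iterate_map(float *out, int n_instances, int n_steps)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_instances) return;

        float x = 0.5f;
        float r = 3.5f + 0.000001f * i;        // a slightly different parameter per instance
        for (int s = 0; s < n_steps; ++s)
            x = r * x * (1.0f - x);            // logistic map: strictly serial within one instance
        out[i] = x;
    }

    int main()
    {
        const int n_instances = 1 << 16, n_steps = 10000;
        float *d_out;
        cudaMalloc(&d_out, n_instances * sizeof(float));

        iterate_map<<<(n_instances + 255) / 256, 256>>>(d_out, n_instances, n_steps);

        float first;
        cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("instance 0 ended at %f\n", first);
        cudaFree(d_out);
        return 0;
    }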

Also, bear in mind that whenever anyone reports the speedup of a GPU implementation to a CPU implementation, it is almost never a fair comparison. To be truly fair, the implementers must first spend the time to create a truly optimized, parallel CPU implementation. A single Intel Core i7 965 XE CPU can achieve around 70 gigaflops in double precision today. Current high-end GPUs can do 70-80 gigaflops in double precision and around 1000 in single precision. Thus a speedup of more than 15 may imply an inefficient CPU implementation.

One important caveat with GPU computing is that it is currently "small scale". With a supercomputing facility, you can run a parallelized algorithm on hundreds or even thousands of CPU cores. In contrast, GPU "clusters" are currently limited to about 8 GPUs connected to one machine. Of course, several of these machines can be combined together, but this adds additional complexity as the data must not only pass between computers but also between GPUs. Also, there isn't yet an MPI equivalent that lets processes transparently scale to multiple GPUs across multiple machines; it must be manually implemented (possibly in combination with MPI).
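
To give a flavour of what "manually implemented" means, here is a rough sketch (my own illustration, not from any particular framework) of splitting one array across however many GPUs sit in a single machine; going beyond one machine would still need MPI or something similar layered on top.

    // multi_gpu.cu - manually splitting one job across all GPUs in a single machine.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void add_one(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main()
    {
        int device_count = 0;
        cudaGetDeviceCount(&device_count);
        if (device_count == 0) { printf("no CUDA devices\n"); return 1; }

        const int n = 1 << 22;
        std::vector<float> host(n, 0.0f);
        const int chunk = (n + device_count - 1) / device_count;

        // One chunk per GPU; for simplicity the devices are processed one after another here.
        for (int dev = 0; dev < device_count; ++dev) {
            int offset = dev * chunk;
            int count  = (offset + chunk <= n) ? chunk : n - offset;
            if (count <= 0) break;

            cudaSetDevice(dev);
            float *d = 0;
            cudaMalloc(&d, count * sizeof(float));
            cudaMemcpy(d, &host[offset], count * sizeof(float), cudaMemcpyHostToDevice);
            add_one<<<(count + 255) / 256, 256>>>(d, count);
            cudaMemcpy(&host[offset], d, count * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(d);
        }
        printf("host[0] = %f, host[n-1] = %f\n", host[0], host[n - 1]);
        return 0;
    }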

Aside from this problem of scale, the other major limitation of GPUs for parallel computing is the severe restriction on memory access patterns. Random memory access is possible, but carefully planned memory access will result in many-fold better performance.
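
A minimal sketch of what "carefully planned" means in CUDA terms: the two kernels below move the same amount of data, but in the first one neighbouring threads read neighbouring addresses (coalesced), while the second strides across memory and is typically several times slower.

    // coalescing.cu - same amount of work, very different memory access patterns.
    #include <cuda_runtime.h>

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                    // neighbouring threads touch neighbouring words
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int j = (int)(((long long)i * stride) % n);  // scatter the reads across memory
            out[i] = in[j];
        }
    }

    int main()
    {
        const int n = 1 << 24;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
        copy_strided  <<<(n + 255) / 256, 256>>>(in, out, n, 32);
        cudaDeviceSynchronize();               // time the two kernels (e.g. with nvprof) to see the gap

        cudaFree(in); cudaFree(out);
        return 0;
    }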

Perhaps the most promising upcoming contender is Intel's Larrabee. It has considerably better access to the CPU, system memory, and, perhaps most importantly, caching. This should give it considerable advantages with many algorithms. If it can't match the massive memory bandwidth of current GPUs, though, it may lag behind the competition for algorithms that make optimal use of this bandwidth.

The current generation of hardware and software requires a lot of developer effort to get optimal performance. This often includes restructuring algorithms to make efficient use of the GPU memory. It also often involves experimenting with different approaches to find the best one.

Note also that the effort required to get optimal performance is necessary to justify the use of GPU hardware. The difference between a naive implementation and an optimized implementation can be an order of magnitude or more. This means that an optimized CPU implementation will likely be as good as or even better than a naive GPU implementation.

People are already working on .NET bindings for CUDA. See here. However, with the necessity of working at a low level, I don't think GPU computing is ready for the masses yet.

I have heard a great deal of talk about turning what today are GPUs into more general-purpose "array processor units", for use with any matrix math problem, rather than just graphics processing. I haven't seen much come of it yet though.

The theory was that array processors might follow roughly the same trajectory that floating-point processors followed a couple of decades before. Originally, floating-point processors were expensive add-on options for PCs that not a lot of people bothered to buy. Eventually they became so vital that they were put into the CPU itself.

I'll repeat the answer I gave here.

Long-term I think that the GPU will cease to exist, as general purpose processors evolve to take over those functions. Intel's Larrabee is the first step. History has shown that betting against x86 is a bad idea.

GHC (Haskell) researchers (working for Microsoft Research) are adding support for Nested Data Parallelism directly to a general purpose programming language. The idea is to use multiple cores and/or GPUs on the back end yet expose data parallel arrays as a native type in the language, regardless of the runtime executing the code in parallel (or serial for the single-CPU fallback).

http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

Depending on the success of this in the next few years, I would expect to see other languages (C# specifically) pick up on the idea, which could bring these sorts of capabilities to a more mainstream audience. Perhaps by that time the CPU-GPU bandwidth and driver issues will be resolved.

GPUs work well in problems where there is a high level of data-level parallelism, which essentially means there is a way to partition the data so that all of the pieces can be processed independently.

GPUs aren't inherently as fast at a clock-speed level. In fact, I'm relatively sure the clock speed of the shaders (or maybe they have a more GPGPU term for them these days?) is quite slow compared to the ALUs on a modern desktop processor. The thing is, a GPU has an absolutely enormous number of these shaders, turning the GPU into a very large SIMD processor. With the number of shaders on a modern GeForce, for example, it's possible for a GPU to be working on several hundred (thousand?) floating point numbers at once.

So, in short, a GPU can be amazingly fast for problems where you can partition the data properly and process the partitions independently. It's not so powerful at task (thread) level parallelism.

A big problem with GPU technology is that while you do have a lot of compute capability in there, getting data into it (and out of it) is terrible, performance-wise. And watch carefully for any comparison benchmarks... they often compare gcc (with minimal optimization, no vectorization) on a single-processor system to the GPU.

Another big problem with GPUs is that if you don't CAREFULLY think about how your data is organized, you will suffer a real performance hit internally (in the GPU). This often involves rewriting very simple code into a convoluted pile of rubbish.

I'm very excited about this technology. However, I think that this will only exacerbate the real challenge of large parallel tasks: bandwidth. Adding more cores will only increase contention for memory. OpenCL and other GPGPU abstraction libraries don't offer any tools to improve that.

Any high-performance computing hardware platform will usually be designed with the bandwidth issue carefully planned into the hardware, balancing throughput, latency, caching and cost. As long as commodity hardware, CPUs and GPUs, are designed in isolation from each other, with optimized bandwidth only to their local memory, it will be very difficult to improve this for the algorithms that need it.

It's true that GPUs can achieve very high performance numbers in data-level-parallelism situations, as many here have mentioned. But as I see it, there is not much use for it in user space right now. I can't help feeling that all this GPGPU propaganda comes from GPU manufacturers, who just want to find new markets and uses for their products. And that's absolutely OK. Have you ever wondered why Intel/AMD didn't include some mini-x86 cores in addition to the standard ones (say, a model with four x86 cores and 64 mini-x86 cores), just to boost data-level parallelism capabilities? They definitely could have done that, if they wanted to. My guess is that the industry just doesn't need that kind of processing power in regular desktop/server machines.

GPUs may or may not remain as popular as they are now, but the basic idea is becoming a rather popular approach to high power processing. One trend that is coming up now is the external "accelerator" to aid the CPU with large floating point jobs. A GPU is just one type of accelerator.

Intel is releasing a new accelerator called the Xeon Phi, which they're hoping can challenge the GPU as an HPC accelerator. The Cell processor took a similar approach, having one main CPU for doing general tasks and offloading compute-intensive tasks to some other processing elements, achieving some impressive speeds.

Accelerators in general seem to be of interest at the moment, so they should be around for a while at least. Whether or not the GPU remains as the de facto accelerator remains to be seen.

Your perception that GPUs are faster than CPUs is based on the misconception created by a few embarrassingly parallel applications applied to the likes of the PS3, NVIDIA and ATI hardware.

http://en.wikipedia.org/wiki/Embarrassingly_parallel

Most real world challenges are not decomposable easily into these types of tasks. The desktop CPU is way better suited for this type of challenge from both a feature set and performance standpoint.

I expect the same things that CPUs are used for?

I just mean this seems like a gimmick to me. I hesitate to say "that's going nowhere" when it comes to technology, but a GPU's primary function is graphics rendering and a CPU's primary function is all other processing. Having the GPU do anything else just seems whacky.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow