CUDA or FPGA for special purpose 3D graphics computations?

https://stackoverflow.com/questions/317731

11-07-2019
|

Question

I am developing a product with heavy 3D graphics computations, to a large extent closest point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGA (because it can be tailored), while our junior developer advocates GPGPU with CUDA, because its cheap, hot and open. While I feel I lack judgement in this question, I believe CUDA is the way to go also because I am worried about flexibility, our product is still under strong development.

So, rephrasing the question, are there any reasons to go for FPGA at all? Or is there a third option?

Solution

I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I get:

FPGAs are great for realtime systems, where even 1ms of delay might be too long. This does not apply in your case;
FPGAs can be very fast, espeically for well-defined digital signal processing usages (e.g. radar data) but the good ones are much more expensive and specialised than even professional GPGPUs;
FPGAs are quite cumbersome to programme. Since there is a hardware configuration component to compiling, it could take hours. It seems to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than software developers.

If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than a FPGA.

Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there's still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.

Hope that helps.

OTHER TIPS

We did some comparison between FPGA and CUDA. One thing where CUDA shines if you can realy formulate your problem in a SIMD fashion AND can access the memory coalesced. If the memory accesses are not coalesced(1) or if you have different control flow in different threads the GPU can lose drastically its performance and the FPGA can outperform it. Another thing is when your operation is realtive small, but you have a huge amount of it. But you cant (e.g. due to synchronisation) no start it in a loop in one kernel, then your invocation times for the GPU kernel exceeds the computation time.

Also the power of the FPGA could be better (depends on your application scenarion, ie. the GPU is only cheaper (in terms of Watts/Flop) when its computing all the time).

Offcourse the FPGA has also some drawbacks: IO can be one (we had here an application were we needed 70 GB/s, no problem for GPU, but to get this amount of data into a FPGA you need for conventional design more pins than available). Another drawback is the time and money. A FPGA is much more expensive than the best GPU and the development times are very high.

(1) Simultanously accesses from different thread to memory have to be to sequential addresses. This is sometimes really hard to achieve.

I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had i860, then Transputer, then DSP, then the FPGA and direct-compiliation-to-hardware.
What innevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them - regular CPUs had advanced to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the makers of the board went bust.

By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performence of GPUs is improving faster then CPUs and is funded by the gamers. It's a mainstream technology and so will probably merge with multi-core CPUs in the future and so protect your investment.

FPGAs

What you need:
- Learn VHDL/Verilog (and trust me you won't)
- Buy hw for testing , licences on synthesis tools
- If you choose some good framework (for ex. : RSoC)
  - Develop design ( and it can take years )
- If you don't:
  - DMA, hw driver, ultra expensive synthesis tools
  - tons of knowledge about buses, memory mapping, hw synthesis
  - build the hw, buy the ip cores
  - Develop design
For example average FPGA pcie card with chip Xilinx virtex-6 costs more than 3000$
Result:
- If you are not paid by the government you don't have enough funds.

GPGPU (CUDA/OpenCL)

You already have hw to test on.
Compare to FPGA stuff:
- Everything is well documented .
- Everything is cheap
- Everything works
- Everything is well integrated to programming languages
There is GPU cloud as well.
Result:
- You need to just download sdk and you can start.

FPGA-based solution is likely to be way more expensive than CUDA.

Obviously this is a complex question. The question might also include the cell processor. And there is probably not a single answer which is correct for other related questions.

In my experience, any implementation done in abstract fashion, i.e. compiled high level language vs. machine level implementation, will inevitably have a performance cost, esp in a complex algorithm implementation. This is true of both FPGA's and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data i/o etc.

Another general example where an FPGA can be much higher performance is in cascaded processes where on process outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple, and can dramatically lower memory I/O requirements while processor memory will be used to effectively cascade two or more processes where there are data dependencies.

The same can be said of a GPU and CPU. Algorithms implemented in C executing on a CPU developed without regard to the inherent performance characteristics of the cache memory or main memory system will not perform as well as one implemented which does. Granted, not considering these performance characteristics simplifies implementation. But at a performance cost.

Having no direct experience with a GPU, but knowing its inherent memory system performance issues, it too will be subjected to performance issues.

This is an old thread started in 2008, but it would be good to recount what happened to FPGA programming since then: 1. C to gates in FPGA is the mainstream development for many companies with HUGE time saving vs. Verilog/SystemVerilog HDL. In C to gates System level design is the hard part. 2. OpenCL on FPGA is there for 4+ years including floating point and "cloud" deployment by Microsoft (Asure) and Amazon F1 (Ryft API). With OpenCL system design is relatively easy because of very well defined memory model and API between host and compute devices.

Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs for the reasons of both being fixed silicon and not having broadband (100Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor extracting more heat from the single chip package without melting it, so this looks like the end of the road for single package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.

CUDA has a fairly substantial code base of examples and a SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say from a logistic point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.

At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the web-site for learning. On Windows, you need to make sure CUDA is running on a card with no displays as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.

Any mahcine with two PCI-e x16 slots should support this. I used a HP XW9300, which you can pick up off ebay quite cheaply. If you do, make sure it has two CPU's (not one dual-core CPU) as the PCI-e slots live on separate Hypertransport buses and you need two CPU's in the machine to have both buses active.

I'm a CUDA developer with very littel experience with FPGA:s, however I've been trying to find comparisons between the two.

What I've concluded so far:

The GPU has by far higher ( accessible ) peak performance It has a more favorable FLOP/watt ratio. It is cheaper It is developing faster (quite soon you will literally have a "real" TFLOP available). It is easier to program ( read article on this not personal opinion)

Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.

BUT the gpu is not more favorable when you need to do random accesses to data. This will hopefully change with the new Nvidia Fermi architecture which has an optional l1/l2 cache.

my 2 cents

FPGA will not be favoured by those with a software bias as they need to learn an HDL or at least understand systemC.

For those with a hardware bias FPGA will be the first option considered.

In reality a firm grasp of both is required & then an objective decision can be made.

OpenCL is designed to run on both FPGA & GPU, even CUDA can be ported to FPGA.

FPGA & GPU accelerators can be used together

So it's not a case of what is better one or the other. There is also the debate about CUDA vs OpenCL

Again unless you have optimized & benchmarked both to your specific application you can not know with 100% certainty.

Many will simply go with CUDA because of its commercial nature & resources. Others will go with openCL because of its versatility.

What are you deploying on? Who is your customer? Without even know the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team that have knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it and it takes a different frame of mind than conventional programming.

FPGAs have fallen out of favor in the HPC sector because they're a horrorterror to program. CUDA is in because it is much much nicer to program and will still give you some good performance. I would go with what the HPC community has gone with and do it in CUDA. It's easier, it's cheaper, it's more maintainable.

Others have given good answers, just wanted to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPU with FPGA and CPU on energy efficiency metric. Most papers report: FPGA is more energy efficient than GPU, which, in turn, is more energy efficient than CPU. Since power budgets are fixed (depending on cooling capability), energy efficiency of FPGA means one can do more computations within same power budget with FPGA, and thus get better performance with FPGA than with GPU. Of course, also account for FPGA limitations, as mentioned by others.

FPGAs are more parallel than GPUs, by three orders of magnitude. While good GPU features thousands of cores, FPGA may have millions of programmable gates.
While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent from each other.
FPGA can be very fast with some groups of tasks and are often used where a millisecond is already seen as a long duration.
GPU core is way more powerful than FPGA cell, and much easier to program. It is a core, can divide and multiply no problem when FPGA cell is only capable of rather simple boolean logic.
As GPU core is a core, it is efficient to program it in C++. Even it it is also possible to program FPGA in C++, it is inefficient (just "productive"). Specialized languages like VDHL or Verilog must be used - they are difficult and challenging to master.
Most of the true and tried instincts of a software engineer are useless with FPGA. You want a for loop with these gates? Which galaxy are you from? You need to change into the mindset of electronics engineer to understand this world.

at latest GTC'13 many HPC people agreed that CUDA is here to stay. FGPA's are cumbersome, CUDA is getting quite more mature supporting Python/C/C++/ARM.. either way, that was a dated question

Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in HDL it will almost surely be too much of a challenge for you, but you can still program them with OpenCL which is kinda similar to CUDA. However, it is harder to implement and probably a lot more expensive than programming GPUs.

Which one is Faster?

GPU runs faster, but FPGA can be more efficient.

GPU has the potential of running at a speed higher than FPGA can ever reach. But only for algorithms that are specially suited for that. If the algorithm is not optimal, the GPU will loose a lot of performance.

FPGA on the other hand runs much slower, but you can implement problem-specific hardware that will be very efficient and get stuff done in less time.

It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.

Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be granulated into a lot of pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, FPGA will not be very happy with it :)

I have dedicated my whole master's thesis to this topic. Algorithm Acceleration on FPGA with OpenCL

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow