Question

I'm using an OMAP L138 processor at the moment, which does not have a hardware FPU. We will be processing spectral data using algorithms that are FP intensive, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (I can't say exactly what it involves). The initial performance numbers are:

  • Core i7 laptop @ 2.9 GHz: 1 second
  • Raspberry Pi ARM1176 @ 700 MHz: 12 seconds
  • OMAP L138 ARM926 @ 300 MHz: 193 seconds

Worse, the Pi is about 30% of the price of the board I'm using!

I do have a TI C674x which is the other processor in the OMAP L138. The question is would I be best served by spending many weeks trying to:

  • learn DSPLINK, the interop libraries and the toolchain, not to mention forking out the large cost of Code Composer, or
  • throw out the L138 and move to a dual Cortex-A9 board such as the Pandaboard, possibly suffering power penalties in the process.

(When I look at FPU performance, the Cortex-A8 isn't an improvement over the Raspberry Pi, but the Cortex-A9 seems to be.)

I understand the answer is "it depends". Others here have said that "you unlock an incredibly fast DSP that can easily outperform the Cortex-A8 if assigned the right job", but for a defined job set would I be better off skipping to the A9, even if I had to buy an external DSP later?


Solution

That question can't be answered without knowing the clock rates of the DSP and the ARM.

Here is some background:

I just checked the cycles of a floating point multiplication on the c674x DSP:

It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).

You can however start two multiplications each cycle because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
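To illustrate (this is generic C, not TI-specific code), a loop of the following shape is what a software-pipelining compiler such as TI's works on: the `restrict` qualifiers tell it the arrays don't alias, so it can issue a new pair of independent multiplications every cycle instead of stalling on the three-cycle result latency:

```c
#include <stddef.h>

/* Element-wise product: every iteration's multiplication is independent,
 * so a software-pipelining compiler can keep issuing new multiplies each
 * cycle while earlier results are still in flight (3-cycle latency). */
void vec_mul(float *restrict out, const float *restrict a,
             const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}
```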

That uses only two of the DSP's eight functional units, so while you are doing the two multiplications you can, in the same cycle, also do:

  • two load/stores (64 bit wide)
  • four floating point add/subtract instructions (or integer instructions)

Loop control and branching are free and cost you nothing on the DSP.

That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.
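To make that per-cycle budget concrete, here is a sketch (again generic C, not vendor code) of a dot-product kernel. In the steady state of a loop like this, the compiler can keep the multipliers, the adders, and the load units busy in the same cycle; unrolling by two feeds both multipliers:

```c
#include <stddef.h>

/* Dot product: each iteration needs loads, multiplies and adds, which
 * maps naturally onto the DSP's parallel functional units. Two
 * accumulators let both multipliers issue every cycle. */
float dot(const float *restrict a, const float *restrict b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* odd-length tail */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```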

ARM-NEON on the other hand can, in floating point mode:

Issue two multiplications per cycle. Latency is comparable, and the instructions pipeline just as they do on the DSP. Loads/stores, however, take extra time, as do the add/subtract instructions. Loop control and branching will very likely be free in well-written code.
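On the ARM side the equivalent kernel is usually either NEON intrinsics or plain C that the compiler auto-vectorizes. A minimal sketch of the latter (portable C, so it builds anywhere; on GCC for ARMv7 you would need flags along the lines of `-mfpu=neon -funsafe-math-optimizations` before the compiler will use NEON for single-precision floats, since NEON is not fully IEEE-compliant):

```c
#include <stddef.h>

/* y[i] += a * x[i]: a compiler targeting NEON can turn this into vector
 * multiply-accumulates, four floats at a time. On NEON the surrounding
 * loads and adds cost extra issue slots, unlike on the DSP. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```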

So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.

Now you can check the clock rates of the DSP and the ARM and see which is faster for your job.
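As a back-of-envelope check (the clock rates below are illustrative assumptions for this class of hardware, not measurements): six FLOPs per cycle at a 300 MHz DSP clock lands in the same ballpark as two per cycle on a ~1 GHz Cortex-A9:

```c
/* Peak single-precision throughput = clock rate * FLOPs issued per cycle.
 * The clock rates passed in are assumptions, not vendor specifications. */
double peak_flops(double clock_hz, double flops_per_cycle)
{
    return clock_hz * flops_per_cycle;
}
```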

Oh, one thing: with well-written DSP code you will almost never see a cache miss during loads, because you move the data from RAM into on-chip memory using DMA before you access it. This gives the DSP an impressive speed advantage as well.
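The usual pattern for this is double buffering: while the DSP crunches one on-chip buffer, the DMA engine fills the other. The real EDMA API is board-specific, so the sketch below uses a hypothetical `dma_copy()` (stood in for by a plain `memcpy`, which is synchronous and so doesn't actually overlap) purely to show the structure:

```c
#include <string.h>
#include <stddef.h>

#define BLK 256  /* floats per on-chip block (illustrative size) */

/* Hypothetical stand-in for an asynchronous EDMA transfer. A real port
 * would kick off the transfer here and wait on completion before the
 * buffer swap; memcpy is used only so the sketch runs anywhere. */
static void dma_copy(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *src);
}

/* Sum a large array block by block with double buffering: compute on
 * the current buffer while the "DMA" fetches the next block from RAM. */
float sum_blocks(const float *ram, size_t nblocks)
{
    static float buf[2][BLK];   /* would live in on-chip L1/L2 RAM */
    float total = 0.0f;
    int cur = 0;

    dma_copy(buf[cur], ram, BLK);                 /* prefetch block 0  */
    for (size_t b = 0; b < nblocks; ++b) {
        if (b + 1 < nblocks)                      /* start next fetch  */
            dma_copy(buf[cur ^ 1], ram + (b + 1) * BLK, BLK);
        for (size_t i = 0; i < BLK; ++i)          /* compute on current */
            total += buf[cur][i];
        cur ^= 1;                                 /* swap buffers       */
    }
    return total;
}
```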

OTHER TIPS

It does depend on the application but, generally speaking, it is rare these days for special-purpose processors to beat general-purpose processors. General-purpose processors now have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow