To get good performance out of the IPP FFT, I had to spin off as many tasks as there are cores in the machine, that is, cores per package times the number of packages.
With NUMA enabled, I hit another scalability problem, which I addressed by enabling gcServer in the app config file. This seems to ensure that memory is allocated evenly across the NUMA nodes.
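For reference, a minimal sketch of the setting mentioned above, assuming a standard .NET Framework App.config (deployed next to the executable as YourApp.exe.config, where the file name is only a placeholder). It shows nothing more than the gcServer switch; any other runtime settings your app needs would sit alongside it.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Use the server garbage collector instead of the workstation GC -->
    <gcServer enabled="true" />
  </runtime>
</configuration>
```

The switch is read at process start, so the app has to be restarted for a change to take effect.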
With HT enabled and Intel Turbo Boost on, I see less than 50% CPU utilization, often as low as 35%. Once Turbo Boost is off, I see 50% CPU load consistently.
It's nice to see that, with the .NET 4.5 Task Parallel Library, server-class performance tuning is externalized to configuration. It would be even nicer to get it for free, always.
Details: tested on a dual Xeon E5 v1 rig running Windows Server 2008 R2 SP1 Enterprise.