I'm experimenting with the C++ AMP library in F# as a way of using the GPU to do work in parallel. However, the results I'm getting don't seem intuitive.
In C++, I made a library with one function that squares all the numbers in an array, using AMP:
extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
// Create a view over the data on the CPU
array_view<double,1> dataView(n, &arr[0]);
// Run code on the GPU
parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
{
dataView[idx] = dataView[idx] * dataView[idx];
});
// Copy data from GPU to CPU
dataView.synchronize();
}
(Code adapted from Igor Ostrovsky's blog on MSDN.)
I then wrote the following F# to compare the Task Parallel Library (TPL) to AMP:
// Print the time needed to run the given function
let time f =
let s = new Stopwatch()
s.Start()
f ()
s.Stop()
printfn "elapsed: %d" s.ElapsedTicks
module CInterop =
[<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
extern void square_array(float[] array, int length)
let options = new ParallelOptions()
let size = 1000.0
let arr = [|1.0 .. size|]
// Square the number at the given index of the array
let sq i =
do arr.[i] <- arr.[i] * arr.[i]
()
// Square every number in the array using TPL
time (fun() -> Parallel.For(0, arr.Length - 1, options, new Action<int>(sq)) |> ignore)
let arr2 = [|1.0 .. size|]
// Square every number in the array using AMP
time (fun() -> CInterop.square_array(arr2, arr2.Length))
If I set the array size to a trivial number like 10, it takes the TPL ~22K ticks to finish, and AMP ~10K ticks. That's what I expect. As I understand it, a GPU (hence AMP) should be better suited to this situation, where the work is broken into very small pieces, than the TPL.
However, if I increase the array size to 1000, the TPL now takes ~30K ticks and AMP takes ~70K ticks. And it just gets worse from there. For an array of size 1 million, AMP takes nearly 1000x as long as the TPL.
Since I expect the GPU (i.e. AMP) to be better at this kind of task, I'm wondering what I'm missing here.
My graphics card is a GeForce 550 Ti with 1GB, not a slouch as far as I know. I know there's overhead in using PInvoke to call into the AMP code, but I expect that to be a flat cost that is amortized over larger array sizes. I believe the array is passed by reference (though I could be wrong), so I don't expect any cost associated with copying that.
Thank you to everyone for your advice.