"Out of thin air" hypothesis, but the difference in performance you observe here seems to be related to CPU caching; your ARM CPU has far less cache than your desktop's i7.
Your float array has two million elements in it; at 4 bytes per float, that makes for a minimum of 8 MB of storage. Those 8 MB need to reach the CPU.
I also have an i7 here and its cache sizes are: 32 KB (L1), 256 KB (L2), 6 MB (L3); three quarters of the float array can fit into L3! In your case it seems that only 32 KB can be cached at a time... Therefore there is a lot of cache thrashing and the memory bus traffic is very high.
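To make the arithmetic concrete, here is a minimal sketch of the footprint calculation. It assumes 4-byte single-precision floats and the i7 cache sizes quoted above (taken as binary KB/MB); the names `FLOAT_BYTES`, `CACHES` and `cached_fraction` are purely illustrative.

```python
# Assumption: 32-bit single-precision floats, 4 bytes each.
FLOAT_BYTES = 4
N_ELEMENTS = 2_000_000

# Cache sizes quoted above for the i7, interpreted as binary units.
CACHES = {
    "L1": 32 * 1024,
    "L2": 256 * 1024,
    "L3": 6 * 1024 * 1024,
}


def cached_fraction(n_elements, cache_bytes, elem_bytes=FLOAT_BYTES):
    """Fraction of the array that can sit in the cache at once (capped at 1.0)."""
    return min(1.0, cache_bytes / (n_elements * elem_bytes))


array_bytes = N_ELEMENTS * FLOAT_BYTES
print(f"array footprint: {array_bytes / 1e6:.1f} MB")
for name, size in CACHES.items():
    print(f"{name}: {cached_fraction(N_ELEMENTS, size):.1%} of the array fits")
```

With these numbers L3 holds roughly three quarters of the array, while a single 32 KB cache holds well under 1% of it.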
I suspect that if you reduce your array size to something which fits in 32 KB (for instance, try with only 1000 floats, which is about 4 KB) the performance figures will be far closer.
EDIT: it also happens that your CPU does not have an FPU; that accounts for the majority of the performance loss, as @Voo mentioned.
So:
- lack of an FPU,
- small cache,
- lots of data.
For a more "realistic" comparison, you should test over a smaller subset of data; this will at least alleviate (but not completely eliminate) the cache problem.
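If you want to pick array sizes for such a comparison, a rough sketch like the following tells you where each size would land. It assumes 4-byte floats, the i7 cache sizes quoted earlier, and a single 32 KB cache on the ARM side (that last figure is my guess from above, not a measured value); `smallest_cache_that_fits` is a hypothetical helper, not part of any library.

```python
def smallest_cache_that_fits(n_elements, caches, elem_bytes=4):
    """Name of the smallest cache level that holds the whole array, else 'RAM'."""
    size = n_elements * elem_bytes
    # Walk the cache levels from smallest to largest capacity.
    for name, capacity in sorted(caches.items(), key=lambda kv: kv[1]):
        if size <= capacity:
            return name
    return "RAM"


# Assumed cache layouts, in bytes.
I7_CACHES = {"L1": 32 * 1024, "L2": 256 * 1024, "L3": 6 * 1024 * 1024}
SMALL_CACHE = {"L1": 32 * 1024}  # assumption for the ARM CPU

for n in (1_000, 8_192, 50_000, 1_000_000, 2_000_000):
    print(
        f"{n:>9} floats: i7 -> {smallest_cache_that_fits(n, I7_CACHES)}, "
        f"32 KB only -> {smallest_cache_that_fits(n, SMALL_CACHE)}"
    )
```

Sizes that land in the same level on both machines are the ones where the cache stops being the main difference; the missing FPU will still show up in those runs.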