Question

I want to learn about parallel programming for Intel's Haswell CPU microarchitecture, and about using SIMD (SSE4.2, AVX2) in asm/C/C++ or any other language. Can you recommend books, tutorials, internet resources, or courses?

Thanks!

Solution

It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago, before I had ever used SSE, OpenMP, or intrinsics, so let me give a brief summary of some important concepts I have learned and point to some useful resources.

There are several parallel computing technologies that can be employed: MIMD, SIMD, instruction-level parallelism, multi-level caches, and FMA. With Haswell there is also computing on the IGP.

I recommend picking a topic like matrix multiplication or the Mandelbrot set. They can both benefit from all these technologies.

MIMD

By MIMD I am referring to computing using multiple physical cores. I recommend OpenMP for this. Go through this tutorial http://bisqwit.iki.fi/story/howto/openmp/#Abstract and then use this as a reference https://computing.llnl.gov/tutorials/openMP/. Two of the most common problems when using MIMD are race conditions and false sharing. Follow the OpenMP tag on SO regularly.
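
For example, here is a minimal sketch of a race-free dot product using a reduction clause (names are my own; compile with -fopenmp):

double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private partial sum and combines
    // them at the end, avoiding a race on the shared accumulator
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}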

SIMD

Many compilers can do auto-vectorization, so I would look into that first. MSVC's auto-vectorizer is quite primitive, but GCC's is really good.
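
A simple, dependence-free loop like the one below is a good auto-vectorization candidate; with GCC, a flag like -fopt-info-vec reports what was vectorized (flags assume a reasonably recent GCC):

// compile with e.g.: g++ -O3 -march=haswell -fopt-info-vec
// __restrict tells the compiler x and y don't alias, which helps vectorization
void saxpy(float a, const float *__restrict x, float *__restrict y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  // no loop-carried dependency
}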

Learn intrinsics. The best resource for finding out what an intrinsic does is http://software.intel.com/sites/landingpage/IntrinsicsGuide/
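
To give a flavor of what intrinsics code looks like, here is a sketch of an AVX sum (my own example; it assumes n is a multiple of 8):

#include <immintrin.h>

// sum n floats, 8 at a time, with AVX (assumes n is a multiple of 8)
float sum_avx(const float *x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    // horizontal reduction of the 8 lanes down to one float
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}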

Another great resource is Agner Fog's vectorclass. 95% of the questions on SO about SSE/AVX can be answered by looking at the source code of the vectorclass. On top of that, you could use the vectorclass for most SIMD work and still get full speed while skipping intrinsics.
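
For comparison, the same reduction with the vectorclass is much shorter (assuming vectorclass.h is on your include path and n is again a multiple of 8):

#include "vectorclass.h"  // Agner Fog's vector class library

float sum_vcl(const float *x, int n) {
    Vec8f acc(0.0f);
    for (int i = 0; i < n; i += 8) {
        Vec8f v;
        v.load(x + i);   // load 8 floats
        acc += v;        // lane-wise accumulate
    }
    return horizontal_add(acc);  // reduce the 8 lanes to one float
}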

A lot of people use SIMD inefficiently. Read about Array of Structs (AoS), Struct of Arrays (SoA), and Array of Structs of Arrays (AoSoA). Also look into Intel's strip mining technique and this question: Calculating matrix product is much slower with SSE than with straight-forward-algorithm.
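
To illustrate the difference, a sketch of the two layouts (names and N are my own):

enum { N = 1024 };

// AoS: fields are interleaved in memory (x0 y0 z0 x1 y1 z1 ...), so one SIMD
// load picks up unrelated fields that must be shuffled before use
struct PointAoS { float x, y, z; };
// PointAoS points[N];

// SoA: each field is contiguous (x0 x1 x2 ...), so one 256-bit load fills a
// register with 8 x-values that can be processed directly
struct PointsSoA {
    float x[N];
    float y[N];
    float z[N];
};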

See Ingo Wald's PhD thesis for an interesting way to implement SIMD in ray tracing. I used the same idea for the Mandelbrot set to calculate 4 (8) pixels at once using SSE (AVX).
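
The core of the idea looks roughly like this: keep 8 pixels in one AVX register and use a comparison mask so that only the pixels that have not yet escaped keep counting (a sketch, not a full renderer):

#include <immintrin.h>

// iterate 8 Mandelbrot pixels at once; cr/ci hold the 8 complex coordinates,
// the return value holds the per-pixel iteration counts (as floats)
__m256 mandel8(__m256 cr, __m256 ci, int maxiter) {
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();
    const __m256 one  = _mm256_set1_ps(1.0f);
    const __m256 four = _mm256_set1_ps(4.0f);
    for (int i = 0; i < maxiter; i++) {
        __m256 zr2 = _mm256_mul_ps(zr, zr);
        __m256 zi2 = _mm256_mul_ps(zi, zi);
        // lanes with |z|^2 < 4 are still iterating
        __m256 active = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LT_OQ);
        if (_mm256_movemask_ps(active) == 0) break;  // all 8 pixels escaped
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));  // count++ where active
        // z = z*z + c, done lane-wise
        __m256 newzi = _mm256_add_ps(_mm256_mul_ps(_mm256_add_ps(zr, zr), zi), ci);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), cr);
        zi = newzi;
    }
    return count;
}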

Also read the paper "Extending a C-like Language for Portable SIMD Programming" by Leißa, Hack, and Wald, http://www.cdl.uni-saarland.de/papers/leissa_vecimp_tr.pdf, to get a better idea of how to use SIMD.

FMA

FMA3 is new with Haswell. It's so new that there is not much discussion of it on SO yet. But this answer (to my question) is good: How to use Fused Multiply-Add (FMA) instructions with SSE/AVX. FMA3 doubles the peak FLOPS, so matrix multiplication is potentially twice as fast on Haswell compared to Ivy Bridge.

According to that answer, the most important aspect of FMA is not that it's one instruction instead of two to do a multiplication and an addition; it's the "(virtually) infinite precision of the intermediate result." For example, implementing double-double multiplication without FMA takes 6 multiplications and several additions, whereas with FMA it's only two operations.
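
Those two operations are the classic exact two-product; here is a sketch using the C++ standard library's fma (function name is my own):

#include <cmath>

// exact product: hi + lo == a*b exactly. The FMA computes a*b - hi with an
// exact intermediate result, recovering the rounding error of the multiply.
void two_prod(double a, double b, double &hi, double &lo) {
    hi = a * b;
    lo = std::fma(a, b, -hi);
}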

Instruction level parallelism

Haswell has 8 ports to which it can dispatch μ-ops (though not every port can take the same micro-op; see this AnandTech review). This means Haswell can, for example, do two 256-bit loads, one 256-bit store, two 256-bit FMA operations, one scalar addition, and a conditional jump in the same clock cycle (the store occupies two ports, so that's eight μ-ops across the eight ports).

For the most part you don't have to worry about this, since it's handled by the CPU. However, there are cases where your code can limit the potential instruction-level parallelism. The most common is a loop-carried dependency. The following code has a loop-carried dependency:

for (int i = 0; i < n; i++) {
    sum += x[i] * y[i];  // each iteration must wait for the previous add: one long dependency chain
}

The way to fix this is to unroll the loop and compute partial sums:

for (int i = 0; i < n; i += 2) {  // assumes n is even; remainder handling omitted
    sum1 += x[i]   * y[i];        // two independent accumulators, so the
    sum2 += x[i+1] * y[i+1];      // additions can execute in parallel
}
sum = sum1 + sum2;

Multi-level Caches

Haswell has up to four levels of cache. Writing your code to take optimal advantage of the cache is by far the most difficult challenge, in my opinion. It's the topic I still struggle with the most and feel the most ignorant about, but in many cases improving cache usage gives better performance than any of the other technologies. I don't have many recommendations for this.

You need to learn about sets and cache lines (and the critical stride) and, on NUMA systems, about pages. To learn a little about sets and the critical stride, see Agner Fog's http://www.agner.org/optimize/optimizing_cpp.pdf and this question: Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

Another very useful topic for the cache is loop blocking, or tiling. See my answer (the one with the highest votes) to What is the fastest way to transpose a matrix in C++? for an example.
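
The basic shape of a blocked transpose is sketched below (BLOCK is a tuning parameter you would pick based on cache line and cache sizes):

// transpose in BLOCK x BLOCK tiles so that both the reads from src and the
// writes to dst stay within a small, cache-resident working set
const int BLOCK = 16;  // tuning parameter

void transpose_blocked(const float *src, float *dst, int n) {
    for (int ib = 0; ib < n; ib += BLOCK)
        for (int jb = 0; jb < n; jb += BLOCK)
            for (int i = ib; i < ib + BLOCK && i < n; i++)
                for (int j = jb; j < jb + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}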

Computing on the IGP (with Iris Pro)

All Haswell consumer processors have an IGP (Haswell-E is not out yet). The IGP takes up from at least 30% of the silicon to over 50%, depending on the model. That's enough for at least 2 more x86 cores. This is wasted computing potential for most programmers. The only way to program the IGP is with OpenCL. Intel does not have OpenCL Iris Pro drivers for Linux, so you can only do this on Windows (I'm not sure how good Apple's implementation is). See Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL.
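
The kernel itself is just C-like code (OpenCL C rather than C++); all of the host-side setup (platform, device, context, queue, buffers) is omitted here. A trivial kernel, for flavor:

// OpenCL C: each work-item handles one element of the arrays
__kernel void saxpy(float a, __global const float *x, __global float *y) {
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}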

One advantage of the Iris Pro compared to NVIDIA and AMD is that double-precision floating point is only one quarter the speed of single precision on the Iris Pro (however, fp64 is only enabled in Direct Compute and not in OpenCL). NVIDIA and AMD (recently) cripple double-precision floating point so much that it makes GPGPU double-precision computing not very effective on their consumer cards.

Licensed under: CC-BY-SA with attribution