How to optimize a C++ program with intel compiler on AMD chips

https://stackoverflow.com/questions/16431886

14-04-2022
|

Pregunta

Newbie here. I have a big finite analysis code that needs to be run with high performance computing. People keep telling me Intel compiler usually gives better speed (I used to use gcc before). And I found that it is true on our Intel clusters. But recently we have a new AMD cluster. I am confused about how to use the compiling options of icpc to optimize the program.

Basically, I have two questions:

Questions 1

Here is the cluster with AMD chips:

processor       : 63
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6378                 
stepping        : 0
cpu MHz         : 2399.837
cache size      : 2048 KB
physical id     : 2
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 79
initial apicid  : 79
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr tbm topoext perfctr_core cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bogomips        : 4799.73
TLB size        : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate [9] [10]

When I compile a small code using icpc hello.cpp -O3 -xP, I do not know exactly what options should I use? I found the errors are:

$ /usr/bin/time -p ./a.out
Fatal Error: This program was not built to run on the processor in your system.
The allowed processors are: Intel(R) Pentium(R) 4 and compatible Intel processors with Intel(R) Streaming SIMD Extensions 3 (Intel(R) SSE3) instruction support.

real 0.00
user 0.00
sys 0.00

Question-2

If I want the binaries be used for both Intel chips cluster and AMD chips cluster, should I use different options to compile the code?

Solución

Intel compilers don't always work with AMD chips especially with certain flags like -xP (now -xSSE3, see here). Specifically -xSSE3/-xP tells the compiler: May generate Intel® SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for Intel processors that support Intel® SSE3 instructions. For OS X* systems, this value is only supported on IA-32 architecture. This replaces value P, which is deprecated.

That document also has this quote: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

You can try to optimize with icc and icpc, but I'm not sure it will work on AMD chips. For compilers other than gcc you could look at clang, PGI, or Cray compilers (if you are on a Cray system).

If you're trying to create binaries for both architectures, I'm not sure you'll be able to do heavy optimizations due to differences in cache line size and other architecture specific settings.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow