Fast Cross Product. Function call overhead?
Question
I'm newbie in C++ programming. I'm trying to see the benefits from moving all my MatLab software to C++. I'm doing some finite element stuff, mainly nonlinear, so one of the operations I need to perform massively is the cross product of two vectors. I've tested two implementations in Matlab and C++, C++ seems to be much more faster. In C++ two different implementations give different timings. I'm using Intel MKL.
Here is the code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <iostream>
#include <mkl.h>
void vprod( double vgr[3], double vg1[3], double vg2[3]);
int main() {
double v1[3]={1.22, 2.65, 3.65}, v2[3]={6.98, 98.159, 54.65}, vr[3];
int LC=1000000;
int i,j,k;
double tiempo=0.0, tinicial;
//------------------------------------------------------------------------
std::cout << "INLINE METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<LC; i++){
vr[0] = v1[1]*v2[2]-v1[2]*v2[1];
vr[1] =-(v1[0]*v2[2]-v1[2]*v2[0]);
vr[2] = v1[0]*v2[1]-v1[1]*v2[0];
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
//------------------------------------------------------------------------
std::cout << "FUNCTION METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<LC; i++){
vprod (vr,v1,v2);
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
std::cin.ignore();
return 0;
}
inline void vprod( double vgr[3], double vg1[3], double vg2[3]){
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
My question is: Why the first implementation is 3 times faster than the second? Is this the result of function call overhead? Thanks !!!
EDIT: I've modified the code in order to avoid the compiler "guessing" the results for the loop with constant vectors. As @phonetagger showed, the results are very different. I've got 28500 microseconds without using the vprod function and 29000 microseconds using the vprod
function. This number were obtained using Ox optimization. Changing the optimization doesn't affect the comparison if the inline keyword is on, although the numbers raise a bit. Also, if the inline keyword is not used (and optimization is off) the timings are 32000 without using the vprod function and 37000 using the function. So the function call overhead may be around 5000 microseconds.
The new code is:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <iostream>
#include <mkl.h>
//#include <mkl_lapack.h>
void vprod( double *vgr, int ploc, double *vg1, double *vg2);
int main() {
int nv=1000000;
int dim=3*nv;
double *v1, *v2, *vr; // Declare Pointers
int ploc, i;
double tiempo=0.0, tinicial;
v1 = new double [dim]; //Allocate block of memory
v2 = new double [dim];
vr = new double [dim];
// Fill vectors with something
for (i = 0; i < dim; i++) {
v1[i] =1.25 + (double)(i+1);
v2[i] =2.62+ 2*(double)(i+7);
}
//------------------------------------------------------------------------
std::cout << "RUTINA CON CODIGO INLINE: \n" ;
tinicial = dsecnd();
ploc = 0; // ploc points to an intermediate location.
for (i=0; i<nv; i++){
vr[ploc] = v1[ploc+1]*v2[ploc+2]-v1[ploc+2]*v2[ploc+1];
vr[ploc+1] =-(v1[ploc]*v2[ploc+2]-v1[ploc+2]*v2[ploc]);
vr[ploc+2] = v1[ploc]*v2[ploc+1]-v1[ploc+1]*v2[ploc];
ploc +=3;
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << ".\n";
std::cout << "Resultado: " << vr[0] << ".\n";
delete v1,v2,vr;
v1 = new double [dim]; //Allocate block of memory
v2 = new double [dim];
vr = new double [dim];
//------------------------------------------------------------------------
//------------------------------------------------------------------------
std::cout << "RUTINA LLAMANDO A FUNCION: \n" ;
ploc=0;
tinicial = dsecnd();
for (i=0; i<nv; i++){
vprod ( vr, ploc, v1, v2);
ploc +=3;
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << ".\n";
std::cout << "Resultado: " << vr[0] << ".\n";
//------------------------------------------------------------------------
std::cin.ignore();
return 0;
}
inline void vprod( double *vgr, int ploc, double *vg1, double *vg2) {
vgr[ploc] = vg1[ploc+1]*vg2[ploc+2]-vg1[ploc+2]*vg2[ploc+1];
vgr[ploc+1] = -(vg1[ploc]*vg2[ploc+2]-vg1[ploc+2]*vg2[ploc]);
vgr[ploc+2] = vg1[ploc]*vg2[ploc+1]-vg1[ploc+1]*vg2[ploc];
}
Solution
Martin, you are absolutely right (ref. Martin's comment... 3rd comment under my 17:57 Oct 5 2012 answer). Yes, it appears that at higher optimization levels, the compiler was allowing itself to realize that it knew the incoming values of your arrays so it could perform the entire computation, loop and all, at compile time, and optimize the loop out entirely.
I re-coded the test code into three separate files (one header & two source files) and broke the computation & loop out into a separate function to keep the compiler from being too smart with its optimizations. Now it can't optimize the loops into a compile-time computation. Below are my new results. Note that I added another loop (0 to 50) around the original 0 to 1000000 loop, and then divided by 50. I did this for two reasons: It allows us to compare today's numbers with the previous numbers, and it also averages out irregularities due to processes swapping in the middle of the test. That may not matter to you since I think dsecnd() reports only CPU time of its specific process?
Anyway, here are my new results.......
(And yes, the odd result of "inline keyword, optimization -O1" being faster than -O2 or -O3 is repeatable, as is the oddity of "no inline keyword, optimization -O1". I didn't dig into the assembly to see why that might be.)
//========================================================================================
// File: so.h
void loop_inline( const int LC, double vgr[3], double vg1[3], double vg2[3]);
void loop_func( const int LC, double vgr[3], double vg1[3], double vg2[3]);
//---------------------------------
// Comment or uncomment to test both ways...
#define USE_INLINE_KEYWORD
//
// Using g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52) on an x86 machine...
//
// microseconds microseconds
// "hardcoded inline" "via vprod() function"
// [i]=inlined, [-]=not
// ------------------ ----------------------
// inline keyword
// no optimization 11734 14598 [-]
// optimization -O1 4617 4616 [i]
// optimization -O2 7754 7838 [i]
// optimization -O3 7777 7673 [i]
//
// no inline keyword
// no optimization 11807 14602 [-]
// optimization -O1 4651 7691 [-]
// optimization -O2 7755 7383 [-]
// optimization -O3 7921 7432 [-]
//
// Note that in all cases, both results were reported as -213.458.
//
/* My cut & paste "build & test script" to run on the Linux command prompt...
echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""
rm -f a.out; g++ -c so.cpp so2.cpp; g++ so.o so2.o;
echo ""; echo "No optimization:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O1 -c so.cpp so2.cpp; g++ so.o so2.o;
echo ""; echo "Optimization -O1:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O2 -c so.cpp so2.cpp; g++ so.o so2.o;
echo ""; echo "Optimization -O2:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O3 -c so.cpp so2.cpp; g++ so.o so2.o;
echo ""; echo "Optimization -O3:---------------"; objdump -d a.out | grep call | grep vprod; a.out
...if the "objdump -d a.out | grep call | grep vprod" command returns something
like "call 8048754 <_Z5vprodPdS_S_>", then I know that the call to vprod() is
NOT inlined, whereas if it returns nothing, I know the call WAS inlined.
*/
//========================================================================================
// File: so.cpp
// Sorry so messy, I didn't bother to clean up the #includes.......
#include <stdint.h>
#include <inttypes.h>
#include <stddef.h> // for NULL
#include <stdlib.h> // for exit()
#include <stdio.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <iostream>
//#include <mkl.h>
#include "so.h"
// My standin for dsecnd() since I don't have "mkl.h"...
#include <sys/time.h>
double dsecnd()
{
struct timeval tv;
if (gettimeofday(&tv,NULL))
{
fprintf(stderr,"\ngettimeofday() error\n\n");
exit(1);
}
return tv.tv_sec*1000000 + tv.tv_usec; // ...returns MICROSECONDS
//return tv.tv_sec + ((double)tv.tv_usec)/1000000; // ...returns SECONDS
}
//---------------------------------
#ifndef USE_INLINE_KEYWORD
// We're NOT using the 'inline' keyword, so define vprod() in this
// file so it can't possibly be inlined where it's called (in the
// other source file).
void vprod( double vgr[3], double vg1[3], double vg2[3]){
//void vprod( double *vgr, double *vg1, double *vg2){
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
#endif
int main() {
double v1[3]={1.22, 2.65, 3.65}, v2[3]={6.98, 98.159, 54.65}, vr[3];
int LC=1000000L;
int i, N=100;
double tiempo=0.0, tinicial;
//------------------------------------------------------------------------
std::cout << "INLINE METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<N; ++i)
loop_inline(LC,vr,v1,v2);
tiempo = (dsecnd() - tinicial)/N;
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
//------------------------------------------------------------------------
std::cout << "FUNCTION METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<N; ++i)
loop_func(LC,vr,v1,v2);
tiempo = (dsecnd() - tinicial)/N;
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
// std::cin.ignore();
return 0;
}
//========================================================================================
// File: so2.cpp
#include "so.h"
#ifdef USE_INLINE_KEYWORD
inline void vprod( double vgr[3], double vg1[3], double vg2[3]){
//void vprod( double *vgr, double *vg1, double *vg2){
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
#else
// Not using 'inline' keyword, so just declare (prototype) the
// function here and define it in the other source file (so it
// can't possibly be inlined).
void vprod( double vgr[3], double vg1[3], double vg2[3]);
#endif
void loop_inline( const int LC, double vgr[3], double vg1[3], double vg2[3]){
for (int i=0; i<LC; i++) {
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
}
void loop_func( const int LC, double vgr[3], double vg1[3], double vg2[3]){
for (int i=0; i<LC; i++) {
vprod (vgr,vg1,vg2);
}
}
OTHER TIPS
I don't know what compiler you're using (is "MKL" a compiler suite?), but regardless of what compiler you're using, optimization levels are going to have a dramatic impact on your code's performance, sometimes multiple orders of magnitude, depending on your coding style & whether or not you try to "play tricks" to make your code run faster. Often (though not always) it's better to let the compiler play the tricks for you, and you just focus on writing efficient algorithms rather than play coding tricks.
In any case, I ran your code on my system in various ways, with the results shown in the code comments below...
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <iostream>
//#include <mkl.h>
// My standin for dsecnd() since I don't have "mkl.h"...
#include <sys/time.h>
double dsecnd()
{
struct timeval tv;
if (gettimeofday(&tv,NULL))
{
fprintf(stderr,"\ngettimeofday() error\n\n");
exit(1);
}
return tv.tv_sec*1000000 + tv.tv_usec; // ...returns MICROSECONDS
//return tv.tv_sec + ((double)tv.tv_usec)/1000000; // ...returns SECONDS
}
//---------------------------------
// Uncomment one or both of these to test variations....
//#define USE_INLINE_KEYWORD
//#define DEFINE_vprod_AT_TOP
//
// Using g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52) on an x86 machine...
//
// microseconds microseconds
// "hardcoded inline" "via vprod() function"
// [i]=inlined, [-]=not
// ------------------ ----------------------
// inline keyword, at top
// no optimization 9501 17797 [-]
// optimization -O1 2 (see NOTE) 1 [i]
// optimization -O2 1 1 [i]
// optimization -O3 0 0 [i]
//
// no inline keyword, at top
// no optimization 9630 18203 [-]
// optimization -O1 1257 10681 [-]
// optimization -O2 1272 10694 [-]
// optimization -O3 0 1 [i]
//
// inline keyword, at bottom
// no optimization 9763 18333 [-]
// optimization -O1 1 0 [i]
// optimization -O2 2 1 [i]
// optimization -O3 0 0 [i]
//
// no inline keyword, at bottom
// no optimization 9900 18387 [-]
// optimization -O1 1289 10714 [-]
// optimization -O2 795 6740 [-]
// optimization -O3 1 0 [i]
//
// Note that in all cases, both results were reported as -213.458.
//
// NOTE: Especially since I'm using gettimeofday() instead of something
// that returns process (CPU) time, all results may include some
// time that the CPU spent processing other stuff, but even if
// that weren't the case (i.e. even if I used a function that
// returned only CPU time spent on this particular process), there
// would still be the quantization error of +/-1 microsecond on
// each end of the interval, meaning +/-2 microseconds overall.
//
/* My cut & paste "build & test script" to run on the Linux command prompt...
echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""; echo ""
rm -f a.out; g++ so.cpp
echo ""; echo "No optimization:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O1 so.cpp
echo ""; echo "Optimization -O1:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O2 so.cpp
echo ""; echo "Optimization -O2:---------------"; objdump -d a.out | grep call | grep vprod; a.out
rm -f a.out; g++ -O3 so.cpp
echo ""; echo "Optimization -O3:---------------"; objdump -d a.out | grep call | grep vprod; a.out
...if the "objdump -d a.out | grep call | grep vprod" command returns something
like "call 8048754 <_Z5vprodPdS_S_>", then I know that the call to vprod() is
NOT inlined, whereas if it returns nothing, I know the call WAS inlined. There
is only one caller of vprod(), so the results can't be confusing.
*/
//
//---------------------------------
#ifdef DEFINE_vprod_AT_TOP
#ifdef USE_INLINE_KEYWORD
inline
#endif
void vprod( double vgr[3], double vg1[3], double vg2[3]){
//void vprod( double *vgr, double *vg1, double *vg2){
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
#else
// Declare (prototype) the function only if NOT defining it at the top...
void vprod( double vgr[3], double vg1[3], double vg2[3]);
#endif
int main() {
double v1[3]={1.22, 2.65, 3.65}, v2[3]={6.98, 98.159, 54.65}, vr[3];
int LC=1000000L;
int i,j,k;
double tiempo=0.0, tinicial;
//------------------------------------------------------------------------
std::cout << "INLINE METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<LC; i++){
vr[0] = v1[1]*v2[2]-v1[2]*v2[1];
vr[1] =-(v1[0]*v2[2]-v1[2]*v2[0]);
vr[2] = v1[0]*v2[1]-v1[1]*v2[0];
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
//------------------------------------------------------------------------
std::cout << "FUNCTION METHOD: " << std::endl;
tinicial = dsecnd();
for (i=0; i<LC; i++){
vprod (vr,v1,v2);
};
tiempo = (dsecnd() - tinicial);
std::cout << "Tiempo Total: " << tiempo << std::endl;
std::cout << "Resultado: " << vr[0] << std::endl;
//------------------------------------------------------------------------
// std::cin.ignore();
return 0;
}
#ifndef DEFINE_vprod_AT_TOP
#ifdef USE_INLINE_KEYWORD
inline
#endif
void vprod( double vgr[3], double vg1[3], double vg2[3]){
//void vprod( double *vgr, double *vg1, double *vg2){
vgr[0] = vg1[1]*vg2[2]-vg1[2]*vg2[1];
vgr[1] =-(vg1[0]*vg2[2]-vg1[2]*vg2[0]);
vgr[2] = vg1[0]*vg2[1]-vg1[1]*vg2[0];
}
#endif
Now the coding tricks the compiler uses don't come in linear fashion with increasing levels of optimization; the tricks the compiler plays turn on at different optimization levels and may depend on whether you use the "inline" keyword. There may be (and my results indicate that there are) different types of optimizations the compiler may employ other than inlining a function. It is interesting to note that, as I've read, the "inline" keyword is really just a suggestion to the compiler that you want a function inlined, and probably just adjusts some threshold for determining whether to inline a function it might have inlined anyway if optimization is turned on. It appears that with optimization turned off, the function was never inlined at all even if the "inline" keyword is used. It is also interesting to note that whether prod() is defined above main() or below main() seems to make no difference in whether the function is inlined or not.