I have written optimized code for an algorithm that computes a vector of quantities. I have timed it before and after various attempts at getting the data computed in the function out of the function. I think that the specific nature of the computation or the nature of the vector of quantities is not relevant. An outline of the code, timings, and details follow.
All code was compiled with the following flags:
g++ -Wall -Wextra -Werror -std=c++11 -pedantic -O3
I have a class like this:
#ifndef C_H
#define C_H

#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>

class C {
public:
    void doWork( int param1, int param2 ) const {
        std::array<unsigned long,40> counts = {{0}};
        // LOTS of branches and inexpensive operations:
        // additions, subtractions, incrementations, and dereferences
        for( /* loop 1 */ ) {
            // LOTS MORE branches and inexpensive operations
            counts[ /* data dependent */ ] += /* data dependent */;
            for( /* loop 2 */ ) {
                // YET MORE branches and inexpensive operations
                counts[ /* data dependent */ ] += /* data dependent */;
            }
        }
        counts[ /* data dependent */ ] = /* data dependent */;
        /* excluded for profiling
        std::copy( counts.begin(), counts.end(),
                   std::ostream_iterator<unsigned long>( std::cout, "," ) );
        std::cout << "\n";
        */
    }
private:
    // there is private data here that is processed above
    // the results get added into the array/vector as they are computed
};

#endif
And a main like this:
#include <iostream>
#include "c.h"

int main() {
    C c( /* set the private data of c by passing data in */ );
    int param1;
    int param2;
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2 );
    }
}
Here are some relevant details about the data:
- 20 million pairs read from standard input (redirected from a file)
- 20 million calls to c.doWork
- 60 million TOTAL iterations through the outer loop in c.doWork
- 180 million TOTAL iterations through the inner loop in c.doWork
All of this runs in exactly 5 minutes and 48 seconds. Naturally I can print the array within the class function, and that is what I have been doing, but I am going to release the code publicly, and some use cases may include doing something other than printing the vector. In that case I need to change the function signature to actually get the data to the user, and this is where the problem arises. Things that I have tried:
Creating a vector in main and passing it in by reference:
std::vector<unsigned long> counts( 40 );
while( std::cin >> param1 >> param2 ) {
    c.doWork( param1, param2, counts );
    std::fill( counts.begin(), counts.end(), 0 );
}
This requires 7 minutes 30 seconds. Removing the call to std::fill only reduces this by 15 seconds, so that doesn't account for the discrepancy.
Creating a vector within the doWork function and returning it, taking advantage of move semantics.
Since this requires a dynamic allocation for each result, I didn't expect this to be fast. Strangely it's not a lot slower. 7 minutes 40 seconds.
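I didn't show this variant above; roughly it looked like the sketch below, where the loop body is a stand-in for the real computation (the names and the placeholder loop are mine, not the actual code):

```cpp
#include <vector>

// Sketch of the return-a-vector variant; the loop body is a placeholder
// for the real branchy computation described above.
std::vector<unsigned long> doWorkVec( int param1, int param2 ) {
    std::vector<unsigned long> counts( 40, 0 );  // one heap allocation per call
    for( int i = 0; i < param1; ++i ) {
        // placeholder for the data-dependent updates
        counts[ i % 40 ] += static_cast<unsigned long>( param2 );
    }
    return counts;  // moved (or elided via NRVO) into the caller's variable
}
```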
Returning the std::array currently in doWork by value.
Naturally this has to copy the data upon return, since a std::array stores its elements directly on the stack and "moving" it still copies every element. 7 minutes 30 seconds.
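Again roughly, this variant looked like the following sketch (names and the placeholder loop are mine):

```cpp
#include <array>

// Sketch of the return-std::array-by-value variant; because std::array
// holds its 40 elements inline, returning it copies them all (though
// NRVO may still elide the copy).
std::array<unsigned long,40> doWorkArr( int param1, int param2 ) {
    std::array<unsigned long,40> counts = {{0}};
    for( int i = 0; i < param1; ++i ) {
        // placeholder for the data-dependent updates
        counts[ i % 40 ] += static_cast<unsigned long>( param2 );
    }
    return counts;
}
```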
Passing a std::array in by reference.
while( std::cin >> param1 >> param2 ) {
    std::array<unsigned long,40> counts = {{0}};
    c.doWork( param1, param2, counts );
}
I would expect this to be roughly equivalent to the original. The data is placed on the stack in the main function, and it is passed by reference to doWork, which fills it. 7 minutes 20 seconds. This one really stymies me.
I have not tried passing pointers in to doWork, because this should be equivalent to passing by reference.
One solution is naturally to have two versions of the function: one that prints locally and one that returns. The roadblock is that I would have to duplicate ALL code, because the entire issue here is that I cannot efficiently get the results out of a function.
So I am mystified. I understand that any of these solutions require an extra dereference for every access to the array/vector inside doWork, but these extra dereferences are highly trivial compared to the huge number of other fast operations and more troublesome data-dependent branches.
I welcome any ideas to explain this. My only thought is that the code is being optimized by the compiler so that some otherwise necessary components of computation are being omitted in the original case, because the compiler realizes that it is not necessary. But this seems to be contraindicated on several counts:
- Making changes to the code inside the loops does change the timings.
- The original timings are 5 minutes 50 seconds, whereas just reading the pairs from the file takes 12 seconds, so a lot is being done.
- Maybe only operations involving counts are being optimized away, but that seems like a strangely selective optimization, given that if those were optimized away the compiler could realize that the supporting computations in doWork are also unnecessary.
- If operations involving counts ARE being optimized away, why are they not optimized away in the other cases? I am not actually using the results in main.
Is it the case that doWork is compiled and optimized independently of main, and thus if the function has any obligation to return the data in any form it cannot be certain of whether it will be used or not?
Is my method of profiling without printing, which was to avoid the cost of the printing to emphasize the relative differences in various methods, flawed?
I am grateful for any light you can shed.