Question

I have written optimized code for an algorithm that computes a vector of quantities, and I have timed it before and after various attempts at getting the computed data out of the function. I don't think the specific nature of the computation, or of the vector of quantities, is relevant. An outline of the code, the timings, and other details follow.

All code was compiled with the following flags:

g++ -Wall -Wextra -Werror -std=c++11 -pedantic -O3

I have a class like this:

#ifndef C_H
#define C_H

#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>
class C {
    public:
        void doWork( int param1, int param2 ) const {
            std::array<unsigned long,40> counts = {{0}};
            // LOTS of branches and inexpensive operations:
            // additions, subtractions, incrementations, and dereferences
            for( /* loop 1 */ ) {
                // LOTS MORE branches and inexpensive operations
                counts[ /* data dependent */ ] += /* data dependent */;
                for( /* loop 2 */ ) {
                    // YET MORE branches inexpensive operations
                    counts[ /* data dependent */ ] += /* data dependent */;
                }
            }
            counts[ /* data dependent */ ] = /* data dependent */;
            /* exclude for profiling
            std::copy( counts.begin(), counts.end(), std::ostream_iterator<unsigned long>( std::cout, "," ) );
            std::cout << "\n";
            */
        }
    private:
        // there is private data here that is processed above
        // the results get added into the array/vector as they are computed
};

#endif

And a main like this:

#include <iostream>
#include "c.h"
int main() {
    C c( /* set the private data of c by passing data in */ );
    int param1;
    int param2;
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2 );
    }
}

Here are some relevant details about the data:

  • 20 million pairs read at standard input (redirected from a file)
  • 20 million calls to c.doWork
  • 60 million TOTAL iterations through the outer loop in c.doWork
  • 180 million TOTAL iterations through the inner loop in c.doWork

All of this requires exactly 5 minutes and 48 seconds to run. Naturally I can print the array within the class function, and that is what I have been doing, but I am going to release the code publicly, and some use cases may involve doing something other than printing the vector. In that case, I need to change the function signature to actually get the data to the user. This is where the problem arises. Things that I have tried:

  • Creating a vector in main and passing it in by reference:

    std::vector<unsigned long> counts( 40 );
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2, counts );
        std::fill( counts.begin(), counts.end(), 0 );
    }
    

    This requires 7 minutes 30 seconds. Removing the call to std::fill only reduces this by 15 seconds, so that doesn't account for the discrepancy.

  • Creating a vector within the doWork function and returning it, taking advantage of move semantics (sketched after this list). Since this requires a dynamic allocation for each call, I didn't expect it to be fast. Strangely, it's not much slower: 7 minutes 40 seconds.

  • Returning the std::array currently in doWork by value (also sketched after this list). Naturally this has to copy the data on return, since the stack array cannot be moved. 7 minutes 30 seconds.

  • Passing a std::array in by reference.

    while( std::cin >> param1 >> param2 ) {
        std::array<unsigned long,40> counts = {{0}};
        c.doWork( param1, param2, counts );
    }
    

    I would expect this to be roughly equivalent to the original. The data is placed on the stack in the main function, and it is passed by reference to doWork, which fills it. 7 minutes 20 seconds. This one really stymies me.
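
For reference, here is roughly the shape the two "return by value" variants took, as members of the class above. This is a sketch rather than the exact code I timed, and the names doWorkVector and doWorkArray are only labels for this outline:

// Variant: build a std::vector inside the function and return it by value
// (moved or elided on return); the buffer is still heap-allocated per call.
std::vector<unsigned long> doWorkVector( int param1, int param2 ) const {
    std::vector<unsigned long> counts( 40, 0 );
    // ... same loops as before, filling counts ...
    return counts;
}

// Variant: return the std::array by value, which copies all 40 unsigned
// longs back to the caller on return.
std::array<unsigned long,40> doWorkArray( int param1, int param2 ) const {
    std::array<unsigned long,40> counts = {{0}};
    // ... same loops as before, filling counts ...
    return counts;
}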

I have not tried passing pointers in to doWork, because this should be equivalent to passing by reference.

One solution, naturally, is to have two versions of the function: one that prints locally and one that returns the results. The roadblock is that I would then have to duplicate ALL of the code, because the entire issue here is that I cannot efficiently get the results out of the function.

So I am mystified. I understand that all of these solutions require an extra dereference for every access to the array/vector inside doWork, but those extra dereferences are trivial compared to the huge number of other fast operations and the more troublesome data-dependent branches.

I welcome any ideas to explain this. My only thought is that the compiler is optimizing the original code so that some otherwise necessary parts of the computation are omitted, because it realizes they are not needed. But this seems to be contradicted on several counts:

  1. Making changes to the code inside the loops does change the timings.
  2. The original timings are 5 minutes 50 seconds, whereas just reading the pairs from the file takes 12 seconds, so a lot is being done.
  3. Maybe only the operations involving counts are being optimized away, but that seems like a strangely selective optimization: if those were being removed, the compiler could also realize that the supporting computations in doWork are unnecessary.
  4. If operations involving counts ARE being optimized away, why are they not optimized away in the other cases? I am not actually using the results in main.

Is it the case that doWork is compiled and optimized independently of main, so that if the function has any obligation to return the data in any form, it cannot be certain whether the data will be used or not?

Is my method of profiling flawed? I suppressed the printing in order to exclude its cost and emphasize the relative differences between the various methods.

I am grateful for any light you can shed.

Solution

What I would do is pause it a few times and see what it's doing most of the time. Looking at your code, I would suspect most of the time goes into either 1) the innermost loop, especially the index calculation, or 2) the allocation of the std::array.

If the size of counts is always 40, I would just do

  long counts[40];
  memset(counts, 0, sizeof(counts));  /* memset needs <cstring> */

That allocates on the stack, which takes no time, and memset takes no time compared to whatever else you're doing.

If the size varies at runtime, then what I do is some static allocation, like this:

void myRoutine(){
  /* this does not claim to be pretty. it claims to be efficient */
  static int nAlloc = 0;
  static long* counts = NULL;
  /* this initially allocates the array, and makes it bigger if necessary */
  if (nAlloc < /* size I need */){
    if (counts) delete[] counts;  /* array new must be paired with delete[] */
    nAlloc = /* size I need */;
    counts = new long[nAlloc];
  }
  memset(counts, 0, sizeof(long)*nAlloc);
  /* do the rest of the stuff */
}

This way, counts is always big enough, and the point is to 1) do new as few times as possible, and 2) keep the indexing into counts as simple as possible.

But first I would do the pauses, just to be sure. After fixing it, I would do that again to see what's the next thing I could fix.

OTHER TIPS

Compiler optimizations are one place to look, but there is another: the changes you made to the code can disturb the cache layout. If the memory holding the array ends up in a different part of memory each time, the number of cache misses can increase, which in turn degrades performance. You can look at the hardware performance counters on your CPU to make a better guess about this.

There are times when unorthodox solutions are applicable, and this may be one. Have you considered making the array a global?
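
A minimal sketch of that idea, assuming the fixed size of 40 stays; the name g_counts is only for illustration:

#include <array>

// The result buffer lives at namespace scope instead of on doWork's stack,
// so main can read it without any change to doWork's signature.
std::array<unsigned long,40> g_counts;

class C {
    public:
        void doWork( int param1, int param2 ) const {
            g_counts.fill( 0 );
            // ... same loops as before, accumulating into g_counts ...
        }
};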

Still, the one crucial benefit that local variables have is that the optimizer can find every access to them using information from the function alone. That makes register assignment a whole lot easier.

A static variable inside the function is almost the same, but in your case the address of the array would escape the function, defeating the optimizer once again.
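
A toy illustration of that escape effect (not your code, just the shape of the problem):

// Purely local array whose contents nothing outside the function can observe:
// the optimizer may keep it in registers or drop the dead stores entirely.
void localOnly() {
    unsigned long counts[40] = {};
    counts[3] += 7;  // can be elided, nothing ever reads it
}

// The caller supplies the array, so its address has escaped: the stores must
// actually land in memory the caller can see.
void fillForCaller( unsigned long (&counts)[40] ) {
    counts[3] += 7;  // must really be written
}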

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow