Question

In my project I have a class where execution time is the top priority. For it I don't care much about maintainability, structure and so on. At least I didn't care until yesterday... Now I'm in a situation where I have to worry a little about that, too.

I have a class, say A, which performs multiple scans over images coming from a camera, i.e. a variable-width window slides over them in real time.



    class A {
        // methods and attributes of A:
        ...
        void runiterator() {
            ...
            for (...) {          // change the window's dimension
                for (...) {      // rows
                    for (...) {  // columns
                        // many lines of code executed for each window at each position
                        ...
                    }
                }
            }
        }
    };

Performance already shows a little delay, but I could solve that by skipping a limited area of the image. Furthermore, I have a second class, say B, which has exactly the same scheme as A but executes different operations on each scan (and luckily is much faster than A).

Well, now it is time to merge all the operations so that the overall result benefits significantly. The trouble is that the code would become really messy and huge, mixing things that are quite different. I thought of defining a class X that does the iterations and, at each scan position, calls one function of A_new and one of B_new (roughly as sketched below). But I'm worried that about 200000x2 function calls per image would cause a loss in performance.
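Something along these lines is what I have in mind (only a rough sketch; Image, processWindow, minWidth/maxWidth and the loop bounds are made up for illustration):

    class X {
    public:
        X(A_new& a, B_new& b) : a_(a), b_(b) {}

        void runiterator(const Image& img) {
            for (int w = minWidth; w <= maxWidth; ++w) {     // change the window's dimension
                for (int r = 0; r < img.rows; ++r) {         // rows
                    for (int c = 0; c < img.cols; ++c) {     // columns
                        a_.processWindow(img, w, r, c);      // ~200000 calls per image
                        b_.processWindow(img, w, r, c);      // ~200000 calls per image
                    }
                }
            }
        }

    private:
        A_new& a_;
        B_new& b_;
    };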

What is your advice?

EDIT
With a class X that only calls A_new (so it can only be compared with what is now A), I get on average, over many repetitions:

Time for executing X on a series of 56 images = 6.15 s
Time for executing A on the same series of 56 images = 5.98 s

It seems that my suspicions were not so naive.
The difference is about 3%, not that much, but I'm still sorry for the loss.

With __forceinline the time is 5.98 s for X as well, but I would prefer not to rely on it.

I think the code is already optimized and the margins for further improvement are very small.
Indeed it does a lot of work on the images in a relatively short time.
Processing the data sequentially is not possible in class A because it depends on the values coming from the images, which are unpredictable. This is the reason why class B (which manages to do it) is much faster.

Solution

You really have to measure whether it causes performance problems before worrying about it.

If there is a problem, try to solve it with templates. Write the two variants of the operations, then use them as functors in a function template that does the iteration. You'll instantiate both versions and call the appropriate one. The compiler should inline the calls (but better verify this).
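A minimal sketch of the idea (the Image type, the loop bounds and the operator() signature are assumptions for illustration, not your actual code):

    #include <vector>

    struct Image { int rows = 0; int cols = 0; std::vector<unsigned char> data; };

    // Operation A, written as a functor.
    struct OpA {
        void operator()(const Image& img, int w, int r, int c) const {
            // ... operations of A for the window of width w at (r, c)
        }
    };

    // Operation B, written as a functor.
    struct OpB {
        void operator()(const Image& img, int w, int r, int c) const {
            // ... operations of B for the window of width w at (r, c)
        }
    };

    // The iteration is written once; the operation is a template parameter,
    // so the compiler sees its exact type and can inline the call at each position.
    template <typename Op>
    void runIterator(const Image& img, Op op) {
        for (int w = 1; w <= 8; ++w)                // window dimension (example bounds)
            for (int r = 0; r < img.rows; ++r)      // rows
                for (int c = 0; c < img.cols; ++c)  // columns
                    op(img, w, r, c);
    }

    // Usage: both instantiations are generated at compile time.
    // runIterator(img, OpA{});
    // runIterator(img, OpB{});

Because the operation type is known at compile time, each instantiation of runIterator gets its own copy of the loops and the call to op can be inlined away, so you keep one copy of the iteration logic without paying for indirect calls.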

I used this on medical image manipulation and it worked like a charm.

OTHER TIPS

Function call overheads depend a lot on the type of function you're calling. At the assembler level (assuming no sneaky OS page-fault handling), the call address instruction on a modern Intel processor takes practically 0 cycles (the same is true for the jmp address instruction). Overhead is introduced when the function address has to be computed from some data source, for example a vtbl lookup or a call to an external DLL (if you're using Win32), or when a conditional is involved. These involve accessing memory and polluting the cache. Which brings us to the biggie.
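To illustrate the difference, a made-up example (not your code):

    struct Base {
        virtual void process(int px) = 0;
        virtual ~Base() = default;
    };

    struct Derived : Base {
        void process(int px) override { /* ... */ }
    };

    void processDirect(int px) { /* ... */ }

    void scan(Base& op, int n) {
        for (int i = 0; i < n; ++i) {
            processDirect(i);  // direct call: target fixed at compile/link time
            op.process(i);     // virtual call: target read from the vtbl at run time,
                               // an extra memory access that also tends to block inlining
        }
    }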

Most performance is lost waiting for data to arrive in the CPU. The speed of the CPU is far greater than the rate at which data can be read from RAM, which is why there are several levels of caching, each level generally bigger and slower than the previous one. The cost of a function call, even a complex one, is less than the time lost when a data read misses the cache.

This sort of thing comes under the heading of micro-optimisation.

Generally, avoid random data accesses and process the data sequentially, i.e. do item n, then n+1, n+2, etc., and not n, n+100, n+200, etc., then n+1, n+101, n+201, etc. (see the sketch below).
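A toy illustration of the access pattern only (the functions and stride are made up):

    #include <cstddef>
    #include <vector>

    // Sequential: walks memory in order, so the hardware prefetcher can keep up.
    long sumSequential(const std::vector<int>& v) {
        long s = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            s += v[i];
        return s;
    }

    // Strided: jumps 100 elements at a time, touching a new cache line almost
    // every access; same arithmetic, usually much slower on large data.
    long sumStrided(const std::vector<int>& v, std::size_t stride = 100) {
        long s = 0;
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < v.size(); i += stride)
                s += v[i];
        return s;
    }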

Also, give the compiler the opportunity to inline functions; that way the inlining will be done wherever it produces faster code (and the compiler has a very good idea of when that is beneficial).

Also note that big functions can be slower than lots of small functions (it's to do with the buffer of uops the CPU caches locally). Iterating over the data several times might be quicker than doing everything in one go. Only profiling the code will tell you which one is quicker.

Finally, a better algorithm is usually the way to better performance. Is your algorithm optimal?

You have to measure the effects, as it's difficult to tell what the compiler really produces, especially at -O3. Sure, function calls have an overhead if the functions are not inlined by the compiler. Try to provide the inline hint to the compiler if the function can be inlined.
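For example (a trivial sketch; whether the call really gets inlined still has to be checked in the generated code or with a profiler):

    // Defining the function in a header and marking it inline gives the
    // compiler the chance to inline it at every call site.
    inline int windowScore(const unsigned char* row, int width) {
        int s = 0;
        for (int i = 0; i < width; ++i)
            s += row[i];
        return s;
    }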

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow