Most cpu-efficient way to use std:: algorithms with arguments to a variadic function template?

Question 1

Apparently yes but in a twisted way. Consider the following code:

#include <algorithm>
#include <array>
#include <cstdio>
#include <iterator>

template<typename... Ts>
int sum(Ts... numbers) {
    std::array<int,sizeof...(numbers)> list{{numbers...}};
    return std::accumulate(std::begin(list), std::end(list), 0);
}

__attribute__((noinline))
void f(int x, int y, int z) {
  std::printf("sum = %d\n", sum(x, y, z));
}

int main(int argc, char* argv[]) {
  int x = std::atoi(argv[1]);
  int y = std::atoi(argv[2]);
  int z = std::atoi(argv[3]);    
  f(x, y, z);
}

I looked at the generated assembly code. Here is what sum() is optimized into by clang, the assembly code rewritten to C by me for clarity:

int sum(int x, int y, int z) {
  int tmp = x;
  tmp += y;
  tmp += z;
  return tmp;
}

I can say that the generated assembly code is optimal! It got rid of the temporary std::array and unrolled the loop in std::accumulate().

So the answer to your question: even if you create a temporary iterable container, it can be optimized away if the compiler is smart enough and your numeric types are simple enough (built-in types or PODs). You won't pay for the creation of a temporary container or for copying the elements into the temporary container if it can be optimized away.

Sadly, gcc 4.7.2 wasn't that dexterous:

int sum(int x, int y, int z) {
  int a[3];
  a[0] = x;
  a[1] = y;
  a[2] = z;
  int tmp = x;
  tmp += y;
  tmp += z;
  return tmp;
}

Unfortunately, it did not recognize that it can get rid of the temporary array. I will check that with the latest gcc from trunk and if the problem still exists, file a bug report; it seems like a bug in the optimizer.

Question 2

You can create a container of std::reference_wrapper objects and accumulate over that. Here's an example:

#include <iostream>
#include <functional>
#include <algorithm>
#include <array>
using namespace std;
template<typename... ArgTypes>
int sum(const ArgTypes&... numbers) {
    array<reference_wrapper<const int>, sizeof...(numbers)> A = {ref(numbers)...};
    return accumulate(A.begin(), A.end(), 0);
}
int main() {
    int x = 1, y = 2, z = 3;
    cout << sum(x, y, z) << endl; // prints 6
}

I'll leave it to you to figure out how to adapt this to a more general setting like in your question.

Question 3

Once your template is instanciated, the compiler sees a number of distinct parameters. Each argument can even be of a different type.

It's exactly as if you wanted to iterate over the arguments of fun (a, b, c, d) and expect the code optimizer to cope with layers upon layers of obfuscation.

You could go for a recursive template, but that would be as cryptic as inefficient.

You could design a template-less variadic function, but then you would have to use the <cstdarg> interface and could kiss std::accumulate goodbye.

Possibly you could use the variadic arguments as an initializer for a plain old array and use std::accumulate on it, provided you restrict the use of your shiny new toy to possibly inlineable parameters, namely a list of objects that can be converted to a single base type at compile time.

If you have big and costly objects, this method can still be used with const references to the said objects. I suppose you will spend quite a bit of time optimizing the operators involved in the accumulation computation if you want to squeeze performances out of it, but well, anything is doable with enough blood and sweat.

#include <array>
#include <numeric>
#include <type_traits>

using namespace std;


// Since we need to get back the base type, might as well check that the
// "optimized" code is not fed with junk that would require countless implicit
// conversions and prevent the compiler from inlining the stupid dummy function
// that should just act as a wrapper for the underlying array initialization.
template<class T, class...>
struct same_type
{
    static const bool value = true;
    typedef T type;
};

template<class Ta, class Tb, class... Types>
struct same_type<Ta, Tb, Types...>
{
    static const bool value = is_same<Ta,Tb>::value && same_type<Tb, Types...>::value;
    typedef Ta type;
};

// --------------------------------------------------------
// dummy function just here to make a copy of its arguments
// and pass it to std::accumulate
// --------------------------------------------------------
template<typename F, typename...Args> 
typename same_type<Args...>::type do_something(F fun, Args...args)
{
    // just a slight bit less of obfuscation
    using base_type = same_type<Args...>::type;

    // make sure all arguments have the same type
    static_assert(same_type<Args...>::value, "all do_something arguments must have the same type");

    // arguments as array
    array<base_type, sizeof...(Args)> values = { args... };
    return accumulate (values.begin(), values.end(), (base_type)0, fun);
}

// --------------------------------------------------------
// yet another glorious functor
// --------------------------------------------------------
struct accumulator {
    template<class T>
    T operator() (T res, T val)
    {
        return res + val;
    }
};

// --------------------------------------------------------
// C++11 in its full glory
// --------------------------------------------------------
int main(void)
{
    int    some_junk = do_something(accumulator(),1,2,3,4,6,6,7,8,9,10,11,12,13,14,15,16);
    double more_junk = do_something(accumulator(),1.0,2.0,3.0,4.0,6.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0);
    return some_junk+(int)more_junk;
}

I had a look at the muck generated by the latest Microsoft compiler. It did inline the double version entirely. Most of the code is busy initializing the array, the rest is a half-dozen instructions loop. Note that the loop was not unrolled either.

It did not inline the int version completely. It removed the dummy function call but generated an instance of the accumulate template.

Not surprisingly, the compiler won't bother to optimize a function if it reckons the number and size of parameters passed don't justify it, since it has no way to know this piece of code will be called a few gazillion times per second by an idiotic design.

You could certainly have a lot of fun sprinkling the code with register and inline directives and pragmas and tweaking the compiler optimization options, but it's a dead end, IMHO.

A perfect example of bad design using supposedly cutting edge technology to bonk rocks together, if you ask me.