As others have already pointed out, lambdas and function objects are likely to be inlined, especially if the body of the function is not too long. As a consequence, they are likely to be better in terms of speed and memory usage than the std::function approach. The compiler can optimize your code much more aggressively if the function can be inlined; the difference can be shocking. std::function would be my last resort for this reason, among other things.
But when working with dozens (or hundreds) of templates, and numerous functors, the compilation times and memory usage difference can be substantial.
As for the compilation times, I wouldn't worry too much about it as long as you are using simple templates like the one shown. (If you are doing template metaprogramming, yeah, then you can start worrying.)
Now, the memory usage: by the compiler during compilation, or by the generated executable at run time? For the former, the same holds as for the compilation time. For the latter: inlined lambdas and function objects are the winners.
Can we say that in many circumstances std::function (or even function pointers) must be preferred over templates + raw functors/lambdas? I.e. wrapping your functor or lambda with std::function may be very convenient.
I am not quite sure how to answer that one; I cannot define "many circumstances".
However, one thing I can say for sure is that type erasure is a way to avoid / reduce code bloat due to templates; see Item 44: Factor parameter-independent code out of templates in Effective C++. By the way, std::function uses type erasure internally. So yes, code bloat is an issue.
I am aware that std::function (and function pointers too) introduce an overhead. Is it worth it?
"Want speed? Measure." (Howard Hinnant)
One more thing: function calls through function pointers can be inlined (even across compilation units!). Here is a proof:
#include <cstdio>

bool lt_func(int a, int b)
{
    return a < b;
}

void compare_int(int a, int b, const char* msg, bool (*cmp_func)(int a, int b)) {
    if (cmp_func(a, b)) printf("a is %s b\n", msg);
    else printf("a is not %s b\n", msg);
}

void f() {
    compare_int(10, 5, "less than", lt_func);
}
This is a slightly modified version of your code. I removed all the iostream stuff because it makes the generated assembly cluttered. Here is the assembly of f():
.LC1:
.string "a is not %s b\n"
[...]
.LC2:
.string "less than"
[...]
f():
.LFB33:
.cfi_startproc
movl $.LC2, %edx
movl $.LC1, %esi
movl $1, %edi
xorl %eax, %eax
jmp __printf_chk
.cfi_endproc
This means that gcc 4.7.2 inlined lt_func at -O3. In fact, the generated assembly code is optimal.
I have also checked: I moved the implementation of lt_func into a separate source file and enabled link-time optimization (-flto). GCC still happily inlined the call through the function pointer! This is nontrivial, and you need a quality compiler to do it.
Just for the record, so that you can actually feel the overhead of the std::function approach, this code:
#include <cstdio>
#include <functional>

template <class Compare>
void compare_int(int a, int b, const char* msg, Compare cmp_func)
{
    if (cmp_func(a, b)) printf("a is %s b\n", msg);
    else printf("a is not %s b\n", msg);
}

void f() {
    std::function<bool(int,int)> func_lt = [](int a, int b) { return a < b; };
    compare_int(10, 5, "less than", func_lt);
}
yields this assembly at -O3 (approx. 140 lines):
f():
.LFB498:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA498
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl $1, %edi
subq $80, %rsp
.cfi_def_cfa_offset 96
movq %fs:40, %rax
movq %rax, 72(%rsp)
xorl %eax, %eax
movq std::_Function_handler<bool (int, int), f()::{lambda(int, int)#1}>::_M_invoke(std::_Any_data const&, int, int), 24(%rsp)
movq std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}> const&, std::_Manager_operation), 16(%rsp)
.LEHB0:
call operator new(unsigned long)
.LEHE0:
movq %rax, (%rsp)
movq 16(%rsp), %rax
movq $0, 48(%rsp)
testq %rax, %rax
je .L14
movq 24(%rsp), %rdx
movq %rax, 48(%rsp)
movq %rsp, %rsi
leaq 32(%rsp), %rdi
movq %rdx, 56(%rsp)
movl $2, %edx
.LEHB1:
call *%rax
.LEHE1:
cmpq $0, 48(%rsp)
je .L14
movl $5, %edx
movl $10, %esi
leaq 32(%rsp), %rdi
.LEHB2:
call *56(%rsp)
testb %al, %al
movl $.LC0, %edx
jne .L49
movl $.LC2, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
.LEHE2:
.L24:
movq 48(%rsp), %rax
testq %rax, %rax
je .L23
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
.LEHB3:
call *%rax
.LEHE3:
.L23:
movq 16(%rsp), %rax
testq %rax, %rax
je .L12
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
.LEHB4:
call *%rax
.LEHE4:
.L12:
movq 72(%rsp), %rax
xorq %fs:40, %rax
jne .L50
addq $80, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 16
popq %rbx
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L49:
.cfi_restore_state
movl $.LC1, %esi
movl $1, %edi
xorl %eax, %eax
.LEHB5:
call __printf_chk
jmp .L24
.L14:
call std::__throw_bad_function_call()
.LEHE5:
.L32:
movq 48(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L20
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
call *%rcx
.L20:
movq 16(%rsp), %rax
testq %rax, %rax
je .L29
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
call *%rax
.L29:
movq %rbx, %rdi
.LEHB6:
call _Unwind_Resume
.LEHE6:
.L50:
call __stack_chk_fail
.L34:
movq 48(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L20
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
call *%rcx
jmp .L20
.L31:
movq %rax, %rbx
jmp .L20
.L33:
movq 16(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L29
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
call *%rcx
jmp .L29
.cfi_endproc
Which approach would you like to choose when it comes to performance?