Passing the functor using std::cref is likely to be counterproductive, but I make no promises. Only empirical testing in the precise context of interest can be definitive. In general, for tbb::parallel_for, my recommendation is:
- Pass the lambda by value.
- Unless semantic considerations dictate a capture mode, have the lambda capture objects by reference, except for small objects that are cheap to copy. Remember that the captured variables will typically be accessed many more times than the lambda is copied.
Does TBB pay the cost of heap allocation for the functor? The answer is definitely no for the signature of the form parallel_for(first,last,functor), because that form passes the functor by reference.
For the signature of the form parallel_for(range,functor), as in the question, the answer is "no additional cost". It does not heap-allocate the functor directly. But each task that TBB creates has a copy of the functor, and the tasks are heap-allocated (usually quickly via local free-lists). Using std::cref is not going to change the fact that the tasks are heap-allocated. Using std::cref will just add an extra level of indirection.
I was actually a little surprised that one form of tbb::parallel_for passes the functor by reference and another by value. I forget the reason, though I'm sure the TBB group must have debated it. The choice may have been motivated by whatever benchmarks and machines were available at the time each form was introduced, or maybe it's a PPL compatibility issue with the "first,last" form, which seems not to require that the functor be copy-constructible. As hinted at earlier, the performance tradeoff of passing by reference versus passing by value is not simple. Passing by reference makes passing the functor around cheap, but adds the cost of an indirection each time it is accessed (unless the compiler can optimize it away).
As to the lifetime of the functor argument, it just has to exist for the duration of the call to parallel_for.