Is there an overhead on parallel_for (Intel TBB) similar to the overhead we see on std::function?

Question 1

Passing the functor using std::cref is likely to be counterproductive, but I make no promises. Only empirical testing in the precise context of interest can be definitive. In general, for tbb::parallel_for, my recommendation is:

Pass the lambda by value.
Unless there are semantic considerations that dictate a capture mode, have the lambda capture objects by reference unless they are small objects that are cheap to copy. Remember that the captured variables will typically be accessed many more times than the lambda is copied.
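The two recommendations above can be sketched as follows. To keep the sketch free of a TBB dependency, `for_each_chunk` is a hypothetical serial stand-in for tbb::parallel_for, and `scale_in_place` is an illustrative name as well; neither comes from TBB. Only the capture list and the by-value parameter are the point here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for tbb::parallel_for(range, body).
// It takes the body by value and calls it once per chunk [begin, end).
template <typename Body>
void for_each_chunk(std::size_t n, std::size_t chunk, Body body) {
    assert(chunk > 0);  // a zero chunk size would never advance
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        body(begin, end);
    }
}

// Multiplies every element of `data` by `factor`, in place.
void scale_in_place(std::vector<double>& data, double factor) {
    // `data` may be large, so the lambda captures it by reference.
    // `factor` is a cheap scalar, so the lambda captures it by value.
    auto body = [&data, factor](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            data[i] *= factor;
        }
    };
    // The lambda object itself is passed by value: copying it copies
    // only one reference and one double, not the vector.
    for_each_chunk(data.size(), 4, body);
}
```

With real TBB, the same capture pattern would apply to a lambda taking a `const tbb::blocked_range<std::size_t>&`; whether it is actually faster than the alternatives still needs measuring in your own context.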

Does TBB pay the cost of heap allocation for the functor? The answer is definitely no for the signature of the form parallel_for(first, last, functor), because that form passes the functor by reference.

For the signature of the form parallel_for(range, functor), as in the question, the answer is "no additional cost". It does not heap-allocate the functor directly. But each task that TBB creates has a copy of the functor, and the tasks are heap-allocated (usually quickly via local free-lists). Using std::cref is not going to change the fact that the tasks are heap-allocated. Using std::cref will just add an extra level of indirection.
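A rough way to see what std::cref changes is to count copies of the functor. The sketch below does not use TBB: `run_with_task_copies` is a hypothetical stand-in that makes one copy of the body per "task", loosely imitating the per-task copies described above, and `CountingBody` is an illustrative functor. The number of copies real TBB makes depends on how it splits the range, so the counts here are only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Illustrative functor that counts how often it is copy-constructed.
struct CountingBody {
    int* copies;  // counter owned by the caller
    explicit CountingBody(int* c) : copies(c) {}
    CountingBody(const CountingBody& other) : copies(other.copies) { ++*copies; }
    void operator()(std::size_t) const {}
};

// Hypothetical stand-in for task creation: each "task" stores its own
// copy of whatever Body type it was given, then invokes it.
template <typename Body>
void run_with_task_copies(const Body& body, std::size_t tasks) {
    assert(tasks > 0);
    for (std::size_t t = 0; t < tasks; ++t) {
        Body task_copy = body;  // one copy per task
        task_copy(t);
    }
}

// Passing the functor itself: every task copies the functor.
int functor_copies_by_value(std::size_t tasks) {
    int copies = 0;
    CountingBody body(&copies);
    run_with_task_copies(body, tasks);
    return copies;
}

// Passing std::cref(body): Body is std::reference_wrapper<const CountingBody>,
// so each task copies only the wrapper (a pointer) and every call goes
// through that pointer to the single original functor.
int functor_copies_with_cref(std::size_t tasks) {
    int copies = 0;
    CountingBody body(&copies);
    run_with_task_copies(std::cref(body), tasks);
    return copies;
}
```

In this sketch the per-task copy still happens in both cases; std::cref only changes what is copied (a pointer-sized wrapper instead of the functor) and adds an indirection on every call.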

I was actually a little surprised that one form of tbb::parallel_for passes the functor by reference and another by value. I forget the reason, and I'm sure the TBB group must have debated it. The choice may have been motivated by whatever benchmarks and machines were available at the time each was introduced, or maybe it's a PPL compatibility issue with the "first, last" form, which seems not to require that the functor be copy-constructible. As hinted at earlier, the performance tradeoff of passing by reference versus passing by value is not simple. Passing by reference makes passing the functor around cheap, but adds the cost of an indirection each time the functor is accessed (unless the compiler can optimize it away).

As to the lifetime of the functor argument, it just has to exist for the duration of the call to parallel_for.

Question 2

Should I pass my functors/lambdas by reference to parallel_for using std::cref to speed up the code?

I don't know the answer to your main question. But it doesn't matter because you should never do that with tbb::parallel_for.

As Cassio Neri pointed out in his answer:

Finally, notice that the lifetime of the lambda encloses that of the std::function.

That was true for the circumstances of the question he was asking. But this is not true for tbb::parallel_for. The entire point of parallel_for is that it will call the given function from other threads at an arbitrary time in the future.

If you give it some functor by reference, then you must ensure that this functor's lifetime continues until the parallel_for is finished. Otherwise, parallel_for may try to call through a reference to a destroyed object.

That's bad.

So regardless of whatever overhead may happen, you can't cure it with references.