The compact form tbb::parallel_for(first,last,lambda) does some load balancing, so you might try it first. It relies on a heuristic to guess a good grainsize, though, and that heuristic can occasionally be fooled.
For the best load balancing, possibly at the expense of extra per-iteration overhead, use the range-based tbb::parallel_for with a grainsize of 1 and a simple_partitioner. That forces each iteration to run as a separate task, which gives the TBB runtime maximum flexibility to rebalance load. Below is a sample that executes 100 iterations, each with a random delay of up to one second.
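#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <stdio.h>   // printf
#include <stdlib.h>  // random (POSIX)
#include <unistd.h>  // usleep (POSIX)

int main( int argc, char* argv[] ) {
    tbb::parallel_for(
        tbb::blocked_range<int>(0,100,1), // Interval [0,100) with grainsize==1
        [&](const tbb::blocked_range<int>& r) {
            // With simple_partitioner and grainsize==1, r holds a single iteration
            for( int i=r.begin(); i!=r.end(); ++i ) {
                printf("%d\n",i);
                usleep(random()%1000000); // Sleep for up to one second
            }
        },
        tbb::simple_partitioner());
}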