You can raise your limits on stack space, but it is futile. Too many threads, even with a pool, will eat it up at a cost of log2(N) stack frames per thread. Go for an iterative approach and reduce your overhead; overhead is the killer.
As far as performance goes, you will find that some level of over-commit of N threads, where N is the hardware concurrency, will probably yield the best results, striking a good balance between overhead and work per core. If N gets very large, as on a GPU, then other options exist (e.g. bitonic sort) that make different trade-offs to reduce the communication (waiting/joining) overhead.
Assuming you have a task manager and a semaphore that is constructed to release its waiter after N notifies:
```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>
#include <sometaskmanager.h>

void parallel_merge( size_t N ) {
	std::array<int, 1000> ary {0};
	// fill array...
	using iterator = std::array<int, 1000>::iterator;

	// Partition the array into roughly N contiguous ranges
	intmax_t const stride_size = std::max<intmax_t>( ary.size( )/N, 1 ); //TODO: Put a MIN size here
	std::vector<std::pair<iterator, iterator>> ranges;
	auto last_it = ary.begin( );
	while( last_it != ary.end( ) ) {
		auto const step = std::min<intmax_t>( std::distance( last_it, ary.end( ) ), stride_size );
		ranges.emplace_back( last_it, std::next( last_it, step ) );
		std::advance( last_it, step );
	}

	// Sort each range in its own task; each task notifies the semaphore,
	// which releases the waiter once every range has been sorted
	auto semaphore = make_semaphore( ranges.size( ) );
	for( auto const & rng: ranges ) {
		add_task( [&semaphore, rng]( ) {
			std::sort( rng.first, rng.second );
			semaphore.notify( );
		});
	}
	semaphore.wait( );

	// Merge adjacent pairs of sorted ranges until one sorted range remains
	while( ranges.size( ) > 1 ) {
		semaphore = make_semaphore( ranges.size( )/2 );
		std::vector<std::pair<iterator, iterator>> new_rng;
		for( size_t n=0; n+1<ranges.size( ); n+=2 ) {
			auto first = ranges[n].first;
			auto last = ranges[n+1].second;
			add_task( [&semaphore, first, mid=ranges[n].second, last]( ) {
				std::inplace_merge( first, mid, last );
				semaphore.notify( );
			});
			new_rng.emplace_back( first, last );
		}
		if( ranges.size( ) % 2 != 0 ) {
			new_rng.push_back( ranges.back( ) );
		}
		ranges = std::move( new_rng );
		semaphore.wait( );
	}
}
```
As you can see, the bottleneck is in the merge phase, where a lot of coordination must be done. Sean Parent gives a good presentation on building a task manager if you don't have one, along with a relative performance analysis, in his talk Better Code: Concurrency, http://sean-parent.stlab.cc/presentations/2016-11-16-concurrency/2016-11-16-concurrency.pdf . TBB and PPL also provide task managers.