使用iTerand努力的A-Priori估算来预测并行循环的运行时（给定的工人数量）

Question

我提出了一个令人满意的解决方案，因此，如果有人有兴趣，我会分享它。我仍然感谢您如何改进/微调该方法的评论。

基本上，我认为唯一明智的方法是为并行循环构建调度程序的（非常）基本模型：

function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   c: Estimated cost of parallel computation

num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);

i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
    i=i+1;
    [~,i_min]=min(c); % which core finished first; is fed with next block
    c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end

c=max(c);

end

参数 cost_it 对于空的迭代是许多不同副作用的粗略混合，可以想象是分开的：一个空的迭代成本 for/parfor- 环（每个块也可能是不同的），以及启动时间resp。数据传输 parfor- 环（可能更多）。将所有内容汇总在一起的主要原因是，我不想估计/确定更精细的成本。

我使用上述例程以以下方式确定截止：

% function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial an parallel regime

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   i: Number of blocks to be calculated serially

num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);

for i=0:num_blocks
    cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
    cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end

[~,i]=min(sum(cost,2));
i=i-1;

end

特别是，我不会夸大/更改价值 est_cost_para 假设（除了 cost_it）最乐观的计划。我把它留下来主要是因为我不知道什么会最好。要保守（即避免喂食太大的块到平行循环），当然可以添加一些百分比作为缓冲区，甚至可以使用> 1的功率来膨胀并行成本。

还请注意 est_cost_para 被称为较少的块（尽管我使用变量名称） cost_blocks 对于两个例程，一个是另一个子集）。

与我的Wordy问题中的方法相比，我看到了两个主要优点：

数据之间的相对复杂的依赖性（块数量及其成本）和模拟调度程序捕获的核心数量要比单个公式所能捕获的要好得多。
通过计算串行/并行分布的所有可能组合的成本，然后将最低分配的成本计算，一个人在从一侧读取数据时就不会太早地“卡住”（例如，跳跃相对于数据相对于数据而言是大的，但与总数相比很小）。

当然，通过调用渐近复杂性更高 est_cost_para 一直以来一直在循环，但就我而言num_blocks<500）这绝对可以忽略不计。

最后，如果一个体面的价值 cost_it 不容易出现，可以尝试通过测量每个块的实际执行时间以及其纯粹的平行部分来计算它 cost_it 对于例行程序的下一个呼叫（通过使用总成本和并行成本之间的差异或将零成本插入拟合公式）。希望这应该“汇聚”到最有用的价值 cost_it 对于所讨论的问题。