I'd like @chepner hack.
And it seems not so tricky accomplish similar behaviour with limiting number of parallel executions:
while IFS=$'\t' read -r f1 f2;
do
myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
# At most as number of CPU cores
[ $( jobs | wc -l ) -ge $( nproc ) ] && wait
done < fileinput
wait
It limit execution at max of number of CPU cores present on system. You may easily vary that by replace $( nproc )
by desired amount.
Meantime you should understand what it is not honest distribution. So, it not start new thread just after one finished. Instead it just wait finishing all, after start max amount. So summary throughput may be slightly less than with parallel. Especially if run time of your program may vary in big range. If time spent on each invocation is almost same then summary time also should be roughly equivalent.