Here's a version of std::lower_bound
which had only 1 branch (namely, the begin != end
test) when I tested it with Visual C++ 2012 (64-bit):
template<class FwdIt, class T, class P>
FwdIt branchless_lower_bound(FwdIt begin, FwdIt end, T const &val, P pred)
{
while (begin != end)
{
FwdIt middle(begin);
std::advance(middle, std::distance(begin, end) >> 1);
FwdIt middle_plus_one(middle);
++middle_plus_one;
bool const b = pred(*middle, val);
begin = b ? middle_plus_one : begin;
end = b ? end : middle;
}
return begin;
}
32-bit with SSE2 support would probably be able to use the Conditional-Move instruction as well, to gain similar speed.
Now the speed should be competitive with linear search for small arrays... but it might be worth checking.
Interestingly, I found that for a vector<int>
up to size (approximately) 45 on my CPU, a linear search is still faster! Not sure why though, or if my measurement was accurate...
Also turns out that this isn't any faster than a branching binary search on my i5 CPU.