How do I perform an (almost-)branch-less binary search on arbitrary sorted data?

Question

Here's a version of std::lower_bound which had only 1 branch (namely, the begin != end test) when I tested it with Visual C++ 2012 (64-bit):

template<class FwdIt, class T, class P>
FwdIt branchless_lower_bound(FwdIt begin, FwdIt end, T const &val, P pred)
{
    while (begin != end)
    {
        FwdIt middle(begin);
        std::advance(middle, std::distance(begin, end) >> 1);
        FwdIt middle_plus_one(middle);
        ++middle_plus_one;
        bool const b = pred(*middle, val);
        begin = b ? middle_plus_one : begin;
        end = b ? end : middle;
    }
    return begin;
}

32-bit with SSE2 support would probably be able to use the Conditional-Move instruction as well, to gain similar speed.

Now the speed should be competitive with linear search for small arrays... but it might be worth checking.

Interestingly, I found that for a vector<int> up to size (approximately) 45 on my CPU, a linear search is still faster! Not sure why though, or if my measurement was accurate...

Also turns out that this isn't any faster than a branching binary search on my i5 CPU.