Thrust vectorized search: Efficiently combine lower_bound and binary_search to find both position and existence
-
16-06-2021 - |
Question
I'm trying to use Thrust to detect if each element of an array can be found in another array and where (both arrays are sorted). I came across the vectorized search routines (lower_bound and binary_search).
lower_bound will return for each value the index where it could be inserted in a list respecting its ordering.
I also need to know if the value is found or not (which can be done with binary_search), not just its position.
Is it possible to achieve both efficiently without making two searches (calling binary_search and then lower_bound)?
I know in the scalar case, lower_bound will return a pointer to end of the array if a value cannot be found, but this does not happens in the vectorized version.
Thanks!
Solution
You can check that the element that lower_bound
returns is the same as the one you searched for. E.g. given a = {1,3,5}
and searching for b = {1,4}
, the result will be c = {0,2}
. We have a[c[0]] == b[0]
, so b[0]
is in a
, but a[c[1]] != b[1]
so b[1]
is not in a
.
(Note that you will need to ensure that you don't make any out-of-bounds memory accesses, since lower_bound
can return an index that is beyond the end of the array.)
OTHER TIPS
@tat0: you can also play around with Arrayfire: vectorized search using lower_bound() does not give you the answer immediately while with setintersect() in arrayfire, you get the "intersection" of two arrays directly:
float A_host[] = {3,22,4,5,2,9,234,11,6,17,7,873,23,45,454};
int szA = sizeof(A_host) / sizeof(float);
float B_host[] = {345,5,55,6,7,8,19,2,63};
int szB = sizeof(B_host) / sizeof(float);
// initialize arrays from host data
array A(szA, 1, A_host);
array B(szB, 1, B_host);
array U = setintersect(A, B); // compute intersection of 2 arrays
int n_common = U.elements();
std::cout << "common: ";
print(U);
the output is: common: U = 2.0000 5.0000 6.0000 7.0000
to get the actual locations of these elements in array A, you can use the following construct (provided that elements in A are unique):
int n_common = U.elements();
array loc = zeros(n_common); // empty array
gfor(array i, n_common) // parallel for loop
loc(i) = sum((A == U(i))*seq(szA));
print(loc);
then: loc = 4.0000 3.0000 8.0000 10.0000
Furthermore, thrust::lower_bound() seems to be slower than setintersect(), i benchmarked it with the following program:
int *g_data = 0;
int g_N = 0;
void thrust_test() {
thrust::device_ptr<int> A = thrust::device_pointer_cast((int *)g_data),
B = thrust::device_pointer_cast((int *)g_data + g_N);
thrust::device_vector<int> output(g_N);
thrust::lower_bound(A, A + g_N, B, B + g_N,
output.begin(),
thrust::less<int>());
std::cout << "thrust: " << output.size() << "\n";
}
void af_test()
{
array A(g_N, 1, g_data, afDevicePointer);
array B(g_N, 1, g_data + g_N, afDevicePointer);
array U = setintersect(A, B);
std::cout << "intersection sz: " << U.elements() << "\n";
}
int main()
{
g_N = 3e6; // 3M entries
thrust::host_vector< int > input(g_N*2);
for(int i = 0; i < g_N*2; i++) { // generate some input
if(i & 1)
input[i] = (i*i) % 1131;
else
input[i] = (i*i*i-1) % 1223 ;
}
thrust::device_vector< int > dev_input = input;
// sort the vector A
thrust::sort(dev_input.begin(), dev_input.begin() + g_N);
// sort the vector B
thrust::sort(dev_input.begin() + g_N, dev_input.begin() + g_N*2);
g_data = thrust::raw_pointer_cast(dev_input.data());
try {
info();
printf("thrust: %.5f seconds\n", timeit(thrust_test));
printf("af: %.5f seconds\n", timeit(af_test));
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
}
return 0;
}
and the results:
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1937 MB free (2048 MB total)
thrust: 0.13008 seconds
arrayfire: 0.06702 seconds