Question

Consider two numpy arrays

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

How would I be able to produce a third array

c = np.array([0,1,2,1,1,2,1])

that has the same length as a and gives, for each entry of a, its index in the array b?

I can see a way to do it by looping over the elements of b as b[i] and checking np.where(a == b[i]), but I was wondering whether numpy could accomplish this faster, more cleanly, or in fewer lines of code.

Solution

Here is one option:

import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

my_dict = dict(zip(b, range(len(b))))

result = np.vectorize(my_dict.get)(a)

Result:

>>> result
array([0, 1, 2, 1, 1, 2, 1])
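
If some entries of a might not appear in b, my_dict.get returns None for them, and the result is no longer a clean integer array. Here is a minimal sketch of the same dictionary idea with an explicit default. The helper name index_in and the -1 sentinel are assumptions made for illustration, not part of the answer above:

```python
import numpy as np

def index_in(a, b, missing=-1):
    """Map each entry of `a` to its position in `b` (hypothetical helper)."""
    # Build value -> position once; .tolist() yields plain Python strings.
    lookup = {name: i for i, name in enumerate(b.tolist())}
    # One dictionary lookup per element of `a`; unknown names get `missing`.
    return np.array([lookup.get(name, missing) for name in a.tolist()], dtype=int)

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

c = index_in(a, b)                                    # all names present in b
c_missing = index_in(np.array(['bill', 'anna']), b)   # 'anna' is not in b
```

With the sample arrays c matches the result above, and an unknown name such as 'anna' maps to -1.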

OTHER TIPS

Sorting is a good option for vectorization with numpy:

>>> s = np.argsort(b)
>>> s[np.searchsorted(b, a, sorter=s)]
array([0, 1, 2, 1, 1, 2, 1], dtype=int64)

If your array a has m elements and b has n, the sorting is going to be O(n log n) and the searching O(m log n), which is not bad. Dictionary-based solutions should be amortized linear, but unless the arrays are huge, the Python-level looping may make them slower than this. Broadcasting-based solutions have quadratic complexity, O(m n), so they will only be faster for very small arrays.
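
One thing to watch with searchsorted: for a value that is not in b it returns an insertion point, which can even be len(b), so the lookup either fails or silently returns a wrong index. A rough sketch of the same sort-and-search steps with a membership check added; the helper name index_in_sorted and the -1 sentinel are assumptions for illustration, and b is assumed to be non-empty:

```python
import numpy as np

def index_in_sorted(a, b):
    """Position in `b` of each entry of `a`, or -1 if absent (hypothetical helper)."""
    s = np.argsort(b)                        # order that sorts b: O(n log n)
    pos = np.searchsorted(b, a, sorter=s)    # binary search per element: O(m log n)
    pos = np.clip(pos, 0, len(b) - 1)        # an insertion point may equal len(b)
    idx = s[pos]                             # map back to positions in the original b
    return np.where(b[idx] == a, idx, -1)    # keep only genuine matches

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

c = index_in_sorted(a, b)
c_missing = index_in_sorted(np.array(['zed', 'anna', 'greg']), b)
```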


Some timings with your sample:

In [3]: %%timeit
   ...: s = np.argsort(b)
   ...: np.take(s, np.searchsorted(b, a, sorter=s))
   ...: 
100000 loops, best of 3: 4.16 µs per loop

In [5]: %%timeit
   ...: my_dict = dict(zip(b, range(len(b))))
   ...: np.vectorize(my_dict.get)(a)
   ...: 
10000 loops, best of 3: 29.9 µs per loop

In [7]: %timeit (np.arange(b.size)*(a==b[:,np.newaxis]).T).sum(axis=-1)
100000 loops, best of 3: 18.5 µs per loop
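
The sample arrays are tiny, so these numbers mostly measure fixed overhead. To see how the three approaches scale, something like the following could be run on larger inputs. The sizes (200 labels, 5000 entries), the generated labels, and the function names are arbitrary choices for this sketch, and the timings will depend on your machine:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
b = np.array([f"name{i}" for i in range(200)])   # n = 200 unique labels
a = rng.choice(b, size=5000)                     # m = 5000 draws from b

def by_sorting(a, b):
    s = np.argsort(b)
    return s[np.searchsorted(b, a, sorter=s)]

def by_dict(a, b):
    d = dict(zip(b, range(len(b))))
    return np.vectorize(d.get)(a)

def by_broadcast(a, b):
    # Builds an (m, n) boolean table, hence the quadratic cost.
    return (np.arange(b.size) * (a == b[:, np.newaxis]).T).sum(axis=-1)

expected = by_sorting(a, b)
for f in (by_sorting, by_dict, by_broadcast):
    # Check that every approach agrees before timing it.
    assert np.array_equal(f(a, b), expected)
    t = timeit.timeit(lambda: f(a, b), number=5) / 5
    print(f"{f.__name__}: {t * 1e3:.3f} ms per call")
```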

Create a dictionary that maps each string to its index, and then use numpy.vectorize to build the output array:

>>> import numpy as np
>>> a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
>>> b = np.array(['john', 'bill', 'greg'])
>>> d = {k:v for v, k in enumerate(b)}
>>> c = np.vectorize(d.get)(a)
>>> c
array([0, 1, 2, 1, 1, 2, 1])

This is more efficient than looping and doing np.where(a == b[i]) because you visit each element of the array only once, instead of scanning all of a once for every element of b.
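
For comparison, the loop from the question might look roughly like this. It is a sketch of the baseline only; the -1 fill value for names that never match is an assumption added here:

```python
import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])
b = np.array(['john', 'bill', 'greg'])

c = np.full(a.shape, -1)   # -1 marks entries not found in b (assumed sentinel)
for i in range(len(b)):
    # Each pass compares b[i] against every element of a, so the
    # total work is len(a) * len(b) comparisons.
    c[np.where(a == b[i])] = i
```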

Fully numpy solution:

(np.arange(b.size) * (a == b[:, np.newaxis]).T).sum(axis=-1)

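Another solution is possible with np.unique. Note that it returns, for each entry of a, the position of that value's first occurrence in a; this matches the index in b only when a starts with the elements of b in the same order, as it does in the example:

# uniq: sorted distinct values; first_idx: index in a of each value's first occurrence
uniq, first_idx, inverse = np.unique(a, return_index=True, return_inverse=True)
c = first_idx[inverse]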

If you only want the unique elements of a and do not care about their order in b (so b, and therefore c, will look different), it can be simplified to

b, c = np.unique(a, return_inverse=True)
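
To make the difference concrete, here is what this produces for the sample a. np.unique returns the distinct values in sorted order, so b is no longer ['john', 'bill', 'greg'] and c indexes into the sorted b instead:

```python
import numpy as np

a = np.array(['john', 'bill', 'greg', 'bill', 'bill', 'greg', 'bill'])

b, c = np.unique(a, return_inverse=True)
# b -> ['bill', 'greg', 'john']   (sorted unique values)
# c -> [2, 0, 1, 0, 0, 1, 0]      (index of each entry of a in that b)

# Indexing b with c reconstructs the original array.
roundtrip_ok = bool((b[c] == a).all())
```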

Since the array b contains unique elements, each element of a can be equal to at most one element of b. If all elements of a are definitely in b, then

import numpy as np
indices = np.where(a[:, np.newaxis] == b)[1]

will do the trick. If you are not sure whether all elements of a are in b, then

in_b, indices = np.where(a[:, np.newaxis] == b)

will put into in_b the positions in a of the elements that are contained in b, and into indices the corresponding positions in b.
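
Those two arrays can then be combined into a result with the same length as a. A small sketch, where the extra name 'anna' and the -1 marker for missing entries are assumptions made up for the example:

```python
import numpy as np

a = np.array(['john', 'bill', 'anna', 'greg'])   # 'anna' is not in b
b = np.array(['john', 'bill', 'greg'])

mask = a[:, np.newaxis] == b       # shape (len(a), len(b)), one row per entry of a
in_b, indices = np.where(mask)     # in_b: positions in a; indices: positions in b

c = np.full(a.shape, -1)           # -1 marks entries of a that are not in b
c[in_b] = indices
# c -> [0, 1, -1, 2]
```

Keep in mind that the comparison table has len(a) * len(b) entries, so this only makes sense for fairly small arrays.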

Licensed under: CC-BY-SA with attribution