Question

I have three rather large NumPy arrays with varying numbers of rows, whose first columns are all integers. I want to filter these arrays so that the only rows left are those whose first-column value appears in all three arrays. This would leave three arrays of the same size. The entries in the other columns are not necessarily shared across arrays.

So, with input:

A = 
[[1, 1],
[2, 2],
[3, 3],]

B = 
[[2, 1],
[3, 2],
[4, 3],
[5, 4]]

C = 
[[2, 2],
[3, 1],
[5, 2]]

I hope to get back as output:

A = 
[[2, 2],
[3, 3]]


B = 
[[2, 1],
[3, 2]]

C = 
[[2, 2],
[3, 1]]

My current approach is to:

  1. Find the intersection of the three first columns using numpy.intersect1d()

  2. Use numpy.in1d() on this intersection and the first column of each array to find the indices of the rows that are not shared (converting the boolean mask to indices with a modified version of the method from "Python: intersection indices numpy array").

  3. Finally, use numpy.delete() with each set of indices on its respective array to remove the rows whose first-column entry is not shared.
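For reference, the three steps above can be sketched roughly as follows. This is only an illustration of the approach as described, not the asker's actual code: the helper name drop_unshared is hypothetical, and np.isin is used in place of np.in1d (it is the newer spelling of the same membership test).

```python
import numpy as np

A = np.array([[1, 1], [2, 2], [3, 3]])
B = np.array([[2, 1], [3, 2], [4, 3], [5, 4]])
C = np.array([[2, 2], [3, 1], [5, 2]])

# Step 1: keys present in the first column of all three arrays
common = np.intersect1d(np.intersect1d(A[:, 0], B[:, 0]), C[:, 0])

def drop_unshared(M, common):
    # Step 2: indices of rows whose key is NOT in the intersection
    not_shared = np.flatnonzero(~np.isin(M[:, 0], common))
    # Step 3: delete those rows
    return np.delete(M, not_shared, axis=0)

A_f = drop_unshared(A, common)
B_f = drop_unshared(B, common)
C_f = drop_unshared(C, common)
```

Note that the index conversion and np.delete are not strictly needed: indexing with the boolean mask directly, M[np.isin(M[:, 0], common)], selects the same rows.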

I'm wondering whether there is a faster or more elegantly Pythonic way to go about this, one that is suited to very large arrays.


Solution

Your indices in your example are sorted and unique. Assuming this is no coincidence (and this situation often arises, or can easily be enforced), the following works:

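import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

A = np.array(
[[1, 1],
[2, 2],
[3, 3],])

B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])

C = np.array(
[[2, 2],
[3, 1],
[5, 2],])

# Keys present in the first column of all three arrays
I = reduce(
    lambda l, r: np.intersect1d(l, r, assume_unique=True),
    (i[:, 0] for i in (A, B, C)))

# The first columns are sorted, so searchsorted returns the row of each key
print(A[np.searchsorted(A[:, 0], I)])
print(B[np.searchsorted(B[:, 0], I)])
print(C[np.searchsorted(C[:, 0], I)])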

and in case the first column is not in sorted order (but is still unique):

C = np.array(
[[9, 2],
[1,6],
[5, 1],
[2, 5],
[3, 2],])

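def index_by_first_column_entry(M, keys):
    colkeys = M[:,0]
    # permutation that sorts the keys
    sorter = np.argsort(colkeys)
    # position of each requested key in the sorted order
    index = np.searchsorted(colkeys, keys, sorter = sorter)
    # map back to row indices of the original array
    return M[sorter[index]]

print(index_by_first_column_entry(C, I))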

and make sure to change assume_unique from True to False in

I = reduce(
    lambda l, r: np.intersect1d(l, r, assume_unique=False),
    (i[:, 0] for i in (A, B, C)))

A generalization to duplicate values can be made using np.unique.
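One possible reading of that generalization, as a rough sketch: np.unique with return_index=True gives the sorted unique keys together with the row index of each key's first occurrence, which can then be looked up with searchsorted as before. The function name first_rows_for_keys and the array D are made up for illustration, and keeping only the first row per key is an assumption about what is wanted; every requested key must actually be present in M.

```python
import numpy as np

def first_rows_for_keys(M, keys):
    # Sorted unique keys, plus the row index of each key's first occurrence
    uniq, first = np.unique(M[:, 0], return_index=True)
    # Position of each requested key among the unique keys
    # (assumes every key in `keys` occurs in M)
    pos = np.searchsorted(uniq, keys)
    return M[first[pos]]

# Unsorted first column with a duplicated key (3)
D = np.array([[3, 0], [2, 1], [3, 2], [7, 3]])
rows = first_rows_for_keys(D, np.array([2, 3]))
print(rows)
```

Because exactly one row is returned per key, the filtered arrays still end up the same size, at the cost of discarding the later duplicates.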

Other suggestions

One way to do this is to build an indicator array, or a hash table if you like, to indicate which integers are in all your input arrays. Then you can use boolean indexing based on this indicator array to get the subarrays. Something like this:

import numpy as np

# Setup
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])

B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])

C = np.array(
[[2, 2],
[3, 1],
[5, 2],])


def take_overlap(*arrays):
    # `arrays` rather than `input`, to avoid shadowing the builtin
    n = len(arrays)
    max_index = max(array[:, 0].max() for array in arrays)
    indicator = np.zeros(max_index + 1, dtype=int)
    for array in arrays:
        # Repeated keys within one array are only counted once here
        indicator[array[:, 0]] += 1
    # True for integers that occur in every input array
    indicator = indicator == n

    result = []
    for array in arrays:
        # Look up each integer in the indicator array
        mask = indicator[array[:, 0]]
        # Use boolean indexing to get the sub array
        result.append(array[mask])

    return result

subA, subB, subC = take_overlap(A, B, C)

This should be quite fast, and it does not assume that the elements of the input arrays are unique or sorted (it does assume the integers are non-negative, since they are used as indices). However, it could take a lot of memory, and might be a bit slower, if the indexing integers are sparse, e.g. [1, 10, 10000]; it should be close to optimal if the integers are more or less dense.
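If the integers are sparse, a variant that avoids allocating the dense indicator array is to compute the intersection explicitly and build the mask with np.isin. This is a sketch of an alternative, not part of the method above; the name take_overlap_sparse and the arrays P and Q are hypothetical.

```python
import numpy as np
from functools import reduce

def take_overlap_sparse(*arrays):
    # Sorted unique keys that occur in the first column of every array
    common = reduce(np.intersect1d, (a[:, 0] for a in arrays))
    # Boolean mask per array; memory scales with the number of rows,
    # not with the largest key
    return [a[np.isin(a[:, 0], common)] for a in arrays]

# Sparse, unsorted keys, with a duplicate (10) in P
P = np.array([[1, 0], [10, 1], [10000, 2], [10, 3]])
Q = np.array([[10000, 5], [10, 6], [7, 7]])
subP, subQ = take_overlap_sparse(P, Q)
```

As with the indicator approach, rows with duplicated keys are all kept, so the outputs are only guaranteed to have equal sizes when the keys within each array are unique.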

This works but I'm not sure if it is faster than any of the other answers:

import numpy as np

A = np.array(
[[1, 1],
[2, 2],
[3, 3],])

B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])

C = np.array(
[[2, 2],
[3, 1],
[5, 2],])

a = A[:,0]
b = B[:,0]
c = C[:,0]

ab = np.where(a[:, np.newaxis] == b[np.newaxis, :])
bc = np.where(b[:, np.newaxis] == c[np.newaxis, :])

ab_in_bc = np.in1d(ab[1], bc[0])
bc_in_ab = np.in1d(bc[0], ab[1])

arows = ab[0][ab_in_bc]
brows = ab[1][ab_in_bc]
crows = bc[1][bc_in_ab]

anew = A[arows, :]
bnew = B[brows, :]
cnew = C[crows, :]

print(anew)
print(bnew)
print(cnew)

gives:

[[2 2]
 [3 3]]
[[2 1]
 [3 2]]
[[2 2]
 [3 1]]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow