سؤال

I am using Python (and have access to pandas, numpy, scipy).

I have two sets strings set A and set B. Each set A and B contains c. 2000 elements (each element being a string). The strings are around 50-100 characters long comprising up to c. 20 words (these sets may get much larger).

I wish to check if an member of set A is also a member of set B.

Now I am thinking a naive implementation can be visualised as a matrix where members in A and B are compared to one another (e.g. A1 == B1, A1 == B2, A1 == B3 and so on...) and the booleans (0, 1) from the comparison comprise the elements of the matrix.

What is the best way to implement this efficiently?

Two further elaborations:

(i) I am also thinking that for larger sets I may use a Bloom Filter (e.g. using PyBloom, pybloomfilter) to hash each string (i.e. I dont mind fasle positives so much...). Is this a good approach or are there other strategies I should consider?

(ii) I am thinking of including a Levenshtein distance match between strings (which I know can be slow) as I may need fuzzy matches - is there a way of combining this with the approach in (i) or otherwise making it more efficient?

Thanks in advance for any help!

هل كانت مفيدة؟

المحلول

Firstly, 2000 * 100 chars is'nt that big, you could use a set directly.

Secondly, if your strings are sorted, there is a quick way (which I found here)to compare them, as follows:

def compare(E1, E2):
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1

It is certainly slower than using a set, but it doesn't need the strings to be hold into memory (only two are needed at the same time).

For the Levenshtein thing, there is a C module which you can find on Pypi, and which is quite fast.

نصائح أخرى

As mentioned in the comments:

def compare(A, B):
    return list(set(A).intersection(B))

This is a modified version of the function that @michaelmeyer presented here https://stackoverflow.com/a/17264117/362951 - in his answer to the question on top of the page we are on.

The modified version below works also on unsorted data, because the function now includes the sorting.

This should not be a performance or resource problem in many cases, because python sorting is very effective. And presorting also helps.

Please note that the 'output' is now in sorted order too. This will differ from the original order of the first parameter, if it was unsorted.

Otherwise the sorting won't hurt much, even if both data sets are already sorted.

But if you want to suppress the sorting, in case both data sets are known to be sorted in ascending order already, call it like this:

compare(my_data1,my_data2,data_is_sorted=True)

Otherwise:

compare(my_data1,my_data2)

and the function accepts unordered data.

This is the modified version. Only the first two lines were added and a third optional parameter:

def compare(E1, E2, data_is_sorted=False):
    if not data_is_sorted:
        E1=sorted(E1)
        E2=sorted(E2)
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top