Best way to compare two large sets of strings in Python

Question 1

Firstly, 2000 * 100 chars isn't that big; you could use a set directly.
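
As a rough sketch of the set approach (the function name `split_by_membership` and its parameters are made up for illustration, not taken from the question), you build a set from one collection once and then test every string of the other against it:

```python
def split_by_membership(candidates, reference):
    """Split `candidates` into strings found in `reference` and strings that are not.

    Hypothetical helper: the name and signature are illustrative only.
    """
    reference_set = set(reference)  # one pass; lookups are O(1) on average afterwards
    present = [s for s in candidates if s in reference_set]
    missing = [s for s in candidates if s not in reference_set]
    return present, missing


present, missing = split_by_membership(["a", "b", "c"], ["b", "d"])
print(present, missing)
```

Both result lists keep the order of `candidates`, which a plain set difference would not.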

Secondly, if your strings are sorted, there is a quick way (which I found here) to compare them, as follows:

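def compare(E1, E2):
    """Report, for each string in E1, whether it is in E2 (both sorted ascending)."""
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            # E2 is exhausted or already past E1[i], so E1[i] cannot be in E2.
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            # E2[j] sorts before E1[i]: skip it.
            j += 1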

It is certainly slower than using a set, but it doesn't need all the strings to be held in memory (only two are needed at the same time).
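
To actually get that memory benefit, the inputs have to be consumed lazily rather than indexed as lists. Here is a minimal sketch of the same merge idea over arbitrary iterables; the name `compare_sorted` and the choice to yield `(item, found)` pairs instead of printing are my own assumptions. Both inputs must already be sorted in ascending order.

```python
def compare_sorted(left, right):
    """Yield (item, found) for every item of `left`.

    Both iterables must be sorted ascending. Only one item from each side
    is held at a time. Hypothetical helper, not part of the original answer.
    """
    right_iter = iter(right)
    sentinel = object()  # marks the end of `right`
    current = next(right_iter, sentinel)
    for item in left:
        # Skip right-hand items that sort before the current left-hand item.
        while current is not sentinel and current < item:
            current = next(right_iter, sentinel)
        yield item, (current is not sentinel and current == item)


for item, found in compare_sorted(["a", "c", "d"], ["b", "c", "e"]):
    print(item, "is in E2" if found else "is not in E2")
```

Since it only calls `iter()` and `next()`, it should also work with generators that read sorted lines from two files, as long as the lines are stripped consistently.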

For the Levenshtein part, there is a C module, which you can find on PyPI, that is quite fast.
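
If you just want to see what the distance computes, or cannot install a C extension, here is a pure-Python sketch of the classic dynamic-programming algorithm. It is not the C module's code and will be much slower; the function name `levenshtein` is my own choice.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (two-row DP sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to use less memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```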

Question 2

As mentioned in the comments:

def compare(A, B):
    return list(set(A).intersection(B))
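
Note that the list built from a set intersection comes back in arbitrary order. If you need the common items in the order they appear in A, a small variation could look like this (the name `common_in_order` is hypothetical, and it assumes the items are hashable, which strings are):

```python
def common_in_order(A, B):
    """Return items of A that also occur in B, in A's order, without repeats."""
    lookup = set(B)
    seen = set()
    result = []
    for item in A:
        if item in lookup and item not in seen:
            seen.add(item)  # report each common item only once
            result.append(item)
    return result


print(common_in_order(["b", "a", "c", "a"], ["a", "b", "x"]))
```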

Question 3

This is a modified version of the function that @michaelmeyer presented in his answer to this question: https://stackoverflow.com/a/17264117/362951

The modified version below also works on unsorted data, because the function now does the sorting itself.

In many cases this should not be a performance or resource problem, because Python's sorting is very efficient, and it is faster still when the data is already partly sorted.

Please note that the output is now in sorted order too. If the first parameter was unsorted, this differs from its original order.

Otherwise the sorting won't hurt much, even if both data sets are already sorted.

But if both data sets are known to be sorted in ascending order already, you can suppress the sorting by calling it like this:

compare(my_data1, my_data2, data_is_sorted=True)

Otherwise:

compare(my_data1, my_data2)

and the function accepts unordered data.

This is the modified version. Only the sorting step at the top and a third, optional parameter were added:

def compare(E1, E2, data_is_sorted=False):
    if not data_is_sorted:
        E1 = sorted(E1)
        E2 = sorted(E2)
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
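
If you would rather get the results back than have them printed, a variant along these lines might help. This is a sketch only: `compare_collect` and its return shape are my own invention, and the loop is otherwise the same as above. The example call also shows the sorted output order mentioned earlier.

```python
def compare_collect(E1, E2, data_is_sorted=False):
    """Return (found, missing): items of E1 that are / are not in E2, in sorted order."""
    if not data_is_sorted:
        E1 = sorted(E1)
        E2 = sorted(E2)
    found, missing = [], []
    i, j = 0, 0
    while i < len(E1):
        if j >= len(E2) or E1[i] < E2[j]:
            missing.append(E1[i])
            i += 1
        elif E1[i] == E2[j]:
            found.append(E1[i])
            i, j = i + 1, j + 1
        else:
            j += 1
    return found, missing


found, missing = compare_collect(["pear", "apple", "fig"], ["fig", "kiwi", "apple"])
print(found)    # ['apple', 'fig'], sorted rather than in the original order
print(missing)  # ['pear']
```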