Question

I have a file containing space separated numbers in each line. Each line corresponds to a list of numbers.
Now there are around 300,000 such lines (each line containing around 100 numbers on an average).
I want to find mutual intersection of all such lists i.e. first list intersected with all other lists, then second list intersected with all other lists and so on.
I am using

set(a) & set(b)  

where a and b are lists I get iterating in a double loop.
But this is taking too much time. For ex : for first list intersected with all other lists, it took about 3 minutes.
How can I do this efficiently? (may be with some other language/tool)

Was it helpful?

Solution

You should use generator expressions here, they do lazy evaluation and save a lot of memory:

In [46]: from itertools import imap

In [47]: a = [[1,2,3], [2,3,4], [3,4,5]]

In [48]: reduce(set.intersection,imap(set,a))
Out[48]: set([3])

considering your file looks like this:

1 2 3
2 3 4
3 4 5

code: Using itertools.combinations():

with open("abc.txt") as f:
    lines=(map(int,x.split()) for x in f)
    for x in combinations(lines,2):
        print x,'-->',reduce(set.intersection,imap(set,x))
   ....:         
([1, 2, 3], [2, 3, 4]) --> set([2, 3])
([1, 2, 3], [3, 4, 5]) --> set([3])
([2, 3, 4], [3, 4, 5]) --> set([3, 4])

OTHER TIPS

First idea to come would be to first build all the sets once, if it all fit in memory, then intersect them.

If you really need all the intersections of 300000 lines with 300000 lines, it will take time anyway. Maybe you should rethink your problem.

I think you can optimize this by creating an inverted index, that is, a mapping number => list of rows that contain this number. For example, if 10 occurs on rows 5, 100, 200 then you'll have

10: [5, 100, 200]

To further optimize this, you can store the rowlist as a set of pairs:

10: set( (5,100), (5,200), (100,200) )

Then, to compute an intersection of list_a + list_b, just find all numbers whose associated rowlist contains (list_a, list_b).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top