Question

So here is my problem. I have a very large CSV file with 3 columns. The first column is unique ids. The second column is a string that is an English sentence. The third column is a string of word tags that describe the sentence in the second column (usually 3 tags, max of 5). Here is an example.

id | sentence                       | tags
1  | "people walk dogs in the park" | "pet park health"
2  | "I am allergic to dogs"        | "allergies health"

What I want to do is find all of the co-occurrences of tag words with words in sentences. So the desired output for the above example would look something like:

("walk","pet"),1
("health","dogs"),2
("allergies","dogs"),1
etc...

where the first entry is a word pair (the first word from the sentence, the second a tag word) and the second entry is the number of times they co-occur.

I am wondering what the best way to do this is. I was thinking perhaps I could build a Python dictionary where the key is a tag word and the value is the set of ids where that tag word appears. I could do the same for all of the words that appear in all sentences (after removing stop-words). Then, for every combination of the two words, I could count the number of ids in the intersection of both sets, which would give me the number of times they co-occur.
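To make that concrete, here is a rough sketch of what I had in mind, assuming the file is pipe-delimited like the example above and using a made-up stop-word list:

import csv
from collections import defaultdict

stop_words = {"the", "in", "to", "am", "i"}  # hypothetical stop-word list

tag_ids, word_ids = defaultdict(set), defaultdict(set)
with open("data.csv", newline="") as f:
    rdr = csv.reader(f, quotechar='"', delimiter='|', skipinitialspace=True)
    next(rdr)  # skip the header row
    for row_id, sent, tags in rdr:
        for w in sent.lower().split():
            if w not in stop_words:
                word_ids[w].add(row_id)
        for t in tags.split():
            tag_ids[t].add(row_id)

# a pair's co-occurrence count is the size of the id-set intersection
counts = {}
for w, wids in word_ids.items():
    for t, tids in tag_ids.items():
        n = len(wids & tids)
        if n:
            counts[(w, t)] = n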

However, this seems like it would take a very long time (huge CSV file!), and I might also run out of memory. Can anyone think of a better way to do this? Maybe import the file into a database and run some sort of query?


Solution

I think it's easy with itertools.product() and collections.Counter():

import csv
from itertools import product
from collections import Counter

with open("data.csv", newline="") as f:
    rdr = csv.reader(f, quotechar='"', delimiter='|', skipinitialspace=True)
    next(rdr)  # skip the header row
    c = Counter((x, y) for _, sent, tags in rdr
                for x, y in product(sent.split(), tags.split()))
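You can then pull the most frequent pairs straight from the counter:

# the ten most common (sentence word, tag word) pairs
for (word, tag), n in c.most_common(10):
    print((word, tag), n)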

As for processing the huge file, I think you can try some kind of map-reduce - read the CSV line by line and save all the combinations into another file:

with open(r"data.csv") as r, open(r"data1.csv", "w") as w:
    rdr = csv.reader(r, quotechar='"', delimiter='|')
    for _, a, b in rdr:
        for x, y in product(a.split(), b.split()):
            w.write("{},{}\n".format(x, y))

The next step would be to read the second file and build the counter:

with open(r"c:\temp\data1.csv") as r:
    for l in r:
        c[l.rstrip('\n')] += 1
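Since you also mentioned the database route: here is a minimal sqlite3 sketch (file and table names are just placeholders) that loads the intermediate data1.csv pair file and lets SQL do the grouping, so nothing large has to stay in memory:

import sqlite3

conn = sqlite3.connect("pairs.db")
conn.execute("CREATE TABLE IF NOT EXISTS pairs (word TEXT, tag TEXT)")
with open("data1.csv") as f:
    conn.executemany("INSERT INTO pairs VALUES (?, ?)",
                     (line.rstrip("\n").split(",", 1) for line in f))
conn.commit()

# let the database do the aggregation instead of a Python counter
for word, tag, n in conn.execute(
        "SELECT word, tag, COUNT(*) FROM pairs GROUP BY word, tag"):
    print((word, tag), n)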

Update: I've started looking into whether there is any map-reduce framework for Python. Here's the first link from googling - the Disco map-reduce framework. It actually has a tutorial which shows how to create and run a Disco job that counts words - I think it could be useful for you (at least I will go and give it a try :) ). And another one - https://github.com/michaelfairley/mincemeatpy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow