Use dictionaries. As a first step, read `Names_Text_full_clean.csv` only once and store the result in a dictionary indexed by `match[0]`. Because the same `match[0]` may occur several times, the value needs to be a list of all the possibly multiple `match` rows.
```python
import collections
import csv  # this import was missing

by_sender = collections.defaultdict(list)

# 'rU' mode is deprecated; for the csv module, open with newline=''.
with open('Names_Text_full_clean.csv', newline='') as file_read:
    reader = csv.reader(file_read)
    for match in reader:
        by_sender[match[0]].append(match)
```
Then, in the nested loops, you can replace

```python
for match in reader:
    if sender[1] == match[0]:
```

with the following loop, which iterates over a collection that is probably hundreds of times smaller:

```python
for match in by_sender[sender[1]]:
```
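To make the speedup concrete, here is a minimal sketch with made-up rows (in your program the rows come from `csv.reader`; the `rows` and `sender` values below are assumptions for illustration):

```python
import collections

# Hypothetical rows, as they would come out of csv.reader.
rows = [
    ['alice', 'foo', 'bar'],
    ['bob', 'baz', 'qux'],
    ['alice', 'spam', 'eggs'],
]

# Index the rows once, by their first column.
by_sender = collections.defaultdict(list)
for match in rows:
    by_sender[match[0]].append(match)

# Instead of rescanning all rows for each sender,
# look up only the rows that can possibly match.
sender = ['ignored', 'alice']
matches = by_sender[sender[1]]
print(matches)  # only alice's two rows
```

The lookup cost no longer depends on the total number of rows, only on how many rows share that key.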
A further optimization would be to store `set(match[1:])` in the list `by_sender[match[0]]` instead of `match` itself. Indeed, you are only going to check whether a particular entry (`word[1]` in this case) is equal to any one of the items in `match[1:]`. Instead of looping to figure this out, the check becomes just `word[1] in my_set`, which runs in constant time on average.
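The set variant can be sketched like this (again with made-up rows; only the stored value and the membership test change):

```python
import collections

# Hypothetical rows for illustration.
rows = [
    ['alice', 'foo', 'bar'],
    ['alice', 'spam', 'eggs'],
]

# Store a set of the remaining columns instead of the whole row.
by_sender = collections.defaultdict(list)
for match in rows:
    by_sender[match[0]].append(set(match[1:]))

# The membership test is a hash lookup, not a loop over match[1:].
word = ['ignored', 'spam']
found = any(word[1] in my_set for my_set in by_sender['alice'])
print(found)  # True
```

Note that this trades away the row order and any duplicates inside `match[1:]`, which is fine here because only membership matters.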
This is probably enough, but the "final goal" would be to read all three files only once: store the contents of two of the files in suitable dictionaries, and do only dictionary lookups (or set lookups, which are very fast too) while walking over the third file.