Question

I've seen a number of questions about writing histograms as clean one-liners, but I haven't yet found anyone trying to make them as fast as possible. I'm currently creating a lot of tf-idf vectors for a search algorithm, which involves building many histograms, and my current code, while short and readable, is not as fast as I would like. Sadly, everything else I've tried turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is a list of words that should contain every word in cleanStringVector.

from collections import Counter
def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector

It's worth noting that a Counter returns zero for non-existent keys instead of raising a KeyError; that is a serious plus, and most of the histogram methods in other questions fail this test.
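A quick demonstration of that property (the words and counts here are just for illustration):

```python
from collections import Counter

counts = Counter(["apple", "apple", "orange"])
print(counts["apple"])   # 2
print(counts["banana"])  # 0: a missing key returns 0 instead of raising KeyError
```

A plain dict would raise KeyError on the second lookup, which is exactly what breaks many of the one-liner histogram recipes.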

Example: If I have the following data:

["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]

And a master wordlist of:

["apple", "orange", "tomato", "cucumber"]

I would like a return of the following from each test case respectively:

[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]

I hope that helps.
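For reference, the sample data above can be turned into a small self-check that runs the posted function against all five test cases:

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

master = ["apple", "orange", "tomato", "cucumber"]
cases = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]
expected = [[3, 1, 1, 0], [0, 1, 2, 0], [3, 0, 0, 1], [2, 2, 2, 0], [0, 2, 1, 2]]

for vec, exp in zip(cases, expected):
    assert tfidfVector(vec, master) == exp
```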

Approximate final benchmark results:

Original method: 3.213
OrderedDict: 5.529
Unordered dict: 0.190

Solution

This improves the runtime in my (unrepresentative) micro-benchmark by an order of magnitude under Python 3:

# Build the word -> index mapping once, outside the per-document function.
mapping = dict((w, i) for i, w in enumerate(masterWordList))

def tfidfVector(cleanStringVector, masterWordList):
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        featureVector[mapping[w]] += 1
    return featureVector
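As a sketch of how this is used end to end, with the mapping built from the question's sample master word list:

```python
masterWordList = ["apple", "orange", "tomato", "cucumber"]

# The mapping replaces per-call lookups over the whole master word list
# with one index lookup per word actually present in the document.
mapping = {w: i for i, w in enumerate(masterWordList)}

def tfidfVector(cleanStringVector, masterWordList):
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        featureVector[mapping[w]] += 1
    return featureVector

print(tfidfVector(["apple", "orange", "tomato", "apple", "apple"], masterWordList))
# [3, 1, 1, 0]
```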

OTHER TIPS

I think looping over the master word list is the problem. Each time you build a histogram you have to hash every word in the master word list, and most of those lookups miss, which is a computationally expensive way to return a 0.

I would hash the master word list first, then use that hash to create each histogram. This way you only need to hash every word in the string vector (twice: once to get the counts, and once to reset the master word list hash). If the string vectors are smaller than the master word list, this results in many fewer hashing operations:

from itertools import repeat

stringvecs=[["apple", "orange", "tomato", "apple", "apple"],
["tomato", "tomato", "orange"],
["apple", "apple", "apple", "cucumber"],
["tomato", "orange", "apple", "apple", "tomato", "orange"],
["orange", "cucumber", "orange", "cucumber", "tomato"]]

m=["apple", "orange", "tomato", "cucumber"]

md = dict(zip(m, repeat(0)))

def tfidfVector(stringvec, md):
    for item in stringvec:
        md[item] += 1
    # Copy the values: in Python 3, md.values() is a live view that
    # would be zeroed by the reset loop below.
    out = list(md.values())
    for item in stringvec:
        md[item] = 0
    return out

for stringvec in stringvecs:
    print(tfidfVector(stringvec, md))

Note: the order of md.values() should be stable as long as we aren't adding keys; in Python 3.7+, dicts preserve insertion order, so the counts come out in master-word-list order. Be aware that in Python 3, values() returns a live view, so it must be copied with list() before md is reset.
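To see why the copy matters in Python 3, here is a toy dict showing that the values() view tracks later mutations while a list() snapshot does not:

```python
md = {"apple": 0, "orange": 0}
md["apple"] += 2

view = md.values()           # live view into the dict
snapshot = list(md.values()) # independent copy

md["apple"] = 0              # resetting the count mutates the live view

print(list(view))  # [0, 0]
print(snapshot)    # [2, 0]
```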

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow