Question

I've seen a number of questions about writing histograms as clean one-liners, but I haven't yet found anyone trying to make them as fast as possible. I'm currently creating a lot of tf-idf vectors for a search algorithm, which involves building many histograms, and my current code, while short and readable, is not as fast as I would like. Sadly, everything else I've tried turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is a list of words that should contain every word in cleanStringVector.

from collections import Counter
def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector

It's worth noting that a Counter returns zero for non-existent keys instead of raising a KeyError; that is a serious plus, and most of the histogram methods in other questions fail this test.
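A quick demonstration of that property (the words and counts here are just for illustration):

```python
from collections import Counter

counts = Counter(["apple", "apple", "orange"])
print(counts["apple"])   # 2
print(counts["banana"])  # 0: a missing key returns 0 instead of raising KeyError
```

A plain dict would raise KeyError on the second lookup, which is exactly what breaks many of the one-liner histogram recipes.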

Example: If I have the following data:

["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]

And a master wordlist of:

["apple", "orange", "tomato", "cucumber"]

I would like a return of the following from each test case respectively:

[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]

I hope that helps.
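For reference, the sample data above can be turned into a small self-check that runs the posted function against all five test cases:

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

master = ["apple", "orange", "tomato", "cucumber"]
cases = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]
expected = [[3, 1, 1, 0], [0, 1, 2, 0], [3, 0, 0, 1], [2, 2, 2, 0], [0, 2, 1, 2]]

for vec, exp in zip(cases, expected):
    assert tfidfVector(vec, master) == exp
```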

Approximate final benchmark results:

Original method: 3.213
OrderedDict: 5.529
Unordered dict: 0.190

Solution

This improves the runtime in my (unrepresentative) micro-benchmark by an order of magnitude under Python 3:

# Build the word -> index mapping once, outside the per-document function.
mapping = dict((w, i) for i, w in enumerate(masterWordList))

def tfidfVector(cleanStringVector, masterWordList):
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        featureVector[mapping[w]] += 1
    return featureVector
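As a sketch of how this is used end to end, with the mapping built from the question's sample master word list:

```python
masterWordList = ["apple", "orange", "tomato", "cucumber"]

# The mapping replaces per-call lookups over the whole master word list
# with one index lookup per word actually present in the document.
mapping = {w: i for i, w in enumerate(masterWordList)}

def tfidfVector(cleanStringVector, masterWordList):
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        featureVector[mapping[w]] += 1
    return featureVector

print(tfidfVector(["apple", "orange", "tomato", "apple", "apple"], masterWordList))
# [3, 1, 1, 0]
```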

OTHER TIPS

I think looping over the master word list is the problem. Each time you build a histogram you have to hash every word in the master word list, and most of those lookups miss, which is a computationally expensive way to return a 0.

I would hash the master word list first, then use that hash to create each histogram. This way you only need to hash every word in the string vector (twice: once to get the counts, and once to reset the master word list hash). If the string vectors are smaller than the master word list, this results in many fewer hashing operations:

from itertools import repeat

stringvecs=[["apple", "orange", "tomato", "apple", "apple"],
["tomato", "tomato", "orange"],
["apple", "apple", "apple", "cucumber"],
["tomato", "orange", "apple", "apple", "tomato", "orange"],
["orange", "cucumber", "orange", "cucumber", "tomato"]]

m=["apple", "orange", "tomato", "cucumber"]

md = dict(zip(m, repeat(0)))

def tfidfVector(stringvec, md):
    for item in stringvec:
        md[item] += 1
    # Copy the values: in Python 3, md.values() is a live view that
    # would be zeroed by the reset loop below.
    out = list(md.values())
    for item in stringvec:
        md[item] = 0
    return out

for stringvec in stringvecs:
    print(tfidfVector(stringvec, md))

Note: the order of md.values() should be stable as long as we aren't adding keys; in Python 3.7+, dicts preserve insertion order, so the counts come out in master-word-list order. Be aware that in Python 3, values() returns a live view, so it must be copied with list() before md is reset.
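To see why the copy matters in Python 3, here is a toy dict showing that the values() view tracks later mutations while a list() snapshot does not:

```python
md = {"apple": 0, "orange": 0}
md["apple"] += 2

view = md.values()           # live view into the dict
snapshot = list(md.values()) # independent copy

md["apple"] = 0              # resetting the count mutates the live view

print(list(view))  # [0, 0]
print(snapshot)    # [2, 0]
```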

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow