Question

Assuming I have a dict like this:

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

And what I want to do is like calculating "Document Frequency": assuming each dictionary item is a document, and I have a specific word, so how many documents contain that word?

I've seen many posts telling me how to calculate frequency, but here if "a" appears twice in document "alpha", I just need the count to be 1. So the "frequency" of "a" should be 1, and "c" should be 2.

I know I can iterate the whole documents dictionary, and add the counter when finding the word in a document. Or I can firstly make the words in every document unique, then combine all the documents and count the word.

But I think there's a better way, a more effective way. Any ideas?

BTW, is there any way I can keep the structure of the dict? In this example, I'd like to get a result of {"alpha": {'c': 2, 'b': 2, 'a': 1}, "bravo": {'c': 2, 'b': 2, 'd': 1}

Update

If here I have just a list (something like [["a", "b", "c", "a", "b"], ["b", "c", "d", "c", "d"]]), how can I get a result list like [[1, 2, 2, 0], [0, 2, 2, 1]].

I've got no idea. The point is to expand each list and assure the order of the terms. Thoughts?

Was it helpful?

Solution

I'd go with your second way using collections.Counter and set.

>>> from collections import Counter
>>> sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
Counter({'c': 2, 'b': 2, 'a': 1, 'd': 1})

Update 1:

>>> c = sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
>>> {k: {k1:c[k1] for k1 in set(v)} for k, v in docDict.iteritems()}
{'alpha': {'a': 1, 'c': 2, 'b': 2}, 'bravo': {'c': 2, 'b': 2, 'd': 1}}

update 2::

If performance is an concern then don't use Counter with sum, here another way to do it. Note that unlike @user2931409 answer I am not keeping a set of words in memory just to get their length, so this is much more memory efficient but slightly slower than their answer.

result = Counter()
for v in docDict.itervalues():
    result.update(set(v))
return result

Timing comparison:

def func1():
    #http://stackoverflow.com/a/22787509/846892
    result = defaultdict(set)
    for k, vlist in docDict.items():
        for v in vlist:
            result[v].add(k)
    return dict(zip(result.keys(), map(lambda x:len(x), result.values())))

def func2():

    result = Counter()
    for v in docDict.itervalues():
        result.update(set(v))
    return result

In [94]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(70000)}

In [95]: %timeit func1(docDict)
1 loops, best of 3: 380 ms per loop

In [96]: %timeit func2(docDict)
1 loops, best of 3: 591 ms per loop

In [97]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(10**5)}

In [98]: %timeit func1(docDict)
1 loops, best of 3: 529 ms per loop

In [99]: %timeit func2(docDict)
1 loops, best of 3: 848 ms per loop

In [101]: func1(docDict) == func2(docDict)
Out[101]: True

OTHER TIPS

This is not special one, very ordinary way.

from collections import defaultdict

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
result = defaultdict(set)

for k, vlist in docDict.items():
    for v in vlist:
        result[v].add(k)

#Now the result looks like this.
#{'a': set(['alpha']), 'c': set(['alpha', 'bravo']), 'b': set(['alpha', 'bravo']), 'd': set(['bravo'])})

print dict(zip(result.keys(), map(lambda x:len(x), result.values())))
#{'a': 1, 'c': 2, 'b': 2, 'd': 1}

update

Another way... just counting. And changed to use iterator. So it's more faster than above code.

from collections import defaultdict
def func3(docDict):
    result = defaultdict(int)
    for vlist in docDict.itervalues():
        for i in set(vlist):
            result[i] += 1
    return dict(result)
docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
revDict = {v : sum(1 for l in docDict.values() if v in l)  
        for v in set(x for y in docDict.values() for x in y) }
print revDict

Gives:

{'a': 1, 'c': 2, 'b': 2, 'd': 1}

You can use set to unify the characters in a single document. Then simply Counter() them.

from collections import Counter

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

result = reduce(lambda x, y: x + Counter(set(y)), docDict.itervalues(), Counter([]))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top