Domanda

Assuming I have a dict like this:

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

And what I want to do is like calculating "Document Frequency": assuming each dictionary item is a document, and I have a specific word, so how many documents contain that word?

I've seen many posts telling me how to calculate frequency, but here if "a" appears twice in document "alpha", I just need the count to be 1. So the "frequency" of "a" should be 1, and "c" should be 2.

I know I can iterate the whole documents dictionary, and add the counter when finding the word in a document. Or I can firstly make the words in every document unique, then combine all the documents and count the word.

But I think there's a better way, a more effective way. Any ideas?

BTW, is there any way I can keep the structure of the dict? In this example, I'd like to get a result of {"alpha": {'c': 2, 'b': 2, 'a': 1}, "bravo": {'c': 2, 'b': 2, 'd': 1}

Update

If here I have just a list (something like [["a", "b", "c", "a", "b"], ["b", "c", "d", "c", "d"]]), how can I get a result list like [[1, 2, 2, 0], [0, 2, 2, 1]].

I've got no idea. The point is to expand each list and assure the order of the terms. Thoughts?

È stato utile?

Soluzione

I'd go with your second way using collections.Counter and set.

>>> from collections import Counter
>>> sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
Counter({'c': 2, 'b': 2, 'a': 1, 'd': 1})

Update 1:

>>> c = sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
>>> {k: {k1:c[k1] for k1 in set(v)} for k, v in docDict.iteritems()}
{'alpha': {'a': 1, 'c': 2, 'b': 2}, 'bravo': {'c': 2, 'b': 2, 'd': 1}}

update 2::

If performance is an concern then don't use Counter with sum, here another way to do it. Note that unlike @user2931409 answer I am not keeping a set of words in memory just to get their length, so this is much more memory efficient but slightly slower than their answer.

result = Counter()
for v in docDict.itervalues():
    result.update(set(v))
return result

Timing comparison:

def func1():
    #http://stackoverflow.com/a/22787509/846892
    result = defaultdict(set)
    for k, vlist in docDict.items():
        for v in vlist:
            result[v].add(k)
    return dict(zip(result.keys(), map(lambda x:len(x), result.values())))

def func2():

    result = Counter()
    for v in docDict.itervalues():
        result.update(set(v))
    return result

In [94]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(70000)}

In [95]: %timeit func1(docDict)
1 loops, best of 3: 380 ms per loop

In [96]: %timeit func2(docDict)
1 loops, best of 3: 591 ms per loop

In [97]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(10**5)}

In [98]: %timeit func1(docDict)
1 loops, best of 3: 529 ms per loop

In [99]: %timeit func2(docDict)
1 loops, best of 3: 848 ms per loop

In [101]: func1(docDict) == func2(docDict)
Out[101]: True

Altri suggerimenti

This is not special one, very ordinary way.

from collections import defaultdict

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
result = defaultdict(set)

for k, vlist in docDict.items():
    for v in vlist:
        result[v].add(k)

#Now the result looks like this.
#{'a': set(['alpha']), 'c': set(['alpha', 'bravo']), 'b': set(['alpha', 'bravo']), 'd': set(['bravo'])})

print dict(zip(result.keys(), map(lambda x:len(x), result.values())))
#{'a': 1, 'c': 2, 'b': 2, 'd': 1}

update

Another way... just counting. And changed to use iterator. So it's more faster than above code.

from collections import defaultdict
def func3(docDict):
    result = defaultdict(int)
    for vlist in docDict.itervalues():
        for i in set(vlist):
            result[i] += 1
    return dict(result)
docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
revDict = {v : sum(1 for l in docDict.values() if v in l)  
        for v in set(x for y in docDict.values() for x in y) }
print revDict

Gives:

{'a': 1, 'c': 2, 'b': 2, 'd': 1}

You can use set to unify the characters in a single document. Then simply Counter() them.

from collections import Counter

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

result = reduce(lambda x, y: x + Counter(set(y)), docDict.itervalues(), Counter([]))
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top