Question

In addition to counting the frequency of the words in a document, I'd like to count the number of distinct ids each word is associated with. It's easier to explain with an example:

from pandas import *
from collections import defaultdict
d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)

>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon

# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn in to a DataFrame
dwc = {"word": Series(wc.keys()),
      "count": Series(wc.values())}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana

Counting the frequency of words is the simple part, as shown above. What I'd like to do is obtain output like the following, which gives the number of distinct ids associated with each word:

   count        word  ids
0      2        kiwi    1
1      1  strawberry    1
2      3       lemon    2
3      3       apple    1
4      4      banana    3  

Ideally I'd like to do this at the same time as counting the word frequency, but I'm not sure how to integrate the two.

Any pointers would be much appreciated!


Solution

I'm not too experienced with pandas, but you can do something like this. This method keeps a second dict whose keys are the words and whose values are sets of all the IDs each word appears under.

wc = defaultdict(int)
idc = defaultdict(set)

for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one 
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)

After this idc looks like:

defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})

So you'll have to get the length of each set. I used this:

lenidc = dict((key, len(value)) for key, value in idc.iteritems())

After adding lenidc.values() under an "ids" key in dwc and rebuilding dfwc (a sketch of that step follows the output below), I got:

   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana
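
For concreteness, a minimal sketch of that construction step (reusing the wc and lenidc dicts built above, in the same Python 2 style as the question; the pitfall discussed next applies to it):

# naive construction: pair the word counts with the id counts column by column
dwc = {"word": Series(wc.keys()),
       "count": Series(wc.values()),
       "ids": Series(lenidc.values())}
dfwc = DataFrame(dwc)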

The pitfall of this method is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order. So you'll want to merge the dicts together to eliminate this problem. This is how I did it:

# Makes it so the values in the wc dict are a tuple in 
# (word_count, id_count) form
for key, value in lenidc.iteritems():
    wc[key] = (wc[key], value)

# Now, when you construct dwc, for count and id you only want to use
# the first and second columns respectively. 
dwc = {"word": Series(wc.keys()), 
       "count": Series([v[0] for v in wc.values()]), 
       "ids": Series([v[1] for v in wc.values()])}

OTHER TIPS

There's probably a slicker way to do this, but I'd approach it in two steps. First, flatten it, and then make a new dataframe with the information we want:

import pandas as pd

# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
new = df.ix[index]
new["words"] = s.values

# now group and build 
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})

which produces

>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
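
As a side note, .ix has since been removed from pandas. On pandas 0.25 or later, a roughly equivalent sketch of the flatten-and-group steps might use explode instead (treat this as a sketch, not a drop-in replacement):

import pandas as pd

# split each row's words into a list, then explode so each word gets its own row
new = df.assign(words=df["words"].str.split()).explode("words")

grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})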

Step-by-step. We start from the original DataFrame:

>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon

Pull apart the multi-fruit elements:

>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0  0         apple
   1        banana
   2         apple
   3    strawberry
   4        banana
   5         lemon
1  0         apple
2  0        banana
3  0        banana
   1         lemon
4  0          kiwi
5  0          kiwi
   1         lemon
dtype: object

Get the indices that align these with the original frame:

>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)

And then take the original frame from this perspective:

>>> new = df.ix[index]
>>> new["words"] = s.values
>>> new
  ID       words
0  a       apple
0  a      banana
0  a       apple
0  a  strawberry
0  a      banana
0  a       lemon
1  a       apple
2  b      banana
3  c      banana
3  c       lemon
4  c        kiwi
5  c        kiwi
5  c       lemon

This is something more like what we can work with. In my experience, half the effort is getting your data into the right format to start with. After this, it's easy:

>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
            count  ids
words                 
apple           3    1
banana          4    3
kiwi            2    1
lemon           3    2
strawberry      1    1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1

Note that we could have found this information simply by using .describe():

>>> new.groupby("words")["ID"].describe()
words             
apple       count     3
            unique    1
            top       a
            freq      3
banana      count     4
            unique    3
            top       a
            freq      2
kiwi        count     2
            unique    1
            top       c
            freq      2
lemon       count     3
            unique    2
            top       c
            freq      2
strawberry  count     1
            unique    1
            top       a
            freq      1
dtype: object

And we could alternatively have started from this and then pivoted to get your desired output.
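
For example, a hedged sketch of that pivot (assuming the Series-shaped describe() output shown above; on newer pandas, describe() on a grouped object column already returns a DataFrame, so the unstack() would be unnecessary):

# pivot the describe() output so count/unique/top/freq become columns,
# then keep only the pieces we want (counts may come back as object dtype here)
desc = new.groupby("words")["ID"].describe().unstack()
summary = (desc[["count", "unique"]]
           .rename(columns={"unique": "ids"})
           .reset_index()
           .rename(columns={"words": "word"}))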
