I'm not too experienced with pandas, but you can do something like this. Alongside the word counts (wc), this method keeps a second dict (idc) whose keys are the words and whose values are the sets of all IDs each word appeared in.
from collections import defaultdict

wc = defaultdict(int)
idc = defaultdict(set)
for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
After this idc looks like:
defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})
So you'll have to get the length of each set. I used this:
lenidc = dict((key, len(value)) for key, value in idc.items())
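To check the counting logic without a DataFrame, here is a minimal sketch of the same steps on plain lists. The `ids` and `texts` lists are made-up sample data standing in for df.ID and df.words, not the data from the question.

```python
from collections import defaultdict

# Hypothetical sample data standing in for df.ID and df.words.
ids = ["a", "b", "c"]
texts = ["apple banana apple", "banana", "banana kiwi kiwi"]

wc = defaultdict(int)    # word -> total number of occurrences
idc = defaultdict(set)   # word -> set of IDs the word appeared in

for ID, words in zip(ids, texts):
    for word in words.split():
        wc[word] += 1
        idc[word].add(ID)  # a set ignores repeated IDs on its own

# Number of distinct IDs per word.
lenidc = {word: len(id_set) for word, id_set in idc.items()}

print(dict(wc))
print(lenidc)
```

With this sample, "apple" occurs twice but only under ID "a", so its count is 2 while its number of IDs is 1.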
After adding lenidc.values() to dwc under an "ids" key and building dfwc from it, I got:
   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana
The pitfall of this method is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order, so the columns built from them can end up misaligned. To eliminate this problem, you'll want to merge the dicts together. This is how I did it:
# Makes it so the values in the wc dict are a tuple in
# (word_count, id_count) form
for key, value in lenidc.items():
    wc[key] = (wc[key], value)
# Now, when you construct dwc, for count and ids you only want to use
# the first and second elements of each tuple respectively.
dwc = {"word": Series(list(wc.keys())),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
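If you'd rather avoid the merge step, one alternative (a sketch, not what I did above) is to keep both pieces of information in a single dict from the start, so there is only ever one set of keys. The names `stats` and `columns` and the sample data are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data standing in for df.ID and df.words.
ids = ["a", "b", "c"]
texts = ["apple banana apple", "banana", "banana kiwi kiwi"]

# word -> [occurrence count, set of IDs]. One dict means one key order.
stats = defaultdict(lambda: [0, set()])

for ID, words in zip(ids, texts):
    for word in words.split():
        entry = stats[word]
        entry[0] += 1
        entry[1].add(ID)

# Build every column from the same ordered list of words,
# so row i always describes the same word in all three columns.
ordered_words = sorted(stats)
columns = {
    "word": ordered_words,
    "count": [stats[w][0] for w in ordered_words],
    "ids": [len(stats[w][1]) for w in ordered_words],
}
```

The `columns` dict holds plain lists of equal length, so it should be usable in place of dwc when constructing the DataFrame.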