Question

In addition to counting the frequency of the words in a document, I'd like to count the number of distinct ids each word is associated with. It's easier to explain with an example:

from pandas import *
from collections import defaultdict
d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)

>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon

# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn in to a DataFrame
dwc = {"word": Series(wc.keys()),
      "count": Series(wc.values())}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana

Counting the frequency of words is the simple part, as shown above. What I'd like to do is obtain output like the following, which gives the number of distinct ids associated with each word:

   count        word  ids
0      2        kiwi    1
1      1  strawberry    1
2      3       lemon    2
3      3       apple    1
4      4      banana    3  

Ideally I'd like to do this at the same time as counting the word frequency, but I'm not sure how to integrate the two.

Any pointers would be much appreciated!


Solution

I'm not too experienced with pandas, but you can do something like this. This method keeps a second dict whose keys are the words and whose values are sets of all the IDs each word appears under.

wc = defaultdict(int)
idc = defaultdict(set)

for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one 
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)

After this idc looks like:

defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})

So you'll have to get the length of each set. I used this:

lenidc = dict((key, len(value)) for key, value in idc.iteritems())

After adding lenidc.values() under an "ids" key in dwc and rebuilding dfwc (a sketch of that step follows the output below), I got:

   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana
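
For concreteness, a minimal sketch of that construction step (reusing the wc and lenidc dicts built above, in the same Python 2 style as the question; the pitfall discussed next applies to it):

# naive construction: pair the word counts with the id counts column by column
dwc = {"word": Series(wc.keys()),
       "count": Series(wc.values()),
       "ids": Series(lenidc.values())}
dfwc = DataFrame(dwc)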

The pitfall of this method is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order. So you'll want to merge the dicts together to eliminate this problem. This is how I did it:

# Makes it so the values in the wc dict are a tuple in 
# (word_count, id_count) form
for key, value in lenidc.iteritems():
    wc[key] = (wc[key], value)

# Now, when you construct dwc, for count and id you only want to use
# the first and second columns respectively. 
dwc = {"word": Series(wc.keys()), 
       "count": Series([v[0] for v in wc.values()]), 
       "ids": Series([v[1] for v in wc.values()])}

OTHER TIPS

There's probably a slicker way to do this, but I'd approach it in two steps. First, flatten it, and then make a new dataframe with the information we want:

import pandas as pd

# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
new = df.ix[index]
new["words"] = s.values

# now group and build 
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})

which produces

>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
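
As a side note, .ix has since been removed from pandas. On pandas 0.25 or later, a roughly equivalent sketch of the flatten-and-group steps might use explode instead (treat this as a sketch, not a drop-in replacement):

import pandas as pd

# split each row's words into a list, then explode so each word gets its own row
new = df.assign(words=df["words"].str.split()).explode("words")

grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})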

Step-by-step. We start from the original DataFrame:

>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon

Pull apart the multi-fruit elements:

>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0  0         apple
   1        banana
   2         apple
   3    strawberry
   4        banana
   5         lemon
1  0         apple
2  0        banana
3  0        banana
   1         lemon
4  0          kiwi
5  0          kiwi
   1         lemon
dtype: object

Get the indices that align these with the original frame:

>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)

And then take the original frame from this perspective:

>>> new = df.ix[index]
>>> new["words"] = s.values
>>> new
  ID       words
0  a       apple
0  a      banana
0  a       apple
0  a  strawberry
0  a      banana
0  a       lemon
1  a       apple
2  b      banana
3  c      banana
3  c       lemon
4  c        kiwi
5  c        kiwi
5  c       lemon

This is something more like what we can work with. In my experience, half the effort is getting your data into the right format to start with. After this, it's easy:

>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
            count  ids
words                 
apple           3    1
banana          4    3
kiwi            2    1
lemon           3    2
strawberry      1    1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1

Note that we could have found this information simply by using .describe():

>>> new.groupby("words")["ID"].describe()
words             
apple       count     3
            unique    1
            top       a
            freq      3
banana      count     4
            unique    3
            top       a
            freq      2
kiwi        count     2
            unique    1
            top       c
            freq      2
lemon       count     3
            unique    2
            top       c
            freq      2
strawberry  count     1
            unique    1
            top       a
            freq      1
dtype: object

And we could alternatively have started from this and then pivoted to get your desired output.
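
For example, a hedged sketch of that pivot (assuming the Series-shaped describe() output shown above; on newer pandas, describe() on a grouped object column already returns a DataFrame, so the unstack() would be unnecessary):

# pivot the describe() output so count/unique/top/freq become columns,
# then keep only the pieces we want (counts may come back as object dtype here)
desc = new.groupby("words")["ID"].describe().unstack()
summary = (desc[["count", "unique"]]
           .rename(columns={"unique": "ids"})
           .reset_index()
           .rename(columns={"words": "word"}))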
