Is there a better way to get just 'important words' from a list in python?

Question

I disagree. Making a list of common words is the right approach; there is no easier way to filter out words like "the", "for", "I", and "am". However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.

Some suggestions:
1) common_words should be a set(). Since your list is long, this should speed things up: the in operation is O(1) on average for sets, while for lists it is O(n).
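
To illustrate the set suggestion, here is a minimal sketch. The words in common_words and the helper name is_common are placeholders for illustration, not the list from the question.

```python
# Hypothetical stop-word list; substitute the real one from the question.
common_words = set(["the", "for", "i", "am", "a", "of"])

def is_common(word):
    # Set membership is a hash lookup (O(1) on average),
    # whereas a list would be scanned element by element (O(n)).
    return word.lower() in common_words
```

The only change needed in existing code is how common_words is built; the in test itself stays the same.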

2) Getting rid of all number strings is trivial. One way you could do it is:

all(c.isdigit() for c in word)

If this returns True, then the word is just a series of digits.
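
As a sketch of how that check could be applied, the snippet below drops purely numeric strings from a list. The name drop_numbers and the sample data are made up for illustration. Calling str.isdigit() on the whole word gives the same answer as the per-character check for any non-empty string.

```python
def drop_numbers(words):
    # Keep only the strings that are not made up entirely of digits.
    return [w for w in words if not w.isdigit()]

sample = ["python", "158", "2nd", "42"]
filtered = drop_numbers(sample)  # ['python', '2nd']
```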

3) Getting rid of the d... is a little trickier. It depends on how you define a non-word. This:

tf = [ c.isalpha() for c in word ]

Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:

t = tf.count(True)
f = tf.count(False)

You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:

def check_wordiness(word):
    # This returns true only if a word is all letters
    return all([ c.isalpha() for c in word ])
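
The looser definition mentioned above (more non-letter characters than letters) could be sketched like this. The function name is_mostly_letters and the handling of ties and empty strings are assumptions, not something taken from the question.

```python
def is_mostly_letters(word):
    # One True/False flag per character: True for letters.
    tf = [c.isalpha() for c in word]
    letters = tf.count(True)
    non_letters = tf.count(False)
    # Assumption: ties and empty strings are treated as non-words.
    return letters > non_letters
```

With this rule a word like "don't" passes, while "d..." and "158" do not.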

4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant. You could rewrite the last bit as:

for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to define it up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        # (the parenthesized form works in Python 2 and 3)
        print("%i. '%s'" % (number, word))
        number += 1
    # This could go under the if statement. You only want to add
    # words that could be added again.  Why add words that are being
    # filtered out anyway?
    already.append(w)

    # this wasn't indented correctly before
    if number == many:
        break
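
Putting the suggestions together, one possible shape for the whole filter is sketched below. The function name important_words, its parameters, and the sample data are hypothetical. It assumes top_words is already ordered by frequency, that many is a positive count of words wanted, and that a "word" means a string made up only of letters.

```python
def important_words(top_words, common_words, many):
    common = set(common_words)  # fast membership tests
    seen = set()                # lowercased words already kept
    result = []
    for word in top_words:
        w = word.lower()
        # Skip stop words, repeats, and anything that is not all letters.
        if w in common or w in seen or not w.isalpha():
            continue
        seen.add(w)
        result.append(word)
        if len(result) == many:
            break
    return result

# Made-up sample data, numbered from 1 as in the original output.
top = ["The", "Python", "158", "python", "d...", "list"]
for number, word in enumerate(important_words(top, ["the", "for", "i", "am"], 2), 1):
    print("%i. '%s'" % (number, word))
```

Returning a list and numbering it with enumerate afterwards removes the need for separate counter and number variables.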

Hope that helps.
