Question

I am getting a list of URLs, scraping them, and tokenizing the text with NLTK. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases, not the usual English "filler" words such as "as, and, like, to, am, for", etc. I know I could build a file with all the common English words and simply remove them from my scraped token list, but is there a built-in feature in some library that does this automatically?

I am essentially looking for useful words on a page that are not fluff and that give some context to what the page is about, much like the tags on Stack Overflow or the tags Google uses for SEO.


Solution

I think what you are looking for is stopwords.words from nltk.corpus:

>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
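
Note that the stopword corpus has to be downloaded once with nltk.download('stopwords') before it can be used. Below is a minimal sketch of how this might be applied to scraped tokens; the page_tokens list and the lowercasing step are assumptions for illustration, not part of the question:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus

sw = set(stopwords.words('english'))

# page_tokens is a hypothetical list of words scraped from a page
page_tokens = ["The", "quick", "brown", "fox", "is", "on", "the", "page"]

# lowercase before comparing, since the NLTK stopword list is lowercase
keywords = [w for w in page_tokens if w.lower() not in sw]
print(keywords)  # ['quick', 'brown', 'fox', 'page']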

Edit: searching for "stopword" gives possible duplicates: Stopword removal with NLTK, How to remove stop words using nltk or python. See the answers to those questions, and consider Effects of Stemming on the term frequency? too.

OTHER TIPS

While you can get robust lists of stop words in NLTK (and elsewhere), you can easily build your own list to suit the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you can catch them easily by sorting a frequency list in descending order and discarding the top n items.

In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
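
Here is a minimal sketch of that frequency-based approach using collections.Counter; the token list and the cut-off value are made up for illustration (in a real corpus you would pool all pages and use something like n = 100):

from collections import Counter

# tokens is a hypothetical list of words pooled from the scraped pages
tokens = ["the", "and", "scraper", "the", "page", "and", "keyword", "the"]

freq = Counter(tokens)

# treat the n most frequent word types as stop words
n = 2
stop_list = {word for word, count in freq.most_common(n)}

content_words = [w for w in tokens if w not in stop_list]
print(stop_list)      # {'the', 'and'}
print(content_words)  # ['scraper', 'page', 'keyword']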

It seems that you are interested in extracting keywords, however. For this task, raw frequency counts are not very useful. You need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many ways to do it. Tf-idf has been the industry standard since 1972.
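
As a rough illustration of tf-idf weighting, here is a hand-rolled sketch; the tiny document collection is made up, and for real work you would reach for a library implementation such as scikit-learn's TfidfVectorizer:

import math
from collections import Counter

# a tiny, made-up collection: the page of interest plus reference documents
docs = [
    ["python", "scraping", "keywords", "python"],
    ["python", "tutorial", "basics"],
    ["cooking", "recipe", "basics"],
]

page = docs[0]
tf = Counter(page)       # term frequency within the page
n_docs = len(docs)

def idf(term):
    # inverse document frequency: terms that are rare across the collection score higher
    df = sum(1 for d in docs if term in d)
    return math.log(n_docs / df)

# weight each term in the page by tf * idf
weights = {term: count * idf(term) for term, count in tf.items()}

for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(w, 3))

In this toy collection, "scraping" and "keywords" outrank "python" because "python" also appears in a reference document, which is exactly the effect the weighting is meant to capture.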

If you are going to spend time doing these tasks, get an introductory handbook for corpus linguistics or computational linguistics.

You can look at available corpus linguistics resources for data on word frequency (along with other annotations).

You can start from links on wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links

You can probably find more information at https://linguistics.stackexchange.com/

Licensed under: CC-BY-SA with attribution