Natural Language Processing - Features for Text Classification

Question 1

Natural language documents normally contain many words that appear only once, known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and 17% appear twice.
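To check this on your own corpus, here is a minimal sketch that counts words occurring exactly once. The regex tokenizer and the `hapax_ratio` name are assumptions for illustration; a real system would use a proper tokenizer.

```python
import re
from collections import Counter

def hapax_ratio(text):
    """Return the fraction of distinct words that occur exactly once."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

sample = "the whale swam and the ship sailed and the crew watched"
ratio = hapax_ratio(sample)  # 6 of the 8 distinct words occur once
```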

Therefore, including every word from a corpus normally results in an excessive number of features. To reduce the size of this feature space, NLP systems typically employ one or more of the following:

Removal of Stop Words -- for author classification, these are typically short and common words such as is, the, at, which, and so on.
Stemming -- popular stemmers (such as the Porter stemmer) use a set of rules to normalize the inflection of a word. E.g., walk, walking and walks are all mapped to the stem walk.
Correlation/Significance Threshold -- compute the Pearson correlation coefficient or the p-value of each feature with respect to the class label. Then set a threshold and remove all features whose correlation falls below it (or, when using p-values, whose p-value is above it).
Coverage Threshold -- similar to the above threshold, remove all features that do not appear in at least t documents, where t is very small (< 0.05%) with respect to the entire corpus size.
Filtering based on the part of speech -- for example, only considering verbs, or removing nouns.
Filtering based on the type of system -- for example, an NLP system for clinical text may only consider words that are found in a medical dictionary.
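A rough sketch of how three of these filters (stop words, coverage threshold, correlation threshold) could be combined is shown here. The stop word list, the `min_df` and `min_abs_corr` parameters, and the toy documents are all made up for illustration; stemming is left out because it needs a stemmer library.

```python
import math
from collections import Counter

STOP_WORDS = {"is", "the", "at", "which", "a", "and", "of"}  # illustrative list only

def tokenize(doc):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in doc.lower().split() if w not in STOP_WORDS]

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_features(docs, labels, min_df=2, min_abs_corr=0.5):
    """Keep words that appear in at least min_df documents and whose
    presence correlates with the label by at least min_abs_corr."""
    token_sets = [set(tokenize(d)) for d in docs]
    df = Counter()
    for tokens in token_sets:
        df.update(tokens)
    kept = []
    for word, count in df.items():
        if count < min_df:  # coverage threshold
            continue
        presence = [1 if word in s else 0 for s in token_sets]
        if abs(pearson(presence, labels)) >= min_abs_corr:  # correlation threshold
            kept.append(word)
    return sorted(kept)

docs = [
    "the patient is stable",
    "the patient is at risk",
    "the market is stable",
    "the market is at risk",
]
labels = [1, 1, 0, 0]  # hypothetical classes: 1 = clinical, 0 = finance
features = select_features(docs, labels)
```

In this toy example "stable" and "risk" appear equally often in both classes, so their correlation with the label is zero and they are dropped, while "patient" and "market" are kept.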

For stemming, removing stop words, indexing the corpus, and computing tf-idf or document similarity, I would recommend using Lucene. Search for "Lucene in 5 minutes" to find some quick and easy tutorials on using Lucene.
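If you just want to see what tf-idf and document similarity compute before setting up Lucene, here is a small pure-Python sketch. It is not Lucene's API or its exact scoring formula; the function names and the plain `tf * log(N / df)` weighting are assumptions chosen for simplicity.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: (c / len(tokens)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

docs = ["whale ship sea", "whale ship crew", "stock market price"]
vecs = tf_idf_vectors(docs)
related = cosine(vecs[0], vecs[1])    # documents share "whale" and "ship"
unrelated = cosine(vecs[0], vecs[2])  # documents share no words
```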

Question 2

In this kind of classification it is important that your feature vector is not too large. A large vector is mostly zeros, and that can hurt your results, because the vectors end up too close together and are hard to separate correctly. I would also recommend not using every bigram: choose the ones with the highest frequency in your text, which reduces the size of the vector while keeping enough information. See http://en.wikipedia.org/wiki/Curse_of_dimensionality for why this is recommended. Last but not least, consider how much data you have: the bigger your vector is, the more data you need.
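Picking the most frequent bigrams could look something like this. It is only a sketch: the whitespace tokenizer, the function names, and the value of `k` are assumptions, and you would tune `k` to the amount of data you have.

```python
from collections import Counter

def bigrams(doc):
    """Return the list of adjacent word pairs in a document."""
    tokens = doc.lower().split()
    return list(zip(tokens, tokens[1:]))

def top_bigrams(docs, k):
    """Return the k most frequent bigrams across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(bigrams(doc))
    return [bg for bg, _ in counts.most_common(k)]

def vectorize(doc, vocabulary):
    """Count how often each vocabulary bigram occurs in the document."""
    counts = Counter(bigrams(doc))
    return [counts[bg] for bg in vocabulary]

docs = ["new york is big", "new york is busy", "new york was big"]
vocab = top_bigrams(docs, k=2)  # feature vector has only 2 dimensions
vector = vectorize("new york is new york", vocab)
```

The vector length is fixed by `k` rather than by the number of distinct bigrams in the corpus, so it stays small no matter how much text you add.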