The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to tackle the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses that will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
- 10 documents is orders of magnitude too small to get any significant results and/or insight into the problem,
- there is no universal method of data preprocessing; you have to analyze your data through numerous tests and exploratory analysis,
- SVMs are parametric models: you cannot use a single `C` and `gamma` value and expect any reasonable results. You have to check dozens of combinations to even get a clue "where to search". The simplest method for doing so is the so-called *grid search*,
- 1000 features is a great number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first analyze simpler models that have a smaller chance of overfitting (linear or a low-degree polynomial),
- finally, is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It is doubtful, unless what you actually mean is 20-30% of documents.
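To make the grid-search point concrete, here is a minimal sketch (assuming scikit-learn) of searching over `C`, `gamma` and the kernel type on tf-idf features; the tiny corpus and labels are made up for illustration and far too small for real conclusions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny made-up corpus; in practice you need far more documents.
docs = [
    "cheap pills buy now", "limited offer buy cheap",
    "cheap offer pills now", "buy now limited pills",
    "meeting agenda project report", "project report due meeting",
    "agenda for project meeting", "report on meeting agenda",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(docs)

# Logarithmic grids over C and gamma, plus both a simple (linear) and a
# complex (RBF) kernel -- the usual way to get a clue "where to search".
grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf"],
     "C": [0.1, 1, 10, 100],
     "gamma": ["scale", 0.01, 0.1, 1]},
    cv=2,
)
grid.fit(X, labels)
print(grid.best_params_, grid.best_score_)
```

With a realistic dataset you would widen the grids (e.g. powers of ten from 1e-3 to 1e3) and use more cross-validation folds.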
Finally, regarding simple feature squashing:

> It either gives me 0, 1 of course.

It should result in values in the [0,1] interval, not just its endpoints. So if that is what you observe, you probably have an error in your implementation.
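For comparison, here is what correct min-max squashing looks like on a hypothetical feature column (the numbers are made up): only the minimum maps to 0 and the maximum to 1, everything else lands strictly inside the interval.

```python
import numpy as np

# Hypothetical raw feature values (e.g. term counts for one feature).
x = np.array([2.0, 5.0, 3.0, 9.0, 4.0])

# Min-max squashing into [0, 1]: subtract the min, divide by the range.
squashed = (x - x.min()) / (x.max() - x.min())
print(squashed)
```

If your implementation produces only 0s and 1s, check that you are not accidentally dividing integers, binarizing, or computing the min/max per element instead of per feature.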