Question

I have about 44 million training examples across about 6,200 categories. After training, the model comes out to about 450 MB.

And while testing with 5 parallel mappers (each given enough RAM), classification proceeds at a rate of ~4 items per second, which is WAY too slow.

How can I speed things up? One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
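
For reference, the kind of document-frequency pruning that maxDFPercent does could be sketched in scikit-learn like this (a toy illustration with made-up thresholds, not my actual pipeline):

    # Toy illustration of document-frequency pruning, analogous to Mahout's
    # maxDFPercent; the corpus and thresholds are made up for this example.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog ran", "the cat ran"]

    # max_df=0.8 drops terms in more than 80% of documents (like maxDFPercent=80);
    # min_df=2 drops low-support terms appearing in fewer than 2 documents.
    vec = CountVectorizer(max_df=0.8, min_df=2)
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())  # ['cat' 'ran']; 'the' pruned as too frequent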

Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
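
Roughly, I imagine the cluster-then-classify routing working like this (a minimal scikit-learn sketch; everything here is hypothetical, not my actual setup):

    # Minimal sketch of cluster-then-classify: route each item to a cluster,
    # then score it against that cluster's (smaller) model.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.naive_bayes import MultinomialNB

    def train_per_cluster(X, y, n_clusters=4):
        """Fit a k-means router plus one small NB model per cluster."""
        y = np.asarray(y)
        km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
        cluster_ids = km.fit_predict(X)
        models = {}
        for c in range(n_clusters):
            mask = cluster_ids == c
            if mask.any():  # guard against empty clusters
                models[c] = MultinomialNB().fit(X[mask], y[mask])
        return km, models

    def classify(km, models, X):
        """Route each item to its cluster, then use that cluster's model."""
        cluster_ids = km.predict(X)
        out = np.empty(X.shape[0], dtype=object)
        for c, model in models.items():
            mask = cluster_ids == c
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out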

Any other thoughts?

Edit:

After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those tightly-knit groups together with samples from the items that are not so close to one another.

I also started thinking about using some form of data-normalization technique that incorporates edit distances over n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html).
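
Something along these lines (a rough Python sketch of n-gram overlap; note that Lucene's NGramDistance actually computes an n-gram-based edit distance, so this only approximates the idea):

    # Rough sketch of n-gram string similarity. Lucene's NGramDistance is an
    # n-gram-weighted edit distance; this Jaccard overlap only approximates it.
    def char_ngrams(s, n=2):
        s = f" {s} "  # pad so word boundaries contribute n-grams
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def ngram_similarity(a, b, n=2):
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        if not ga or not gb:
            return 0.0
        return len(ga & gb) / len(ga | gb)

    print(ngram_similarity("color", "colour"))  # ~0.62 for near-duplicates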

I'm also considering using the Hadoop Streaming API to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use the liblinear mentioned in one of the answers below).
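
A streaming mapper could be as simple as this sketch (the pickle file names are hypothetical; the model would be trained offline and shipped to the workers, e.g. with -files):

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: load a pre-trained scikit-learn
    # vectorizer and model, then classify one "id<TAB>text" record per line.
    import pickle
    import sys

    with open("vectorizer.pkl", "rb") as f:  # assumed artifact names
        vectorizer = pickle.load(f)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    for line in sys.stdin:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        features = vectorizer.transform([text])
        print(f"{doc_id}\t{model.predict(features)[0]}")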

Solution 3

"but i fear losing accuracy" Have you actually tried using less features or less documents? You may not lose as much accuracy as you fear. There may be a few things at play here:

  • Such a large number of documents is unlikely to come from the same time period. Over time, the content of a stream will inevitably drift, and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data just confuses it. You may get much better performance if you train on less data.
  • The majority of features are not helpful, as @Anony-Mousse already said. You might want to perform some form of feature selection before you train your classifier; this will also speed up training. I've had good results in the past with mutual information (see the sketch after this list).
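
A minimal sketch of the kind of selection I mean, using scikit-learn (a toy corpus; on real data you would pick k empirically, e.g. around the 200k I mention below):

    # Toy sketch of feature selection before training: keep only the k terms
    # carrying the most information about the label. chi2 is a cheaper
    # alternative that behaves similarly to mutual information on count data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    docs = ["buy cheap pills now", "cheap pills online",
            "meeting at noon", "lunch at noon today"]
    labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

    X = CountVectorizer().fit_transform(docs)
    selector = SelectKBest(mutual_info_classif, k=4)  # keep 4 best terms
    X_small = selector.fit_transform(X, labels)
    print(X_small.shape)  # (4, 4): same documents, far fewer features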

I've previously trained classifiers on a data set of similar scale and found that the system worked best with only 200k features, and that using more than 10% of the data for training did not improve accuracy at all.

PS: Could you tell us a bit more about your problem and data set?


Edit after question was updated: Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.

I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run them for each pair of words in the corpus. That's a long time!

I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but you are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there is a reason Naive Bayes is called naive) because they are much faster than the alternatives, while performance is often just a tiny bit worse. Let me give you an example from my experience:

  • RBF SVM: 85% F1 score, training time ~1 month
  • Linear SVM: 83% F1 score, training time ~1 day
  • Naive Bayes: 82% F1 score, training time ~1 day

You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?
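
If you want to reproduce that kind of comparison yourself, here is a small sketch using scikit-learn and the 20 newsgroups corpus as a stand-in (your numbers will differ):

    # Sketch of the speed/accuracy trade-off: Naive Bayes vs. a liblinear-backed
    # linear SVM on identical features. 20 newsgroups is only a stand-in corpus.
    import time
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    data = fetch_20newsgroups(subset="all")
    X = TfidfVectorizer(max_df=0.8).fit_transform(data.data)
    Xtr, Xte, ytr, yte = train_test_split(X, data.target, random_state=0)

    for clf in (MultinomialNB(), LinearSVC()):
        start = time.time()
        clf.fit(Xtr, ytr)
        f1 = f1_score(yte, clf.predict(Xte), average="macro")
        print(f"{type(clf).__name__}: F1={f1:.3f}, {time.time() - start:.1f}s")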

Other tips

Prune stopwords and otherwise useless words (too low support, etc.) as early as possible.

Depending on how you use clustering, it may actually make the test phase in particular even more expensive.

Try tools other than Mahout. I found Mahout to be really slow in comparison; it seems to come with a really high overhead somewhere.

Using fewer training examples would be an option. You will see that after a certain number of training examples, your classification accuracy on unseen examples won't increase any further. I would recommend trying to train with 100, 500, 1000, 5000, ... examples per category, using 20% for cross-validating the accuracy. When it stops increasing, you have found the amount of data you need, which may be a lot less than you are using now.
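
A sketch of that experiment (assuming X is a feature matrix with non-negative counts and y an array of category labels; the helper is hypothetical):

    # Sketch of the plateau experiment: train on n examples per category and
    # validate on a held-out 20% split; stop increasing n once accuracy flattens.
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    def accuracy_per_train_size(X, y, sizes=(100, 500, 1000, 5000)):
        y = np.asarray(y)
        Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
        results = {}
        for n in sizes:
            # take up to n examples from each category
            idx = np.concatenate([np.flatnonzero(ytr == c)[:n]
                                  for c in np.unique(ytr)])
            model = MultinomialNB().fit(Xtr[idx], ytr[idx])
            results[n] = accuracy_score(yval, model.predict(Xval))
        return results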

Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow