Document Classification using Naive Bayes classifier

Question 1

You do not necessarily have to subsample dataset A to reduce its number of instances. Several methods are available for learning efficiently from imbalanced datasets, such as majority undersampling (exactly what you did), minority oversampling, and SMOTE, among others. Here is an empirical comparison of these methods: http://machinelearning.org/proceedings/icml2007/papers/62.pdf
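To make the first two options concrete, here is a minimal sketch of random undersampling and random oversampling using only the Python standard library. The function names and the toy 98/2 split are my own assumptions for illustration. SMOTE is not shown, since it interpolates between minority examples in feature space and needs a numeric representation of the documents.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect the samples of each class into its own list."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]
    rng.shuffle(out)
    return out


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate examples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out


# Hypothetical toy data: 98 documents of class A, 2 of class B.
docs = [f"doc_{i}" for i in range(100)]
labels = ["A"] * 98 + ["B"] * 2

under = random_undersample(docs, labels)
over = random_oversample(docs, labels)
print(Counter(y for _, y in under))
print(Counter(y for _, y in over))
```

Undersampling throws away most of class A, while oversampling keeps everything but repeats the few B documents many times, which can encourage overfitting to them.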

Alternatively, you may define a custom cost matrix for the classifier. In other words, assuming B is the positive class, you may define cost(False Positive) < cost(False Negative). In this case, the classifier's output will be biased toward the positive class. Here is a very helpful tutorial: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.4418&rep=rep1&type=pdf
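As a rough sketch of how a cost matrix changes the decision, the snippet below picks the class with the lowest expected cost given the classifier's posterior probabilities. The cost values (1 for a false positive, 10 for a false negative) and the posteriors are made-up numbers, and `min_cost_class` is a hypothetical helper, not part of any library.

```python
def min_cost_class(posteriors, cost):
    """Return the prediction with the lowest expected cost.

    posteriors: dict mapping class -> P(class | document)
    cost:       cost[true_class][predicted_class]
    """
    def expected_cost(pred):
        return sum(posteriors[true] * cost[true][pred] for true in posteriors)

    return min(posteriors, key=expected_cost)


# Assumed costs, with B as the positive class:
# a false negative (true B predicted as A) is 10x worse than a false positive.
asymmetric = {
    "A": {"A": 0.0, "B": 1.0},   # true A predicted B: false positive
    "B": {"A": 10.0, "B": 0.0},  # true B predicted A: false negative
}
uniform = {
    "A": {"A": 0.0, "B": 1.0},
    "B": {"A": 1.0, "B": 0.0},
}

# A document the classifier thinks is probably A.
posteriors = {"A": 0.85, "B": 0.15}

print(min_cost_class(posteriors, uniform))     # expected costs: A = 0.15, B = 0.85
print(min_cost_class(posteriors, asymmetric))  # expected costs: A = 1.50, B = 0.85
```

With uniform costs the document is labeled A; with the asymmetric costs the same posteriors yield B. For two classes this amounts to predicting B whenever P(B) exceeds cost(FP) / (cost(FP) + cost(FN)), which is 1/11 here instead of 1/2.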

Question 2

A lot of this gets into how good "accuracy" is as a measure of performance, and that depends on your problem. If misclassifying "A" as "B" is just as bad as misclassifying "B" as "A", then there is little reason to do anything other than mark everything as "A", since you know that will reliably get you 98% accuracy (so long as the imbalanced distribution is representative of the true distribution).
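The arithmetic behind that baseline is easy to check. The sketch below assumes a 98/2 split between A and B and scores the "always predict A" rule on both accuracy and recall for B.

```python
# Assumed class distribution: 98% A, 2% B.
labels = ["A"] * 98 + ["B"] * 2

# Trivial baseline: predict the majority class for every document.
preds = ["A"] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Recall on B: the fraction of true B documents that were predicted as B.
b_found = sum(p == "B" and y == "B" for p, y in zip(preds, labels))
recall_b = b_found / labels.count("B")

print(f"accuracy = {accuracy:.2f}, recall on B = {recall_b:.2f}")
```

The baseline reaches 98% accuracy while never finding a single B document, which is why accuracy alone can be misleading when you care about the rare class.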

Without knowing your problem (and whether accuracy is the measure you should use), the best answer I could give is "it depends on the data set". It is possible that you could get past 99% accuracy with standard Naive Bayes, though it may be unlikely. For Naive Bayes in particular, one thing you could do is to disable the use of priors (the prior is essentially the proportion of each class). This has the effect of pretending that every class is equally likely to occur, though the model parameters will still have been learned from uneven amounts of data.
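To show what disabling the priors does, here is a small multinomial Naive Bayes written from scratch. The class name `TinyMultinomialNB`, the `use_priors` flag, and the toy documents are all assumptions for illustration; this is a sketch, not a production implementation.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, use_priors=True, alpha=1.0):
        self.use_priors = use_priors
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        vocab = sorted({w for d in docs for w in d.split()})
        class_counts = Counter(labels)
        word_counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            word_counts[y].update(d.split())

        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            if self.use_priors:
                # Prior = proportion of training documents in class c.
                prior = class_counts[c] / len(labels)
            else:
                # Pretend every class is equally likely.
                prior = 1.0 / len(self.classes)
            self.log_prior[c] = math.log(prior)
            total = sum(word_counts[c].values()) + self.alpha * len(vocab)
            self.log_lik[c] = {
                w: math.log((word_counts[c][w] + self.alpha) / total) for w in vocab
            }
        return self

    def predict(self, doc):
        def score(c):
            # Words never seen in training are ignored.
            return self.log_prior[c] + sum(
                self.log_lik[c][w] for w in doc.split() if w in self.log_lik[c]
            )
        return max(self.classes, key=score)


# Hypothetical toy corpus: 98 documents of class A, 2 of class B.
docs = ["budget report"] * 90 + ["score report"] * 8 + ["score match"] * 2
labels = ["A"] * 98 + ["B"] * 2

with_priors = TinyMultinomialNB(use_priors=True).fit(docs, labels)
without_priors = TinyMultinomialNB(use_priors=False).fit(docs, labels)

print(with_priors.predict("score"))
print(without_priors.predict("score"))
```

On this toy data, the word "score" is relatively more frequent in B (smoothed likelihood 3/8 versus 9/200 in A), so the model without priors labels it B, while the 0.98 prior pulls the standard model to A. If you use scikit-learn, `MultinomialNB` exposes a `fit_prior` parameter that serves the same purpose.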

Your proposed solution is common practice, and it sometimes works well. Another approach is to create synthetic data for the smaller class (how would depend on your data; for text documents I'm not aware of any particularly good way). A third is to increase the weights of the data points in the under-represented classes.
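For the weighting option, one common heuristic (an assumption on my part, not the only choice) is to weight each class inversely to its frequency, so that every class contributes the same total weight. A minimal sketch with a hypothetical helper:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


# Assumed class distribution: 98 documents of A, 2 of B.
labels = ["A"] * 98 + ["B"] * 2

weights = balanced_class_weights(labels)
sample_weights = [weights[y] for y in labels]

# Total weight carried by each class after reweighting.
total_a = sum(w for w, y in zip(sample_weights, labels) if y == "A")
total_b = sum(w for w, y in zip(sample_weights, labels) if y == "B")

print(weights)
print(total_a, total_b)
```

Each B document ends up weighted 25 times as heavily as it would be unweighted, and both classes carry a total weight of about 50. A learner that accepts per-example weights can then use `sample_weights` in place of plain counts.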

You can search for "imbalanced classification" and find a lot more information about these types of problems (they are among the harder ones).

If accuracy is not actually a good measure for your problem, you can search for more information about "cost-sensitive classification", which should be helpful.