Question

I am creating my own implementation of a Naïve Bayes classifier. While its behaviour and functionality are clear to me, my concerns are about the nature of the training and testing data.

I acquired several sets of product reviews from Amazon. The first thing I do is parse them: I take the rating (1 to 5 stars) and the text, which I clean with a regex so that it contains only lowercase alphabetical characters and spaces. Next, I convert the ratings to polar values: 1 and 2 stars become "-" and 4 and 5 stars become "+". I'm intentionally skipping reviews with 3 stars; could this be an issue?
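
For concreteness, a minimal sketch of that preprocessing step (assuming each review arrives as a star rating plus raw text; the helper name `parse_review` is hypothetical):

```python
import re

def parse_review(rating, text):
    """Clean a review's text and map its star rating to a polarity label.

    Returns None for 3-star reviews, which are skipped entirely.
    """
    # Keep only lowercase letters and spaces, dropping everything else
    cleaned = re.sub(r"[^a-z ]", "", text.lower())
    if rating in (1, 2):
        return ("-", cleaned)
    if rating in (4, 5):
        return ("+", cleaned)
    return None  # 3-star reviews are intentionally dropped


# Example: parse_review(5, "Great product!") -> ("+", "great product")
```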

Here are my real concerns. When using a percentage split to generate training and testing sets, should both of them contain the same share of positive and negative reviews (for example, 7 positive and 7 negative reviews for training and 3 positive and 3 negative reviews for testing)? Right now I'm taking as many positive as negative reviews from the chosen set, but I'm wondering whether that is the right approach. For instance, if a set contains 7 positive reviews and 4 negative ones, I discard 3 positive reviews to equate them.
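
For reference, a stratified split, which preserves the class ratio in both sets instead of discarding reviews to force a 50/50 balance, could look like this. The toy `texts`/`labels` lists are placeholders, and the sketch assumes scikit-learn's `train_test_split`:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the parsed reviews
texts = ["great product", "awful quality", "love it", "broke fast",
         "works well", "waste of money", "highly recommend", "terrible"]
labels = ["+", "-", "+", "-", "+", "-", "+", "-"]

# stratify=labels keeps the +/- proportion identical in the training
# and testing sets, without throwing away any reviews
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)
```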

Furthermore, I observed that negative reviews tend to contain longer texts on average. So, if I'm using an equal number of positive and negative reviews but they differ in average text length, would this affect the way my classifier makes predictions?
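
To quantify that imbalance before worrying about its effect, something like the following (a hypothetical helper, assuming the same `texts`/`labels` lists as above) would report the average token count per class:

```python
from statistics import mean

def avg_length_by_class(texts, labels):
    """Average token count per polarity class."""
    lengths = {"+": [], "-": []}
    for text, label in zip(texts, labels):
        lengths[label].append(len(text.split()))
    return {label: mean(counts) for label, counts in lengths.items()}

# Example: avg_length_by_class(texts, labels) -> {"+": 2.0, "-": 2.0}
```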
