Question

I'm working on sentiment analysis for text classification, and I want to classify tweets from Twitter into 3 categories: positive, negative, or neutral. I have 210 training examples, and I'm using Naive Bayes as the classifier. I'm implementing it in PHP, with MySQL as the database for the training data. What I've done, in sequence, is:

  1. I split my data for 10-fold cross-validation into 189 training examples and 21 test examples.
  2. I insert the training examples into the database, so my classifier can classify based on them.
  3. Then I classify the test examples using my classifier, giving 21 predictions.
  4. I repeat steps 2 and 3 ten times, once for each fold.
  5. I evaluate the accuracy of the classifier on each fold, giving 10 accuracy results, and then take the average.
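
In rough pseudocode, the loop looks like this (sketched here in Python just to show the structure; my real code is PHP with MySQL, and train and classify stand in for my own Naive Bayes routines):

    import random

    def cross_validate(examples, k=10):
        # examples: list of (tweet_text, label) pairs -- 210 of them in my case
        random.shuffle(examples)
        fold_size = len(examples) // k                      # 210 // 10 = 21
        accuracies = []
        for i in range(k):
            test_fold = examples[i * fold_size:(i + 1) * fold_size]                  # 21 test examples
            train_folds = examples[:i * fold_size] + examples[(i + 1) * fold_size:]  # 189 training examples
            model = train(train_folds)                      # step 2: load the training examples (into MySQL, in my case)
            predictions = [classify(model, text) for text, _ in test_fold]           # step 3: 21 predictions
            correct = sum(1 for (_, label), pred in zip(test_fold, predictions) if pred == label)
            accuracies.append(correct / len(test_fold))     # step 5: accuracy of this fold
        return sum(accuracies) / k                          # average of the 10 accuracies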

What I want to know is:

  1. Which part is the learning process? What are its input, process, and output?
  2. Which part is the validation process? What are its input, process, and output?
  3. Which part is the testing process? What are its input, process, and output?

I just want to make sure that my understanding of these 3 processes (learning, validation, and testing) is correct.


Solution

In your example, I don't think there is a meaningful distinction between validation and testing.

  • Learning is when you train the model, which means that your outputs are, in general, parameters, such as the coefficients in a regression model or the connection weights in a neural network. In your case, the outputs are estimates of the probability of seeing a word w in a tweet given that the tweet is positive, P(w|+), negative, P(w|-), or neutral, P(w|*), and also the probabilities of not seeing a word given each class, P(~w|+), etc. The inputs are the training data, and the process is simply estimating those probabilities by measuring the frequencies with which words occur (or don't occur) in each of your classes, i.e. just counting! (There is a code sketch of both phases after this list.)

  • Testing is where you see how well your trained model does on data it hasn't seen before. Training tends to produce outputs that overfit the training data, i.e. the coefficients or probabilities are "tuned" to noise in the training data, so you need to see how well your model does on data it hasn't been trained on. In your case, the inputs are the test examples, the process is applying Bayes' theorem, and the outputs are classifications of the test examples (you classify based on which probability is highest).
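
To make those inputs and outputs concrete, here is a minimal sketch of both phases in Python (the names are mine, not taken from your PHP code): training just counts how often each word occurs in tweets of each class, and classification applies Bayes' theorem to those counts.

    import math
    from collections import defaultdict

    def train(examples):
        """Learning = counting. examples is a list of (set_of_words, label) pairs."""
        class_counts = defaultdict(int)                       # tweets per class
        word_counts = defaultdict(lambda: defaultdict(int))   # word -> class -> tweets containing it
        vocabulary = set()
        for words, label in examples:
            class_counts[label] += 1
            for w in words:
                word_counts[w][label] += 1
                vocabulary.add(w)
        return class_counts, word_counts, vocabulary          # the model is just these counts

    def classify(model, words, priors=None, floor=None):
        """Testing = applying Bayes' theorem: pick the class with the highest posterior."""
        class_counts, word_counts, vocabulary = model
        n = sum(class_counts.values())
        floor = floor if floor is not None else 1.0 / n       # replaces zero probabilities (see the worked example below)
        best_label, best_score = None, float("-inf")
        for label, n_c in class_counts.items():
            prior = priors[label] if priors else n_c / n      # P(class): supplied, or estimated from the counts
            score = math.log(prior)                           # work in logs to avoid underflow
            for w in vocabulary:
                p_w = word_counts[w][label] / n_c             # P(w | class)
                p = p_w if w in words else 1.0 - p_w          # P(w|class) if present, P(~w|class) if not
                score += math.log(max(p, floor))
            if score > best_score:
                best_label, best_score = label, score
        return best_label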

I have come across cross-validation -- in addition to testing -- in situations where you don't know which model to use (or where there are additional, "extrinsic" parameters to estimate that can't be estimated in the training phase). In that case you split the data into 3 sets.

So, for example, in linear regression you might want to fit a straight-line model, i.e. estimate p and c in y = px + c, or you might want to fit a quadratic model, i.e. estimate p, c, and q in y = px + qx^2 + c. What you do here is split your data into three parts. You train the straight-line and quadratic models using part 1 of the data (the training examples). Then you see which model is better by using part 2 of the data (the cross-validation examples). Finally, once you've chosen your model, you use part 3 of the data (the test set) to determine how good your model is. Regression is a nice example because a quadratic model will always fit the training data at least as well as the straight-line model, so you can't just look at the errors on the training data alone to decide which model to use.
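
Here is a minimal sketch of that three-way split in Python, using numpy's polyfit purely for illustration and completely made-up data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 300)
    y = 1.5 * x + 0.5 * x**2 + rng.normal(0, 1, 300)      # made-up data with a genuine quadratic term

    # Part 1: training, part 2: cross-validation, part 3: test
    x_tr, y_tr = x[:100], y[:100]
    x_cv, y_cv = x[100:200], y[100:200]
    x_te, y_te = x[200:], y[200:]

    def mse(coeffs, xs, ys):
        return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

    line = np.polyfit(x_tr, y_tr, deg=1)                  # fit y = px + c on the training data
    quad = np.polyfit(x_tr, y_tr, deg=2)                  # fit y = px + qx^2 + c on the training data

    # The quadratic never loses on the training data, so compare on the cross-validation set instead
    best = line if mse(line, x_cv, y_cv) < mse(quad, x_cv, y_cv) else quad

    print("test error of the chosen model:", mse(best, x_te, y_te))   # final, honest estimate of quality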

In the case of Naive Bayes, it might make sense to explore different prior probabilities, i.e. P(+), P(-), P(*), using a cross-validation set, and then use the test set to see how well you've done with the priors chosen using cross-validation and the conditional probabilities estimated using the training data.
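
For example, building on the hypothetical train and classify sketch above (the candidate priors here are made up, and the three *_examples arguments are assumed to hold your three splits of labelled tweets):

    def choose_priors(candidate_priors, training_examples, validation_examples, test_examples):
        """Pick priors on the cross-validation set, then report accuracy once on the test set."""
        model = train(training_examples)        # conditional probabilities come from the training data only

        def accuracy(priors, examples):
            hits = sum(classify(model, words, priors=priors) == label for words, label in examples)
            return hits / len(examples)

        best = max(candidate_priors, key=lambda p: accuracy(p, validation_examples))
        return best, accuracy(best, test_examples)

    # e.g. choose_priors([{"+": 1/3, "-": 1/3, "*": 1/3}, {"+": 0.25, "-": 0.25, "*": 0.5}], ...)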


As an example of how to calculate the conditional probabilities, consider 4 tweets, which have been classified as "+" or "-" by a human:

  • T1, -, contains "hate", "anger"
  • T2, +, contains "don't", "hate"
  • T3, +, contains "love", "friend"
  • T4, -, contains "anger"

So for P(hate|-) you count the number of negative tweets in which "hate" appears. It appears in T1 but not in T4, so P(hate|-) = 1/2. For P(~hate|-) you do the opposite: "hate" doesn't appear in 1 of the 2 negative tweets, so P(~hate|-) = 1/2.

Similar calculations give P(anger|-) = 1, and P(love|+) = 1/2.

A fly in the ointment is that any probability that is 0 will mess things up in the calculation phase, so instead of using a zero probability you use a very small number, like 1/n or 1/n^2, where n is the number of training examples. So you might put P(~anger|-) = 1/4 or 1/16.
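
As a quick check of these numbers, here is a self-contained sketch of the counting and the zero-probability floor for the four example tweets (the helper name is made up):

    tweets = [
        ({"hate", "anger"}, "-"),      # T1
        ({"don't", "hate"}, "+"),      # T2
        ({"love", "friend"}, "+"),     # T3
        ({"anger"}, "-"),              # T4
    ]

    def p_word_given_class(word, label, examples, floor=1/4):   # floor = 1/n with n = 4 training examples
        in_class = [words for words, l in examples if l == label]
        count = sum(word in words for words in in_class)        # tweets of this class containing the word
        p = count / len(in_class)
        return max(p, floor), max(1 - p, floor)                 # (P(word|class), P(~word|class)), floored to avoid zeros

    print(p_word_given_class("hate", "-", tweets))     # (0.5, 0.5)
    print(p_word_given_class("anger", "-", tweets))    # (1.0, 0.25) -- P(~anger|-) floored up from 0 to 1/4
    print(p_word_given_class("love", "+", tweets))     # (0.5, 0.5)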

(I've put the maths of the calculation in this answer.)

Licensed under: CC-BY-SA with attribution