Question

I would like to ask how to work with train and test datasets. I have unlabelled data: short texts (max 100 characters) whose sentiment I need to determine. To do this, I am manually assigning labels (1, 0, -1). However, I have more than 2000 texts and would like to find a way to label them automatically after labelling a small set by hand. My idea was to split the dataset into train and test sets from the start and use the training set to label the data. Unfortunately, I have not understood how to assign labels to the remaining texts, i.e. how to predict the sentiment of the data in the test set.

Could you please tell me what the next steps would be and, if you know of anything useful for a better understanding, suggest an example to follow? Many thanks.

Solution

You want to manually label some cases and then extend that "manual labeling" to the rest of the data.

This is a supervised learning exercise, with the labels supplied by your own prior manual labeling.

Let's suppose you have set aside a random, suitably sized subset of the data and labeled it by hand. Now you need to train a classification model on it via the classical modeling pipeline and use this model to predict the label for the rest of the data.
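
The random partition mentioned above could be sketched as follows. This is only a minimal illustration using the standard library; the function name `partition_for_labelling`, the placeholder texts and the choice of 300 texts to label are all assumptions made for the example, not values from the question.

```python
import random


def partition_for_labelling(texts, n_to_label, seed=0):
    """Randomly split texts into a subset to label by hand and the remainder."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    to_label = [texts[i] for i in indices[:n_to_label]]
    target = [texts[i] for i in indices[n_to_label:]]
    return to_label, target


# Hypothetical stand-in for the ~2000 short texts.
texts = [f"example text {i}" for i in range(2000)]

to_label, target = partition_for_labelling(texts, n_to_label=300)
print(len(to_label), len(target))  # 300 1700
```

Drawing the subset at random matters: if you only label the first few hundred rows, the labeled set may not look like the rest of the data.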

So yes, this is possible. However, building a text classification model is non-trivial, and you need to understand the basics of modeling.

Here are the basic steps; please read up on any step you do not know how to do:

  1. Split your data into a training set (you will fit the model on this data), a test set and a target set (the remaining texts you want labeled automatically).

  2. Manually label your training and test set.

  3. Choose the kind of classification algorithm you want to use. You can use classical ML models, but this involves tokenizing your texts and transforming them into numerical features. You could also use more advanced deep learning approaches for text classification, such as fine-tuning a BERT model.

  4. Create a suitable transformer to tidy and transform your data into the right format for your chosen algorithm.

  5. Train the model on your manually labeled training data.

  6. Evaluate the model on the test set and tune it until the performance is acceptable.

  7. Use the final model to predict labels in your target set.
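
The steps above could look roughly like this with a classical model. This is a sketch, not the only way to do it: it assumes scikit-learn is installed, swaps in TF-IDF features with logistic regression as one possible choice for steps 3 and 4, and uses a tiny made-up dataset (`labelled_texts`, `labels`, `target_texts` are hypothetical) purely so the code runs end to end. With so few examples the scores it prints mean nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Step 2: hypothetical hand-labelled texts (1 = positive, 0 = neutral, -1 = negative).
labelled_texts = [
    "I love this product", "Great service, very happy",
    "Absolutely wonderful experience", "Best purchase I have made",
    "The parcel arrived on Tuesday", "It comes in a blue box",
    "The shop opens at nine", "This is the second version",
    "I hate this product", "Terrible service, very unhappy",
    "Absolutely awful experience", "Worst purchase I have made",
]
labels = [1] * 4 + [0] * 4 + [-1] * 4

# The target set: texts that were never labelled by hand.
target_texts = ["very happy with this", "awful, I hate it", "it opens on Tuesday"]

# Step 1: hold out part of the labelled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    labelled_texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 3-4: the vectorizer turns text into numbers, the classifier learns from them.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# Step 5: train on the manually labelled training data.
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set.
print(classification_report(y_test, model.predict(X_test), zero_division=0))

# Step 7: refit on all labelled data, then predict labels for the target set.
model.fit(labelled_texts, labels)
predicted = model.predict(target_texts)
for text, label in zip(target_texts, predicted):
    print(label, text)
```

Keeping the vectorizer and the classifier in one pipeline means the same text transformation is applied when fitting and when predicting, which avoids a common source of mistakes.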

Understand that the quality of the automatic labeling will only be as good as your manual labels.

OTHER TIPS

What you describe can also be approached as unsupervised sentiment analysis, using a lexicon-based tool instead of a model trained on your own labels. You can try:

  1. VADER: It gives the polarity of each sentence, which you can use to tag your training data. This library has limitations, though: it can't detect sarcasm, and its accuracy is sometimes not that great. It is still a reasonable starting point for a first look.
  2. TextBlob: A library built on top of NLTK that can be used for sentiment analysis (opinion mining). It can do much more than just sentiment analysis.
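
As an illustration of the VADER suggestion, here is a hedged sketch that maps VADER's compound score onto the 1/0/-1 labels from the question. It assumes the `vaderSentiment` package is installed; the helper names `compound_to_label` and `label_with_vader` are made up for this example, and the 0.05 cut-off is the threshold commonly suggested in VADER's own documentation, which you may want to adjust for your data.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 1 (positive), 0 (neutral) or -1 (negative)."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0


def label_with_vader(texts, threshold=0.05):
    """Label each text with VADER; returns a list of 1 / 0 / -1."""
    # Imported here so compound_to_label can be used without the package installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [
        compound_to_label(analyzer.polarity_scores(text)["compound"], threshold)
        for text in texts
    ]
```

Calling `label_with_vader` on a list of your texts returns one label per text. Comparing those labels with a sample you labeled by hand is a quick way to see whether the lexicon is accurate enough for your data.
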
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange