NLP and one-class classifier building

https://datascience.stackexchange.com/questions/64427

19-10-2020

Question

I have a large dataset containing almost 0.5 billion tweets. I'm researching how firms engage in activism, and so far I have labelled tweets as belonging to an activism category based on the presence of certain hashtags in them.

Now suppose firms tweet about an activism topic without including any hashtag. My code won't categorize such tweets, so my idea was to run an SVM classifier with only one class.

This leads to the following questions:

1. Is this solution feasible from a data-science perspective?
2. Do any other one-class classifiers exist?
3. (Most important of all) Are there other ways to determine whether a tweet is similar to the set of tweets containing activism hashtags?

Solution

Yes, this is feasible.
One-class classification exists, but it is usually used where it is hard or impossible to get negative samples. In your case, I would argue, you can quite easily get tweets that are not about activism, so you can treat this as a binary classification problem, because you have data points of two classes or labels: 1 for tweets that belong to your class and 0 for tweets that do not.
There are many ways to build a classifier; SVM is only one of them. You could also use a Naive Bayes algorithm or, as @Kasra mentioned, a neural network model. Whatever you use, you will have to organise your data so that your set contains samples of both classes: activism and non-activism. This means you should randomly pick tweets from your big dataset and manually check whether they relate to activism, even if they lack the hashtags you originally used to identify activism tweets. You also have to think about the features your classifier will use. The simplest might be the bag of words within the tweets, and you might also pre-process the tweets to exclude stop words. Depending on the algorithm, you might find that your classifier relies heavily on the presence of your particular hashtags as features for predicting the class. In that case it might struggle to identify tweets without these hashtags as activism, even when they are. I would experiment with pre-processing the tweets in your entire dataset to remove those hashtags from the tweets.
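The hashtag-removal step suggested above could look roughly like the following. This is only a sketch: the hashtag names in `LABEL_HASHTAGS` and the helper `strip_label_hashtags` are made up for illustration, and the toy tweets stand in for the real data.

```python
import re

# Hypothetical hashtags that were used to label tweets as activism.
LABEL_HASHTAGS = {"climatestrike", "blacklivesmatter"}

def strip_label_hashtags(tweet, hashtags=LABEL_HASHTAGS):
    """Remove the labelling hashtags (case-insensitive); keep all other text."""
    def drop(match):
        # Delete the hashtag only if it is one of the labelling hashtags.
        return "" if match.group(1).lower() in hashtags else match.group(0)
    cleaned = re.sub(r"#(\w+)", drop, tweet)
    # Collapse the whitespace left behind by removed hashtags.
    return " ".join(cleaned.split())

tweets = [
    "We stand with our employees #BlackLivesMatter",
    "Join us on Friday #ClimateStrike #sustainability",
]
cleaned = [strip_label_hashtags(t) for t in tweets]
```

The cleaned tweets, rather than the raw ones, would then be turned into bag-of-words features, so the classifier cannot simply memorise the hashtags that defined the labels.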

Other tips

Yes. One-class SVM is designed for exactly this kind of problem. The question it answers is "how similar is a new sample point (an unlabelled tweet) to my training data (the hashtagged tweets)?"
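As a rough illustration of that idea, here is a minimal sketch using scikit-learn's `OneClassSVM` on TF-IDF features. The toy tweets, the linear kernel and the value `nu=0.1` are all assumptions chosen for the example, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Toy stand-ins for the hashtagged activism tweets (the only class seen in training).
activism_tweets = [
    "we march for climate justice today",
    "stand up for equal rights and justice",
    "our company supports the climate strike",
    "join the protest for equal pay",
]

# nu is an upper bound on the fraction of training tweets treated as outliers;
# 0.1 is an arbitrary illustrative value.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneClassSVM(kernel="linear", nu=0.1),
)
model.fit(activism_tweets)

new_tweets = ["we support the march for justice", "new phone on sale this weekend"]
labels = model.predict(new_tweets)            # +1 = resembles training data, -1 = outlier
scores = model.decision_function(new_tweets)  # higher = more similar to training data
```

The continuous `scores` are often more useful than the hard labels, since they let you rank unlabelled tweets and inspect the borderline ones by hand.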
Whatever the best answer to this question is, I can share my brainstorming. Try to answer "How can I model my data so that activism tweets cluster together and are separated from other tweets?". One way would be to build an activism-specific dictionary and use it to model the dataset with TF-IDF. To do that, you can take some non-activism text (for example, a corpus of text about mathematics) and subtract its vocabulary from your activism vocabulary. What remains should give you a good idea of the activism keywords. Please note that if activism is expressed through the meaning of a tweet rather than through keywords, you will need more sophisticated language models, e.g. BERT. In that case, use your activism tweets as positive examples, create negative examples (e.g. from that maths corpus), and use sequence classification.
I just realized that in my second point I actually answered your third question!
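The vocabulary subtraction can be sketched in a few lines of plain Python. The helper `vocabulary`, the tokenisation regex and the toy texts are assumptions made for the example.

```python
import re

def vocabulary(texts):
    """Return the set of lower-cased word tokens found across all texts."""
    return {tok for t in texts for tok in re.findall(r"[a-z']+", t.lower())}

# Toy stand-ins: hashtagged activism tweets and an unrelated reference corpus.
activism_tweets = ["We march for climate justice", "Equal rights for all workers"]
reference_corpus = ["We prove the theorem for all integers", "A matrix has equal rows"]

# Words that occur in activism tweets but never in the reference corpus.
candidate_keywords = vocabulary(activism_tweets) - vocabulary(reference_corpus)
```

With real data the reference corpus should be large and general enough to absorb everyday words; the resulting set could then be passed as the `vocabulary` argument of scikit-learn's `TfidfVectorizer`.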

Hope it helps!

From OP's comment:

I want to find out whether an unlabelled tweet should be categorized as activism or not, based on the labelled data I already have (the tweets containing activism hashtags)

This could correspond to a semi-supervised learning setting along the lines of:

1. Train a model on a labelled sample of data, e.g. taking tweets with #activism as positive instances and assuming the others are negative for now.
2. Apply the model to the rest of the data (the unlabelled instances).

To improve accuracy, this process can be iterated as follows: take the instances predicted as very likely positive and those predicted as very likely negative as a new training set, and repeat the process until convergence (i.e. until the predictions change very little).

By the way, there are examples of the one-class learning approach (which is different) applied to the problem of authorship verification, which has some similarities to this one.
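A minimal sketch of this iteration, assuming scikit-learn with TF-IDF features and logistic regression. The function name `self_train`, the confidence thresholds `hi` and `lo`, and the toy tweets are hypothetical, and "convergence" is simplified here to "predictions identical to the previous round".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(texts, initial_labels, hi=0.8, lo=0.2, max_iter=10):
    """Retrain on confidently predicted instances until predictions stop changing."""
    X = TfidfVectorizer().fit_transform(texts)
    labels = np.asarray(initial_labels)
    train_idx = np.arange(len(texts))  # first round: train on everything
    prev = None
    for _ in range(max_iter):
        clf = LogisticRegression().fit(X[train_idx], labels[train_idx])
        proba = clf.predict_proba(X)[:, 1]  # probability of the positive class
        pred = (proba >= 0.5).astype(int)
        if prev is not None and np.array_equal(pred, prev):
            break  # predictions unchanged: treat as converged
        prev = pred
        confident = (proba >= hi) | (proba <= lo)
        if len(np.unique(pred[confident])) < 2:
            break  # need confident examples of both classes to retrain
        train_idx = np.flatnonzero(confident)
        labels = pred
    return pred, proba

texts = [
    "we march for justice #activism",
    "stand up for equal rights #activism",
    "we march for equal rights",
    "new phone on sale today",
    "big sale on laptops this weekend",
    "our new store opens today",
]
# Initial assumption: hashtagged tweets are positive, everything else negative.
initial_labels = [1 if "#activism" in t else 0 for t in texts]
pred, proba = self_train(texts, initial_labels)
```

On a dataset this small the classifier is rarely confident enough to pass the thresholds, so the loop stops after the first round; with real data the thresholds would need tuning.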

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange