Question

I'm trying to use binary relevance for multi-label text classification. Here is the data I have:

  • a training set of 6000 short texts (around 500-800 words each), each with some labels attached (around 4-6 per text). There are almost 500 distinct labels in the entire set.
  • a test set with 6000 shorter texts (around 100-200 words each).

The texts in the two sets differ in length because they come from different sources.

So, I want to use binary relevance to find the labels of the texts in the test set. To do this, I created a dictionary of all the distinct words in the training set, then removed stop words, words that appear only once, and words that appear in more than 10% of the texts. That left 14,714 distinct words in my dictionary.
My idea was to create a matrix where each row represents a document, each column a word, and each value the number of occurrences of that word in that document. But with 14,714 words and 6000 documents, that is a matrix of roughly 88 million integers! I tried to create it, just to see, and my laptop couldn't handle it. :)
I didn't even get as far as creating my Y matrix and fitting a model (I wanted to use logistic regression) for a single label...
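
For concreteness, here is a sketch of the kind of matrix I mean (assuming scikit-learn; `load_texts()` is just a placeholder for reading my 6000 training documents). The vectorizer itself returns a sparse matrix; it is the dense version that my laptop cannot handle:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = load_texts()  # placeholder: my 6000 training documents (strings)

vectorizer = CountVectorizer(
    stop_words="english",
    min_df=2,     # drop words that appear only once
    max_df=0.10,  # drop words that appear in more than 10% of the texts
)
X = vectorizer.fit_transform(texts)  # sparse count matrix, 6000 x ~14714

# Materializing it densely is the step that exhausts memory:
# X_dense = X.toarray()  # roughly 88 million integers
```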

So, my questions are:

  • Is this a good way to do multi-label classification, or is there a better method?
  • Is it a problem to have a training set from one source and to use the resulting model to predict data from another source? Is the difference in document length a problem?
  • Have you used logistic regression for this kind of problem?

Thank you!

Edit: I also want to add that the most frequent words in my dictionary (after the cleaning step) are common words that are totally useless in my field of research (biology): used, much, two, use, possible, example, ... How can I get around this?

Solution

  1. Just some views/suggestions. After removing the stop words, did you stem or lemmatize the text? That would probably reduce the number of unique words in your corpus and bring different forms of a word down to the same level. But be cautious with stemming, as it sometimes creates noise.
  2. Try POS tagging and see which tags are worth keeping; eliminate the ones that you feel add little relevance to the text.
  3. Did you try finding the most important terms in each document using tf-idf or the chi-square method? They might be helpful for seeing the relevance of terms to each class/document.
  4. See how far you can reduce the dimensions of the matrix using the above (if you haven't already), and then apply logistic regression or whatever classifier you want. A sketch covering these points follows this list.
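
To make this concrete, here is a minimal sketch of one way to wire the points above into a binary-relevance pipeline, assuming recent scikit-learn and NLTK (neither is mentioned in the question, so treat this as an assumption). `train_texts`, `test_texts`, and `label_lists` are placeholder names for your data; here the chi-square scores are computed against the full label matrix.

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# One-time NLTK data downloads:
# import nltk
# nltk.download("punkt"); nltk.download("wordnet")
# nltk.download("averaged_perceptron_tagger")

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Points 1 and 2: lemmatize, and keep only nouns, verbs, adjectives."""
    tagged = pos_tag(word_tokenize(text.lower()))
    return " ".join(
        lemmatizer.lemmatize(tok)
        for tok, tag in tagged
        if tag.startswith(("NN", "VB", "JJ"))
    )

# train_texts, test_texts, label_lists are placeholders for your data.
train_docs = [preprocess(t) for t in train_texts]
test_docs = [preprocess(t) for t in test_texts]

# Binary indicator matrix Y (6000 x n_labels) from per-document label lists.
Y = MultiLabelBinarizer().fit_transform(label_lists)

# Point 3: tf-idf downweights terms that occur everywhere, and the
# chi-square filter keeps the terms most associated with the labels.
# Point 4: OneVsRestClassifier is binary relevance -- one logistic
# regression per label, all trained on the same sparse feature matrix.
model = make_pipeline(
    TfidfVectorizer(min_df=2, max_df=0.10, stop_words="english"),
    SelectKBest(chi2, k=2000),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_docs, Y)
predicted = model.predict(test_docs)  # 0/1 matrix, one column per label
```

Everything downstream of the vectorizer stays sparse, so the full 6000 x 14,714 count matrix from the question never has to be materialized in dense form; that alone should take care of the memory problem.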

On your second question.

  1. I don't think having data from different sources should be a problem, since you are building the model on keywords; hopefully it will transfer. I am not confident about this part, though.

I have worked on text classification before, with bigger documents and a larger corpus, and got not-so-fruitful results with various models.

Hope this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange