how to classify text based on more than one column
12-12-2020
Question
I have passages of text to classify by topic. I am using scikit-learn (e.g. LinearSVC), but I am open to other options. Currently, I use only the text of each passage (the column labeled "En" below), but I feel that using the title of each passage (the column labeled "Ref" below) would help.
     Ref              En                                                 Topics
3    Gittin           modifier meaning board referred unique name mo...  dinei-haget
11   Even HaEzer      hand katafres explanation hand slanted obvious...  dinei-haget
67   Rest on Holiday  similar two baskets untithed fruit front first...  laws-of-holidays
118  Beitzah          mishna states one ate food prepared festival e...  laws-of-holidays
131  Sabbath          one may mix water salt oil dip one bread put c...  rabbinically-forbidden-activities-on-shabbat
One excellent idea, which @Erwan mentioned in the comments, is to simply include the title together with the text of the passage. But I have two issues with that:
- How do I do that?
- Don't I want the title (or "Ref") to carry more weight than the other words in the passage?
Solution
In order to make the model take into account the two columns and distinguish whether a word is from the title or the article, you can generate the two sets of features independently and then concatenate the two vectors of features. This way the learning algorithm can assign different weights to a particular word depending on whether it comes from the title or the article.
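As a rough illustration of generating the two feature sets independently, here is a minimal sketch using scikit-learn's `ColumnTransformer`. The choice of `TfidfVectorizer`, the step names `title` and `body`, and the weight values are my own assumptions, and the rows are shortened versions of those shown in the question. The `transformer_weights` argument is one optional way to scale the title features up; the classifier can also learn different weights on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: shortened versions of rows visible in the question's table.
df = pd.DataFrame({
    "Ref": ["Gittin", "Even HaEzer", "Rest on Holiday", "Beitzah"],
    "En": [
        "modifier meaning board referred unique name",
        "hand katafres explanation hand slanted obvious",
        "similar two baskets untithed fruit front first",
        "mishna states one ate food prepared festival",
    ],
    "Topics": ["dinei-haget", "dinei-haget",
               "laws-of-holidays", "laws-of-holidays"],
})

# One vectorizer per column; the two sparse matrices are stacked side by
# side, so "gittin" in the title and "gittin" in the text would become
# two different features.
features = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "Ref"),  # a single column name gives 1-D input
        ("body", TfidfVectorizer(), "En"),
    ],
    # Assumption for illustration: multiply the title block by 2.
    # Remove this argument to let the classifier decide on its own.
    transformer_weights={"title": 2.0, "body": 1.0},
)

model = make_pipeline(features, LinearSVC())
model.fit(df[["Ref", "En"]], df["Topics"])
print(model.predict(df[["Ref", "En"]]))
```

With real data you would fit on a training split and predict on held-out passages; the weight of 2.0 is arbitrary and would need tuning.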
However, there is a disadvantage to doing that: depending on how many instances there are, how large the vocabulary is, etc., increasing the number of features might make the task harder for the model (for instance, by causing overfitting). It's probably worth testing the two options:
- concatenate the text from the two columns then generate the features, i.e. no distinction between the columns but simpler job for the classifier
- generate the features independently then concatenate the two sets of features (as above).
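To test the two options against each other, something like the following sketch could be used. It is only an assumption about how one might set up the comparison: the rows are invented for illustration (they are not the asker's data), the vectorizer is `TfidfVectorizer`, and `cv=2` is chosen only because the toy set is tiny.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy rows (hypothetical, for illustration only).
rows = [
    ("Gittin", "document signed witnesses delivered", "dinei-haget"),
    ("Even HaEzer", "document written scribe name", "dinei-haget"),
    ("Gittin", "witnesses name written document", "dinei-haget"),
    ("Even HaEzer", "scribe delivered document signed", "dinei-haget"),
    ("Beitzah", "food prepared festival cooked", "laws-of-holidays"),
    ("Rest on Holiday", "festival fruit baskets food", "laws-of-holidays"),
    ("Beitzah", "cooked fruit festival eaten", "laws-of-holidays"),
    ("Rest on Holiday", "food eaten festival prepared", "laws-of-holidays"),
]
df = pd.DataFrame(rows, columns=["Ref", "En", "Topics"])

# Option 1: join title and text into one string, then vectorize once.
joined = df["Ref"] + " " + df["En"]
combined_model = make_pipeline(TfidfVectorizer(), LinearSVC())

# Option 2: vectorize each column separately and stack the features.
separate_model = make_pipeline(
    ColumnTransformer([
        ("title", TfidfVectorizer(), "Ref"),
        ("body", TfidfVectorizer(), "En"),
    ]),
    LinearSVC(),
)

# Same labels and same number of folds for both options.
scores_combined = cross_val_score(combined_model, joined, df["Topics"], cv=2)
scores_separate = cross_val_score(
    separate_model, df[["Ref", "En"]], df["Topics"], cv=2
)

print("combined text:    ", scores_combined.mean())
print("separate features:", scores_separate.mean())
```

On a real dataset you would use more folds and possibly a metric better suited to imbalanced topics (for example `scoring="f1_macro"`), then keep whichever option scores better.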