How to classify text based on more than one column

https://datascience.stackexchange.com/questions/76099

12-12-2020

Question

I have passages of text to classify by topic. I am using scikit-learn (e.g. LinearSVC), but I am open to other options. Currently, I use only the text of each passage (the column labeled "En" below), but I feel that using the title of each passage (the column labeled "Ref" below) would help.

    Ref               En                                                Topics
3   Gittin            modifier meaning board referred unique name mo... dinei-haget
11  Even HaEzer       hand katafres explanation hand slanted obvious... dinei-haget
67  Rest on Holiday   similar two baskets untithed fruit front first... laws-of-holidays
118 Beitzah           mishna states one ate food prepared festival e... laws-of-holidays
131 Sabbath           one may mix water salt oil dip one bread put c... rabbinically-forbidden-activities-on-shabbat

One excellent idea, which @Erwan mentioned in the comments, is to simply include the title together with the text of the passage. But I have two issues with that:

How do I do that?
Don't I want the title (the "Ref" column) to carry more weight than the other words in the passage?

Solution

In order to make the model take into account the two columns and distinguish whether a word is from the title or the article, you can generate the two sets of features independently and then concatenate the two vectors of features. This way the learning algorithm can assign different weights to a particular word depending on whether it comes from the title or the article.
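A minimal sketch of this idea with scikit-learn follows. The answer does not name specific classes, so `ColumnTransformer` and `TfidfVectorizer` are one possible choice rather than the only one. The data frame is a toy stand-in built from the visible part of the rows in the question, and the step names "title" and "body" are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the question's data (passages are truncated there, so these are partial).
df = pd.DataFrame({
    "Ref": ["Gittin", "Even HaEzer", "Rest on Holiday", "Beitzah"],
    "En": [
        "modifier meaning board referred unique name",
        "hand katafres explanation hand slanted obvious",
        "similar two baskets untithed fruit front first",
        "mishna states one ate food prepared festival",
    ],
    "Topics": ["dinei-haget", "dinei-haget",
               "laws-of-holidays", "laws-of-holidays"],
})

# One vectorizer per column. Passing the column name as a plain string
# (not a list) hands each vectorizer a 1-D series of documents.
features = ColumnTransformer([
    ("title", TfidfVectorizer(), "Ref"),
    ("body", TfidfVectorizer(), "En"),
])

# The two feature blocks are stacked side by side, so the same word gets
# a separate feature (and a separate learned weight) per column.
model = make_pipeline(features, LinearSVC())
model.fit(df[["Ref", "En"]], df["Topics"])

preds = model.predict(df[["Ref", "En"]])
X = features.transform(df[["Ref", "En"]])
print(X.shape)  # (rows, title vocabulary size + body vocabulary size)
```

If the title should be emphasized by hand rather than left to the classifier, `ColumnTransformer` also accepts a `transformer_weights` dictionary keyed by step name. Whether that helps on this data would need to be tested.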

However, this approach has a disadvantage: depending on the number of instances and the size of the vocabulary, increasing the number of features might make the task harder for the model (for instance, by causing overfitting). It's probably worth testing both options:

Concatenate the text from the two columns, then generate the features: no distinction between the columns, but a simpler job for the classifier.
Generate the features independently, then concatenate the two sets of features (as above).
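One way to test the two options is to cross-validate both on the same data, sketched below. The rows are invented toy data (only the column names follow the question), and the choice of `TfidfVectorizer`, `LinearSVC` and two folds is an assumption made for illustration. On real data, more folds and a metric suited to the class balance would be more informative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy rows; replace with the real data frame.
df = pd.DataFrame({
    "Ref": ["Gittin", "Even HaEzer", "Gittin", "Even HaEzer",
            "Beitzah", "Rest on Holiday", "Beitzah", "Rest on Holiday"],
    "En": [
        "writ divorce document written scribe",
        "husband gives wife writ divorce",
        "witnesses sign divorce document",
        "divorce writ delivered by agent",
        "food prepared festival day eaten",
        "untithed fruit baskets festival",
        "egg laid festival may eaten",
        "cooking food permitted festival day",
    ],
    "Topics": ["dinei-haget"] * 4 + ["laws-of-holidays"] * 4,
})
y = df["Topics"]

# Option 1: merge title and passage into one string, then one feature set.
merged_text = df["Ref"] + " " + df["En"]
merged_model = make_pipeline(TfidfVectorizer(), LinearSVC())
merged_scores = cross_val_score(merged_model, merged_text, y, cv=2)

# Option 2: one feature set per column, concatenated side by side.
separate_model = make_pipeline(
    ColumnTransformer([
        ("title", TfidfVectorizer(), "Ref"),
        ("body", TfidfVectorizer(), "En"),
    ]),
    LinearSVC(),
)
separate_scores = cross_val_score(separate_model, df[["Ref", "En"]], y, cv=2)

print("merged columns:  ", merged_scores.mean())
print("separate columns:", separate_scores.mean())
```

Because the vectorizers sit inside the pipelines, each fold learns its vocabulary from the training part only, which keeps the comparison fair.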

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange