How to include categorical fields to enhance a text classification
13-12-2020
Problem
I have a question about how to add more categorical fields to a text classification problem. My dataset initially had 4 fields (Date, Text, Short_Mex, Username); the Label column shown below was joined in afterwards:
Date       | Text                                          | Short_Mex                              | Username | Label
01/01/2020 | I am waiting for the TRAIN                    | A train is coming                      | Ludo     | 1
01/01/2020 | you need to keep distance                     | Social Distance is mandatory           | wgriws   | 0
...
02/01/2020 | trump declared war against CHINESE technology | China’s technology is out of the games | Fwu32    | 1
...
I joined this dataset with another one containing the labels, which take the values 1 or 0. The labels are needed for classification.
However, I have also extracted other fields from my original dataset, such as the number of characters, the upper-case words, the most frequent terms, and so on. Some of these fields may be useful for classification, since I could give more ‘weight’ to a word written in upper case than to the same word in lower case.
So I would need to use a new dataset with these fields:
Date       | Text                                          | Short_Mex                              | Username | Upper     | Label
01/01/2020 | I am waiting for the TRAIN                    | A train is coming                      | Ludo     | [TRAIN]   | 1
01/01/2020 | you need to keep distance                     | Social Distance is mandatory           | wgriws   | []        | 0
...
02/01/2020 | trump declared war against CHINESE technology | China’s technology is out of the games | Fwu32    | [CHINESE] | 1
...
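For illustration, fields such as Upper and a character count could be derived from Text roughly as follows. This is only a sketch: the regular expression (whole words of two or more capital letters, so that the pronoun "I" is skipped), the helper name extract_upper and the column names n_chars and n_upper are assumptions made for this example.

```python
import re

import pandas as pd

# Toy rows copied from the example table.
df = pd.DataFrame({
    "Text": [
        "I am waiting for the TRAIN",
        "you need to keep distance",
        "trump declared war against CHINESE technology",
    ],
    "Label": [1, 0, 1],
})

# Assumption: an "upper case word" is a whole word made of 2+ capital letters.
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")


def extract_upper(text):
    """Return the list of fully upper-case words found in `text`."""
    return UPPER_WORD.findall(text)


df["Upper"] = df["Text"].apply(extract_upper)   # list of upper-case words
df["n_chars"] = df["Text"].str.len()            # number of characters
df["n_upper"] = df["Upper"].apply(len)          # number of upper-case words

print(df[["Text", "Upper", "n_chars", "n_upper"]])
```

A list-valued column like Upper cannot be fed to a scikit-learn estimator directly; it first has to be turned into counts or indicator columns.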
I would like to ask how to add this information (the upper-case words) as an extra input for my classifier. What I am currently doing is the following:
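# Imports used by the snippet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train-test split; the target is the Label column of df
x_train, x_test, y_train, y_test = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=1)

# Logistic regression on TF-IDF weighted word counts
pipe1 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LogisticRegression())])
model_lr = pipe1.fit(x_train, y_train)
lr_pred = model_lr.predict(x_test)

# Evaluation on the held-out 20%
print("Accuracy of Logistic Regression Classifier: {}%".format(round(accuracy_score(y_test, lr_pred)*100, 2)))
print("\nConfusion Matrix of Logistic Regression Classifier:\n")
print(confusion_matrix(y_test, lr_pred))
print("\nClassification Report of Logistic Regression Classifier:\n")
print(classification_report(y_test, lr_pred))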
Solution
Scikit-learn has compose.ColumnTransformer which, quoting its documentation:
"allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer."
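A demo of mixing numeric and categorical types is in the scikit-learn example gallery ("Column Transformer with Mixed Types"). In your example, the CountVectorizer output computed from Text is numeric, the extra fields (such as Upper or Username) can be treated as categorical inputs, and Label is the categorical target.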
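A minimal sketch of how this could look for the dataset in the question, assuming the data sits in a pandas DataFrame. Several choices here are assumptions rather than part of the question: TfidfVectorizer is swapped in for the CountVectorizer + TfidfTransformer pair (it performs both steps), the Upper lists are joined into a space-separated string column named Upper_str so that a second CountVectorizer can count them, and Username is one-hot encoded as an ordinary categorical field.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows copied from the question; a real run would use the full DataFrame.
df = pd.DataFrame({
    "Text": [
        "I am waiting for the TRAIN",
        "you need to keep distance",
        "trump declared war against CHINESE technology",
    ],
    "Username": ["Ludo", "wgriws", "Fwu32"],
    "Upper": [["TRAIN"], [], ["CHINESE"]],
    "Label": [1, 0, 1],
})

# Assumption: represent the list of upper-case words as one string per row.
df["Upper_str"] = df["Upper"].apply(" ".join)

feature_cols = ["Text", "Upper_str", "Username"]

preprocess = ColumnTransformer([
    # Text vectorizers expect a 1-D column, so the column is given as a string.
    ("text", TfidfVectorizer(), "Text"),
    # lowercase=False keeps "TRAIN" distinct from "train".
    ("upper", CountVectorizer(lowercase=False), "Upper_str"),
    # OneHotEncoder expects 2-D input, so the column is given as a list.
    ("user", OneHotEncoder(handle_unknown="ignore"), ["Username"]),
])

clf = Pipeline([
    ("features", preprocess),
    ("model", LogisticRegression()),
])

# Fitted on all rows only because the toy data is tiny.
clf.fit(df[feature_cols], df["Label"])
pred = clf.predict(df[feature_cols])
print(pred)
```

With real data, the DataFrame df[feature_cols] (rather than only df['Text']) would be passed to train_test_split, and the fitted pipeline would be evaluated on the held-out part exactly as in the question. Whether the extra columns actually improve accuracy has to be checked empirically.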