How to include categorical fields to enhance a text classification

https://datascience.stackexchange.com/questions/80994

13-12-2020

Question

I have a question about how to add more categorical fields to a text classification problem. My dataset initially had 4 fields:

Date         Text                                            Short_Mex                                Username   Label
01/01/2020   I am waiting for the TRAIN                      A train is coming                        Ludo       1
01/01/2020   you need to keep distance                       Social Distance is mandatory             wgriws     0
...
02/01/2020   trump declared war against CHINESE technology   China’s technology is out of the games   Fwu32      1

...

I joined this dataset to a new one containing labels, with values 1 or 0. The labels are needed for classification.

However, I have also extracted other fields from my original dataset, such as the number of characters, upper-case words, the most frequent terms, and so on. Some of these fields may be useful for classification, since I could assign more ‘weight’ to a word written in upper case than to one in lower case.

So I would need to use a new dataset with these fields:

Date         Text                                            Short_Mex                                Username   Upper       Label
01/01/2020   I am waiting for the TRAIN                      A train is coming                        Ludo       [TRAIN]     1
01/01/2020   you need to keep distance                       Social Distance is mandatory             wgriws     []          0
...
02/01/2020   trump declared war against CHINESE technology   China’s technology is out of the games   Fwu32      [CHINESE]   1
...

I would like to ask how to pass this information (the upper-case words) to my classifier as an additional feature. This is what I am currently doing:

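# Imports needed by the snippet
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Train-test split (df is the dataset shown above; the target is assumed to be
# its Label column, since `news.target` is not defined anywhere in this snippet)
x_train, x_test, y_train, y_test = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=1)

# Logistic regression classification on TF-IDF features of the text
pipe1 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LogisticRegression())])

model_lr = pipe1.fit(x_train, y_train)

lr_pred = model_lr.predict(x_test)

# Report accuracy, confusion matrix and per-class metrics on the test set
print("Accuracy of Logistic Regression Classifier: {}%".format(round(accuracy_score(y_test, lr_pred)*100,2)))
print("\nConfusion Matrix of Logistic Regression Classifier:\n")
print(confusion_matrix(y_test, lr_pred))
print("\nClassification Report of Logistic Regression Classifier:\n")
print(classification_report(y_test, lr_pred))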

Solution

Scikit-learn has compose.ColumnTransformer which

allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

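The scikit-learn documentation includes a demo of mixing numeric and categorical types ("Column Transformer with Mixed Types"). In your example, the output of CountVectorizer on Text is numeric, while an extra field derived from Upper can be treated as categorical; Label is the target and stays outside the transformer.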
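As a rough sketch of how ColumnTransformer might be applied to the question's data, the snippet below combines TF-IDF features from Text with two features derived from Upper. The toy rows, the derived column names n_upper and has_upper, and the choice of OneHotEncoder for the flag are assumptions made for illustration, not part of the original question or answer. TfidfVectorizer is used as shorthand for CountVectorizer followed by TfidfTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data mirroring the question's columns (rows are illustrative only).
df = pd.DataFrame({
    "Text": [
        "I am waiting for the TRAIN",
        "you need to keep distance",
        "trump declared war against CHINESE technology",
        "the weather is nice today",
    ],
    "Upper": [["TRAIN"], [], ["CHINESE"], []],
    "Label": [1, 0, 1, 0],
})

# Assumption: reduce the list-valued Upper column to two simple features.
# n_upper and has_upper are hypothetical names chosen for this sketch.
df["n_upper"] = df["Upper"].apply(len)
df["has_upper"] = (df["n_upper"] > 0).map({True: "yes", False: "no"})

preprocess = ColumnTransformer([
    # A plain string selects a 1-D column, which is what text vectorizers expect.
    ("text", TfidfVectorizer(), "Text"),
    # A list selects a 2-D frame, which is what OneHotEncoder expects.
    ("upper_cat", OneHotEncoder(handle_unknown="ignore"), ["has_upper"]),
    # The numeric count is passed through unchanged.
    ("upper_num", "passthrough", ["n_upper"]),
])

clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

feature_cols = ["Text", "has_upper", "n_upper"]
clf.fit(df[feature_cols], df["Label"])
pred = clf.predict(df[feature_cols])
```

Because the vectorizer lowercases text by default, the upper-case signal only reaches the model through the extra columns, which is the reason for feeding them in separately. On real data the pipeline would be fitted on the training split only and evaluated on the test split, as in the question's code.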

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange