Question

My input data is a pandas.Series of shape (23948,) containing newspaper headlines as strings. My targets are 20 labels per headline (e.g. 'crime', 'politics'), each binary-encoded as 0 or 1. The labels are not mutually exclusive: a headline could be about crime and politics at the same time. I would like to compare three algorithms for this problem. I use the following pipeline and grid search to predict the labels:

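from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# `tokenize` is my own tokenizer function (not shown here)
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])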

parameters = [
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [10, 100, 250],
     "clf__max_depth":[8],
     "clf__random_state":[42]},
    {"clf": [LinearSVC()],
     "clf__C": [1.0, 10.0, 100.0, 1000.0],
     "clf__random_state":[42]},
    {"clf": [GaussianNB()]}
]

rkf = RepeatedKFold(
    n_splits=10,
    n_repeats=2,
    random_state=42
)

cv = GridSearchCV(
    pipeline,
    parameters,
    cv=rkf,
    scoring='accuracy',
    n_jobs=-1)

The pipeline works fine for the random forest, but breaks at the LinearSVC() with the following error:

ValueError: bad input shape (20972, 20)

If I remove the LinearSVC(), it stops at:

A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

There is a list "Support multilabel:" in the scikit-learn documentation on multilabel classification (https://scikit-learn.org/stable/modules/multiclass.html), and of my three classifiers only the random forest is included. However, the documentation also states that "Multioutput classification support can be added to any classifier with MultiOutputClassifier."

I am a bit confused: do LinearSVC() and GaussianNB() support multilabel classification when wrapped in MultiOutputClassifier()? If not, is there a workaround?
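
One possible reading of the errors, offered as a sketch rather than a verified answer: each "clf" entry in the parameter grid replaces the whole 'clf' step, so the MultiOutputClassifier wrapper is dropped and LinearSVC() receives the 2-D label matrix directly. RandomForestClassifier handles 2-D targets natively, which would explain why it alone works. The sketch below keeps the wrapper inside each grid entry and reaches the inner estimator through the `clf__estimator__` prefix. It also adds a 'dense' placeholder step that is only switched on for GaussianNB, since that classifier needs dense input. The toy headlines and labels, the `to_dense` helper, the 'dense' step name, the reduced grid values, and the plain 2-fold KFold are all assumptions made to keep the example small; the default tokenizer stands in for the custom `tokenize`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def to_dense(X):
    """Hypothetical helper: turn a scipy sparse matrix into a dense array."""
    return X.toarray()


# Made-up stand-in data: 8 headlines, 2 non-exclusive labels [crime, politics].
headlines = [
    "police arrest suspect in robbery",
    "senate passes new budget bill",
    "minister charged with fraud",
    "local team wins championship",
    "court sentences burglar to prison",
    "president signs trade agreement",
    "mayor arrested for bribery",
    "new bakery opens downtown",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0],
              [1, 0], [0, 1], [1, 1], [0, 0]])

pipeline = Pipeline([
    ("vect", CountVectorizer()),   # default tokenizer, not the custom `tokenize`
    ("tfidf", TfidfTransformer()),
    ("dense", "passthrough"),      # no-op unless a grid entry replaces it
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

parameters = [
    # Every candidate stays wrapped; inner parameters use clf__estimator__<name>.
    {"clf": [MultiOutputClassifier(RandomForestClassifier(random_state=42))],
     "clf__estimator__n_estimators": [10, 50],
     "clf__estimator__max_depth": [8]},
    {"clf": [MultiOutputClassifier(LinearSVC(random_state=42))],
     "clf__estimator__C": [1.0, 10.0]},
    # GaussianNB rejects sparse input, so densify only for this candidate.
    {"dense": [FunctionTransformer(to_dense, accept_sparse=True)],
     "clf": [MultiOutputClassifier(GaussianNB())]},
]

search = GridSearchCV(
    pipeline,
    parameters,
    cv=KFold(n_splits=2),    # small, unshuffled split for the toy data only
    scoring="accuracy",      # for multilabel targets this is subset accuracy
    error_score="raise",     # surface failures instead of recording NaN scores
)
search.fit(headlines, Y)
print(search.best_params_["clf"], search.best_score_)
```

On the real data the RepeatedKFold splitter and the full grid from the question could be put back in. Densifying a TF-IDF matrix with about 24,000 rows may use a lot of memory, so a naive Bayes variant that accepts sparse input (for example MultinomialNB) might be worth considering in place of GaussianNB.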

Licensed under: CC-BY-SA with attribution