Using a LinearSVC() for multilabel classification with MultiOutputClassifier() in a pipeline in scikit-learn
02-11-2019
Question
My input data is a (23948,) pandas.Series of strings containing newspaper headlines. My targets are 20 labels per headline (e.g. 'crime', 'politics'), each binary-encoded as 0 or 1. The labels are not mutually exclusive: a headline could be about crime and politics at the same time. I would like to compare three algorithms for this problem. I use the following pipeline to predict the labels:
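from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# tokenize: custom tokenizer function, defined elsewhere (not shown)
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# One grid per algorithm to compare
parameters = [
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [10, 100, 250],
     "clf__max_depth": [8],
     "clf__random_state": [42]},
    {"clf": [LinearSVC()],
     "clf__C": [1.0, 10.0, 100.0, 1000.0],
     "clf__random_state": [42]},
    {"clf": [GaussianNB()]}
]

rkf = RepeatedKFold(
    n_splits=10,
    n_repeats=2,
    random_state=42
)

cv = GridSearchCV(
    pipeline,
    parameters,
    cv=rkf,
    scoring='accuracy',
    n_jobs=-1)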
The pipeline works fine for the random forest, but breaks at the LinearSVC() with the following error:
ValueError: bad input shape (20972, 20)
If I remove the LinearSVC() entry from the grid, it stops at the GaussianNB() with:
A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
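A possible reading of these two errors, offered as a sketch rather than a confirmed diagnosis: GridSearchCV applies each grid entry through set_params, and an entry keyed on "clf" replaces the whole pipeline step. If so, the MultiOutputClassifier wrapper is dropped and the bare estimator receives the two-dimensional label matrix (and, for GaussianNB, the sparse TF-IDF output). RandomForestClassifier would still work because it accepts multi-output targets natively. The toy arrays below are invented for illustration, and the exact error text may vary between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same step layout as in the question, without the custom tokenizer
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# This mirrors what a grid entry {"clf": [LinearSVC()]} does to the pipeline
pipeline.set_params(clf=LinearSVC())
print(type(pipeline.named_steps["clf"]).__name__)  # the wrapper is gone

# Toy data (assumption): 4 samples, 2 features, 2 binary labels per sample
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# A bare LinearSVC expects a 1-d target and rejects the 2-d label matrix
try:
    LinearSVC().fit(X, Y)
    bare_fit_failed = False
except ValueError as exc:
    bare_fit_failed = True
    print("ValueError:", exc)
```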
The scikit-learn documentation on multilabel classification (https://scikit-learn.org/stable/modules/multiclass.html) has a list headed "Support multilabel:", and of my three algorithms only the random forest is included. However, the documentation also states that "Multioutput classification support can be added to any classifier with MultiOutputClassifier."
I am a bit confused: do LinearSVC() and GaussianNB() support multilabel classification when wrapped in MultiOutputClassifier()? If not, is there a workaround?
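One workaround that may be worth trying, sketched under assumptions: keep MultiOutputClassifier as the "clf" step and swap only its inner estimator through the "clf__estimator" key, with that estimator's hyperparameters under "clf__estimator__...". For GaussianNB, which needs dense input, the sketch adds a "dense" step that is a pass-through by default and is switched to a densifying FunctionTransformer only in the GaussianNB grid entry. The headlines, labels, step name "dense", and helper to_dense are all made up for illustration; the custom tokenizer, RepeatedKFold, and n_jobs from the question are left out to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def to_dense(X):
    """Hypothetical helper: turn a sparse matrix into a dense array."""
    return X.toarray() if hasattr(X, "toarray") else X


# Toy data (assumption): 8 headlines, 2 non-exclusive labels (crime, politics)
headlines = [
    "police arrest robbery suspect",
    "parliament passes budget vote",
    "minister charged with fraud",
    "local bakery wins award",
    "burglar caught by police",
    "election campaign begins",
    "senator arrested for bribery",
    "weather sunny this weekend",
]
labels = np.array([
    [1, 0], [0, 1], [1, 1], [0, 0],
    [1, 0], [0, 1], [1, 1], [0, 0],
])

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("dense", "passthrough"),  # no-op unless a grid entry replaces it
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# The wrapper stays in place; only its inner estimator changes per entry
parameters = [
    {"clf__estimator": [RandomForestClassifier(random_state=42)],
     "clf__estimator__n_estimators": [10, 50]},
    {"clf__estimator": [LinearSVC(random_state=42)],
     "clf__estimator__C": [1.0, 10.0]},
    {"dense": [FunctionTransformer(to_dense, accept_sparse=True)],
     "clf__estimator": [GaussianNB()]},
]

search = GridSearchCV(pipeline, parameters, cv=2, scoring="accuracy")
search.fit(headlines, labels)
print(search.best_params_)
```

With multilabel targets, scoring='accuracy' is subset accuracy, so a prediction counts as correct only when every label for that headline matches. Densifying a TF-IDF matrix for roughly 24,000 headlines could use a lot of memory, so the GaussianNB entry may not scale as-is.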
No correct solution