
My input data is a (23948,) pandas.Series of strings containing newspaper headlines. My targets are 20 labels per headline (e.g. 'crime', 'politics'), each binary-encoded as 0 or 1. The labels are not mutually exclusive: a headline could be about crime and politics at the same time. I would like to compare three algorithms for this problem. I use the following pipeline to predict the labels:

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

parameters = [
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [10, 100, 250]},
    {"clf": [LinearSVC()],
     "clf__C": [1.0, 10.0, 100.0, 1000.0]},
    {"clf": [GaussianNB()]}
]

rkf = RepeatedKFold(

cv = GridSearchCV(

The pipeline works fine for the random forest, but breaks at the LinearSVC() with the following error:

ValueError: bad input shape (20972, 20)

If I remove the LinearSVC(), it stops at:

A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

The scikit-learn documentation on multilabel classification has a list headed "Support multilabel:", and only the random forest is included in it. However, the documentation also states that "Multioutput classification support can be added to any classifier with MultiOutputClassifier."

I am a bit confused: do LinearSVC() and GaussianNB() support multilabel classification when wrapped in MultiOutputClassifier()? If not, is there a workaround?
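
For reference, here is a minimal sketch of one possible workaround, not a confirmed fix. It assumes the errors arise because each grid entry replaces the whole "clf" step with a bare estimator, so LinearSVC() and GaussianNB() are never actually wrapped in MultiOutputClassifier. In the sketch every candidate is wrapped explicitly, the hyperparameters are addressed through clf__estimator__..., and an extra "dense" step converts the sparse TF-IDF matrix only for GaussianNB. The toy headlines, the two labels, the to_dense helper, the "dense" step name, the reduced parameter values and cv=2 are all assumptions made for illustration; the custom tokenizer and RepeatedKFold are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

# Hypothetical toy data: 8 headlines, 2 non-exclusive labels [crime, politics].
headlines = [
    "police arrest suspect in bank robbery",
    "senate votes on new budget bill",
    "mayor charged with fraud after election",
    "local team wins championship game",
    "detectives investigate downtown burglary",
    "president signs trade agreement",
    "minister arrested in bribery scandal",
    "weather forecast predicts sunny weekend",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0],
              [1, 0], [0, 1], [1, 1], [0, 0]])


def to_dense(X):
    """Convert a scipy sparse matrix to a dense array (needed by GaussianNB)."""
    return X.toarray()


pipeline = Pipeline([
    ("vect", CountVectorizer()),   # custom tokenizer omitted in this sketch
    ("tfidf", TfidfTransformer()),
    ("dense", "passthrough"),      # only switched on for GaussianNB below
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# Each candidate is wrapped, so the inner estimator's parameters are reached
# through the wrapper's `estimator` attribute: clf__estimator__<param>.
parameters = [
    {"dense": ["passthrough"],
     "clf": [MultiOutputClassifier(RandomForestClassifier(random_state=0))],
     "clf__estimator__n_estimators": [10, 50]},
    {"dense": ["passthrough"],
     "clf": [MultiOutputClassifier(LinearSVC())],
     "clf__estimator__C": [1.0, 10.0]},
    {"dense": [FunctionTransformer(to_dense, accept_sparse=True)],
     "clf": [MultiOutputClassifier(GaussianNB())]},
]

search = GridSearchCV(pipeline, parameters, cv=2)
search.fit(headlines, Y)
predictions = search.predict(headlines)  # one column per label
```

Densifying the full TF-IDF matrix for 23948 headlines may need a lot of memory, which is why the sketch keeps the "dense" step as "passthrough" for the two estimators that accept sparse input.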

