Question

I would like to perform a multiclass-multioutput classification task on vectorized textual data. I started with a random forest classifier in a multioutput strategy:

    forest = RandomForestClassifier(random_state=1)
    multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
    multi_target_forest.fit(X_train, y_train)
    y_pred_test = multi_target_forest.predict(X_test)

When looking at the feature importances of the individual estimators (multi_target_forest.estimators_), I noticed that some features in my dataset are very relevant and useful for one task, but disruptive for another. Example:

Task 1: classify documents by Date (q1, q2, q3, q4)
Task 2: classify documents by Version (preliminary, final, amendment)

For task 1, features related to dates, such as 'April', are very useful. However, for the second task the feature 'April' also gets a high importance, which is a consequence of overfitting to a small dataset. Knowing this, I would like to actively remove such features.
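To make the diagnosis concrete, here is a minimal sketch of how the per-task importances can be inspected. The data, feature names, and label columns below are all made up for illustration; only the MultiOutputClassifier/estimators_ usage mirrors the setup above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(1)
X_train = rng.rand(40, 3)                  # stand-in for vectorized text
feature_names = ["april", "final", "q3"]   # hypothetical vocabulary
y_train = np.column_stack([
    rng.randint(0, 4, 40),   # task 1: Date (q1..q4)
    rng.randint(0, 3, 40),   # task 2: Version
])

forest = RandomForestClassifier(random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
multi_target_forest.fit(X_train, y_train)

# estimators_ holds one independently fitted forest per output column,
# so each task's feature importances can be ranked separately.
for task, est in zip(["Date", "Version"], multi_target_forest.estimators_):
    ranked = sorted(zip(feature_names, est.feature_importances_),
                    key=lambda t: -t[1])
    print(task, ranked)
```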

Is there a way to control which features are used for every task?

I could just explicitly train separate classifiers for every task, but is that equivalent to multioutput-multiclass? Or is there some joint probability calculation going on that I would be missing?

Thank you!


Solution

I don't think it's possible.

First, let's look at the difference between explicitly modeling separate trees for the different tasks and modeling them jointly.

Let's suppose we have 2 tasks with n classes each. In the latter (joint) case, to model the correlations one must create new combined classes drawn from the Cartesian product of the two tasks' class sets (up to n × n combinations). Now, if it is true that one feature (say feature A) is beneficial for task 1 but not for task 2, how would one decide whether to use feature A when splitting toward these combined classes? Task 1 wants the feature but task 2 doesn't, so this creates a conflict that prevents us from biasing the tree model against using feature A for the classification.
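As a sketch of what such joint modeling would look like, the two targets can be collapsed into one combined label per document (the class names and random data below are illustrative; for the example tasks this gives at most 4 × 3 = 12 joint classes). A single classifier then sees both tasks at once, which is exactly why it cannot use feature A for one task while ignoring it for the other:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y_date = rng.choice(["q1", "q2", "q3", "q4"], 60)
y_version = rng.choice(["preliminary", "final", "amendment"], 60)

# Combined class such as "q2|final": one label from the Cartesian
# product of the two tasks' class sets.
y_joint = np.array([f"{d}|{v}" for d, v in zip(y_date, y_version)])

# One model, one shared feature space for both tasks.
clf = RandomForestClassifier(random_state=0).fit(X, y_joint)

# Predictions can be split back into the two original tasks.
pred_date, pred_version = zip(*(p.split("|") for p in clf.predict(X)))
```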

So, if you are certain that a particular feature is not beneficial for a particular task, then the simpler and more effective way is to model the tasks separately, as you mentioned.
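A minimal sketch of that separate-models approach, with a different feature subset per task (the column indices and data are illustrative). Note that scikit-learn's MultiOutputClassifier itself fits one independent estimator per target, so splitting the tasks up loses no joint probability modeling; the only thing gained here is control over which columns each task sees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
X = rng.rand(50, 4)
y_date = rng.randint(0, 4, 50)      # task 1: Date
y_version = rng.randint(0, 3, 50)   # task 2: Version

date_features = [0, 1, 2, 3]   # the Date task may use every feature
version_features = [1, 2, 3]   # drop column 0 (e.g. the 'April' feature)

date_clf = RandomForestClassifier(random_state=1)
date_clf.fit(X[:, date_features], y_date)

version_clf = RandomForestClassifier(random_state=1)
version_clf.fit(X[:, version_features], y_version)

pred_date = date_clf.predict(X[:, date_features])
pred_version = version_clf.predict(X[:, version_features])
```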

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange