Order of features for model tuning vs. model fitting

https://datascience.stackexchange.com/questions/67834

08-12-2020

Question

Assuming that the same columns (i.e., features) are used for hyperparameter tuning and model fitting, and that ensemble models (e.g., random forest or XGBoost) are used, should the order of columns used during hyperparameter tuning be identical to the order of columns used when fitting the model with the best hyperparameters?

I am using sklearn's make_column_transformer function in my CV pipeline for hyperparameter tuning. Unfortunately, this function changes the order of the provided columns when the remainder argument is set to 'passthrough'. Should I ensure that the same column order is preserved when fitting the model, or does the order not matter as long as I am using the same features?

Solution

Ah, I was too quick, and misinterpreted your question! At the bottom of this post I'll leave my old answer, answering why the test set needs to have the same column order.

As for the column order of the data in hyperparameter selection vs. training a final model: no, I suppose there's no real reason these need to be the same. In a tree model with column subsampling, you're right (in your comment) that the columns will be selected randomly anyway, so the original order doesn't matter. Even if you don't use column subsampling, and even for other models, a model generally won't treat the column order as informative; if anything, it's used as a fallback tiebreaker, e.g. when two candidate splits look equally good. (Time series are an obvious exception, but in that case maybe the data isn't tabular in the same way.)

That said, it's still perhaps best practice to use the same pipeline, so that the column order will be the same anyway. sklearn's hyperparameter tuners make this easy: with the default refit=True, they simply refit the whole model pipeline on the full training data using the best hyperparameters found.
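As a rough sketch of that workflow, the snippet below tunes and refits one pipeline in a single call. The toy data, the column names (age, city, income) and the tiny parameter grid are made up for illustration; only the sklearn calls themselves are real.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column names are assumptions for this sketch.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "city": rng.choice(["a", "b", "c"], size=n),
    "income": rng.normal(50.0, 10.0, size=n),
})
y = (X["income"] + rng.normal(0.0, 5.0, size=n) > 50.0).astype(int)

# Encode the categorical column and pass the remaining columns through.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, RandomForestClassifier(random_state=0))

# refit=True is the default: after the search, the whole pipeline is
# refitted on all of X, y with the best hyperparameters found.
search = GridSearchCV(
    pipeline,
    param_grid={"randomforestclassifier__max_depth": [2, 4]},
    cv=3,
)
search.fit(X, y)

# The refitted pipeline carries its own preprocessing, so tuning and the
# final fit see the columns in the same order by construction.
final_model = search.best_estimator_
predictions = final_model.predict(X)
```

Because the column transformer lives inside the pipeline, there is no separate "final fit" step where the column order could drift.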

Old answer, on why the test set needs the same column order:

Since sklearn operates on numpy arrays and not pandas dataframes (one of the first things most sklearn steps do is convert their input to arrays), you need to make sure the test data's columns arrive in the same order as the training data's. Otherwise the model will treat the values of one feature as if they belonged to another! Hopefully this will break things outright (a wrong feature type, e.g.), but it may instead silently produce very bad predictions!
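A small sketch of the silent failure, assuming plain numpy arrays and a synthetic dataset in which only the first column carries signal (the names signal and noise are invented for this example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on the first column.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
noise = rng.normal(size=200)
y = (signal > 0).astype(int)

X_train = np.column_stack([signal, noise])
model = DecisionTreeClassifier(random_state=0).fit(X_train, y)

# Same rows, but with the two columns swapped at prediction time.
X_swapped = X_train[:, ::-1]

acc_same_order = model.score(X_train, y)
acc_swapped = model.score(X_swapped, y)
print(f"same order: {acc_same_order:.2f}, swapped: {acc_swapped:.2f}")
```

Nothing raises here: both columns are floats, so the swapped input is accepted and the model simply reads noise where it expects signal.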

This shouldn't be hard if you use pipelines. The make_column_transformer step (and every other step) will be applied to the test data in the same way as to your training data, so the array coming out of these steps will have its columns in the right order. (Alas, if you want to dig into the results, attaching names to the columns after the preprocessing parts of the pipeline may be a hassle.)

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange