Question

I am confused about how the training data is split, and about which data the level-0 predictions are made on, when using stacked generalization. This question is similar to mine, but the answer is not sufficiently clear:

How predictions of level 1 models become training set of a new model in stacked generalization.

My understanding is that the training set is split, the base models are trained on one split, and predictions are made on the other. These predictions then become the features of a new dataset: one column for each model's predictions, plus a column containing the ground truth for those rows.

  1. Split training data into train/test.
  2. Train base models on training split.
  3. Make predictions on the test split (according to the linked answer, use k-fold CV for this).
  4. Create a feature for each model, filling it with that model's predictions.
  5. Create a feature for the ground truth of those predictions.
  6. Create a new model and train it on these prediction columns, with the ground-truth column as the target (a rough code sketch of this flow follows the list).
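A minimal sketch of those six steps, assuming a scikit-learn style workflow; the synthetic dataset and the particular base/meta models are placeholders, not anything from the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the original training set.
X, y = make_classification(n_samples=1000, random_state=0)

# 1. Split the training data.
X_base, X_meta, y_base, y_meta = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 2. Train the base (level-0) models on one split.
base_models = [
    RandomForestClassifier(random_state=0),
    SVC(probability=True, random_state=0),
]
for model in base_models:
    model.fit(X_base, y_base)

# 3-5. Predict on the other split: each model's predictions become one
#      column of the new dataset; y_meta is the ground truth for those rows.
meta_features = np.column_stack(
    [m.predict_proba(X_meta)[:, 1] for m in base_models]
)

# 6. Train the meta (level-1) model on the prediction columns.
meta_model = LogisticRegression().fit(meta_features, y_meta)
```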

Question 1: Are these the only features used to train the "meta" model? In other words, are none of the actual features of the original data included? The linked answer says it is common to include the original data, but I have not read about it elsewhere.

Question 2: If the above algorithm is correct, what is the form of the data when making predictions? It seems it would also need predictions as independent variables. If so, that means running all new incoming data through all the base models again, right?

Question 3: I keep seeing an "out-of-fold" requirement for the first-level predictions. It seems that a simple train/test split as described above would fulfill this. However, would you not want a third split to test the combined model's generalization? Or is this type of ensemble bulletproof enough not to worry about it?


Solution

Q1. This can be done either way: you may use only the base-model predictions, those plus all of the original features, or anything in between. Passing along the original features is sometimes called "feature-weighted stacking," the idea being that the meta-estimator can learn that some of the base models do better on certain subsets of the original data. (As I understand it, though, the original approach there is to select or engineer more useful subsetting features rather than simply pass along all the original features; which is appropriate presumably depends on context.)

For example, the parameter passthrough in sklearn's StackingClassifier passes the original dataset along, as does use.feat in mlr's makeStackedLearner.
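For instance, a minimal sketch with scikit-learn's StackingClassifier (the data and estimator choices below are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    passthrough=True,  # append the original features to the base-model predictions
    cv=5,              # out-of-fold predictions are used to fit final_estimator
)
stack.fit(X, y)
```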

Q2. Absolutely: making predictions requires each base model to make its predictions (this can be done in parallel), which are then passed to the meta-estimator.
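To make that flow concrete, here is a small helper in the spirit of the earlier sketch; predict_with_stack is a hypothetical name, and it assumes binary classifiers that expose predict_proba:

```python
import numpy as np

def predict_with_stack(base_models, meta_model, X_new):
    """Run new data through every fitted base model, then the meta-model."""
    meta_features = np.column_stack(
        [m.predict_proba(X_new)[:, 1] for m in base_models]
    )
    return meta_model.predict(meta_features)
```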

Q3. Yes, a simple split suffices to prevent the data leak that occurs in more naive stacking, where the meta-estimator is fitted on the base models' "predictions" for their own training set. And yes, you will need another test set to evaluate the performance of the whole ensemble. k-fold cross-validation is often recommended for the base models so that more data is available for all of them to train on.
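A sketch of that recipe, assuming scikit-learn: cross_val_predict yields out-of-fold base-model predictions for fitting the meta-estimator, and a held-out test split (never seen by either level) evaluates the whole ensemble. Data and models are again placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    RandomForestClassifier(random_state=0),
    SVC(probability=True, random_state=0),
]

# Out-of-fold predictions: every training row is predicted by a model
# that did not see it during that fold's fit.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all of the training data for use at prediction time.
for m in base_models:
    m.fit(X_train, y_train)

meta_model = LogisticRegression().fit(oof, y_train)

# Evaluate the combined model on the untouched test split.
test_meta = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in base_models]
)
print("ensemble accuracy:", meta_model.score(test_meta, y_test))
```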
