Question

I understand that using stacking and ensembling has become popular, and these methods can give better results than using a single algorithm.

My question is: What are the reasons, statistical or otherwise, behind the improvement in results?

I also understand that, at a high level, combining these methods combines the predictions from different algorithms. Each algorithm has its own strengths and weaknesses, but I am not sure how combining them actually improves the results.

Here is a simple balanced example (50% of the labels are actually T and 50% are actually F), where I think stacking or ensembling algorithms will still give the same result as the original model. In this case, for the ties (1 F and 1 T), I decided to select T; the same problem would occur if I selected F instead. This is a little more complicated if I used predict_proba, but I think the same problem would occur.

[Image: example table of the individual models' T/F predictions and the resulting vote]
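To make the tie-breaking concrete, here is a minimal sketch of what I mean (the two sets of predictions are made up purely for illustration):

# Hard majority vote of two hypothetical models, breaking 1-1 ties as T
preds_model_a = ["T", "T", "F", "F"]
preds_model_b = ["T", "F", "T", "F"]

def hard_vote(a, b, tie_break="T"):
    """Return the majority vote; a 1 T / 1 F tie goes to tie_break."""
    return [x if x == y else tie_break for x, y in zip(a, b)]

print(hard_vote(preds_model_a, preds_model_b))
# ['T', 'T', 'T', 'F'] -- every disagreement just mirrors the tie-break rule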


Solution

There are many ways in which ensembling can be done, and each one has a different underlying logic for gaining improvement.
Key variations are:
1. The nature (high bias / high variance) of the models in the ensemble
2. How we put the models to work, i.e. same model type or different model types, parallel or sequential, sampled data or full data, etc.
3. How we combine the individual predictions

Let's see a few of the key approaches -

1. Simple voting-based ensembling

A dataset rarely follows the same pattern across the whole feature space: its pattern will favour one type of model in most regions but a different type of model in others.

An observation from an experiment with multiple models:
Despite their overall scores being identical, the two best models – neural network and nearest neighbour – disagreed a third of the time; that is, they made errors on very different regions of the data. We observed that the more confident of the two methods was right more often than not.

Ref - Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions

What this means: if two models each have 70% accuracy and they disagree on 10% of the data, there is a good chance that the more confident one is right on some of that 10%, and that fraction is the gain from combining them with a soft-voting strategy (see the sketch at the end of this section).

Intuition
- If we use a KNN and a linear model together, KNN will likely be better over most of the space (i.e. away from the regression plane / decision boundary), but for data points near the boundary the linear model will be more confident.

Ref - Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
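As a minimal sketch of that gain (the probabilities below are made up for illustration): soft voting averages the models' predicted probabilities, so on the rows where they disagree the more confident model dominates.

import numpy as np

# Hypothetical P(class = 1) from two models for four samples
p_model_a = np.array([0.9, 0.4, 0.8, 0.3])
p_model_b = np.array([0.6, 0.9, 0.3, 0.2])

p_soft = (p_model_a + p_model_b) / 2       # soft voting: average the probabilities
print(p_soft)                              # -> [0.75 0.65 0.55 0.25]
print((p_soft >= 0.5).astype(int))         # -> [1 1 1 0]
# On sample 2 the models disagree (0.4 vs 0.9); the more confident one
# wins the average, which is exactly the gain described above.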

2. Bagging-based ensembling

A model with very high variance is prone to overfitting. We can turn this challenge to our advantage if we find a way to average out the variance. This is the logic behind bagging-based models.

Intuition - At a very high level, a high-variance model built on different random samples will produce different decision boundaries; averaging them smooths the prediction and reduces the variance (see the sketch after this section).
An intuitive example is Here.
Why not high-bias models - A high-bias model (e.g. a regression line) will not change much from sample to sample, since every sample has roughly the same distribution and the slight differences barely affect such models. So we would end up with almost the same model for every sample, as shown in this example for 3 different models.

[Image: bagging applied to three different model types]
Ref - Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell
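A minimal sketch of this idea with scikit-learn's BaggingClassifier (the moons dataset and parameters are chosen only for illustration): an unpruned decision tree is a high-variance learner, and bagging many of them trained on bootstrap samples averages that variance out.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Noisy two-class data where a single deep tree tends to overfit
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)                 # high-variance learner
bag = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                        n_estimators=200, bootstrap=True,      # bootstrap samples
                        random_state=42)                       # average out the variance

for clf in (tree, bag):
    clf.fit(x_train, y_train)
    print(clf.__class__.__name__, accuracy_score(y_test, clf.predict(x_test)))
# The bagged trees usually score higher because the individual
# overfit boundaries average into a smoother one.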

3. Boosting-based ensembling

The main idea of boosting is to add new models to the ensemble sequentially. In essence, boosting attacks the bias-variance-tradeoff by starting with a weak model (e.g., a decision tree with only a few splits) and sequentially boosts its performance by continuing to build new trees, where each new tree in the sequence tries to fix up where the previous one made the biggest mistakes (i.e., each new tree in the sequence will focus on the training rows where the previous tree had the largest prediction errors).

Ref - Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell

Intuition - We start with a weak model (e.g. a decision-tree stump), which we can think of as a simple line (hyperplane) across the dataset space, splitting it into two parts. We repeat this step, but with additional information, i.e. extra weight on the misclassified records. In the end we do a weighted vote, e.g. more weight for the better models.

Say the first model predicts 57 correct out of 100. The second model is then trained with additional weight on the 43 misclassified records; say it ends up with 55 correct. So the first model gets the higher weight in the vote. This means you keep the 57 records the first model already gets right, and there is a good chance that, because of the added weight on the 43 records, some of them will now be predicted correctly with very high confidence; that is the addition from the ensemble.
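A minimal sketch of this with AdaBoost and decision-tree stumps (dataset and parameters are chosen only for illustration): each new stump is trained with extra weight on the rows the earlier stumps got wrong, and the final prediction is a weighted vote.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0)    # weak learner: a single split
stump.fit(x_train, y_train)
print("single stump :", accuracy_score(y_test, stump.predict(x_test)))

# Each new stump gets extra weight on the rows the previous stumps
# misclassified; the final prediction is a weighted vote of all stumps.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=200, random_state=0)
boost.fit(x_train, y_train)
print("boosted stumps:", accuracy_score(y_test, boost.predict(x_test)))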

4. Meta-learner/Generalized Stacking

In this approach, the predictions of multiple models are used as inputs to a meta-learner, which decides the final prediction using an additional set of data.
So here we are not using a ready-made combination rule such as soft or hard voting, but allowing another model to learn the bias pattern in the initial models' predictions and learn the adjustment, if any.
Ref - developer.ibm.com
This is a very simple explanation of the generalized stacking approach, but stacking has been used extensively in competitions, to an extent that is almost impossible to comprehend and explain.

As done in the approach shown below (Ref):

[Image: a multi-level stacking architecture used in competitions]
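A minimal sketch of generalized stacking with scikit-learn's StackingClassifier (the base models and meta-learner are chosen only for illustration): the base models' out-of-fold predictions become the input features of a logistic-regression meta-learner.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=201)

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()), ('svc', SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5)                                   # out-of-fold predictions feed the meta-learner

stack.fit(x_train, y_train)
print("stacked model:", accuracy_score(y_test, stack.predict(x_test)))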

Your sample data

We have to exploit the models' bias/variance patterns, their confidence in the prediction probabilities, etc. to gain an advantage. We cannot expect an improvement on just any dataset/model combination by simply doing hard voting.
Maybe you can investigate this example:


import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Breast cancer dataset as a feature DataFrame and a target array
dataset = load_breast_cancer(return_X_y=False)
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=201)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

knn_clf = KNeighborsClassifier(n_neighbors=2)
svm_clf = SVC(probability=True)   # probability=True is required for soft voting

# Soft voting averages the two models' predicted class probabilities
voting_clf = VotingClassifier(
    estimators=[('knn', knn_clf), ('svc', svm_clf)], voting='soft')

voting_clf.fit(x_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (knn_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

KNeighborsClassifier 0.9298245614035088
SVC 0.9122807017543859
VotingClassifier 0.956140350877193

OTHER TIPS

This is sometimes called "ensemble learning": several "weak learners" each make a prediction, and these predictions are "combined" by some meta-model. A simplistic approach is to just use the majority vote; you can also use logistic regression as the combiner. You can (and should!) check the performance of the stacked model(s) by predicting on test data. Related ensemble techniques are boosting and bagging.

Simplified example: Assume you want to predict a binary outcome. Say you have two different models which do not perform extremely well, but do perform better than random guessing. Also assume the models are independent. Suppose each of your models makes a correct prediction with probability $0.8$. Now when you have two models, the possible outcomes are:

Correct, Correct: $0.8*0.8=0.64$

Wrong, Wrong: $0.2*0.2=0.04$

Correct, Wrong: $0.8*0.2=0.16$

Wrong, Correct: $0.2*0.8=0.16$

You see that the probability that both models are wrong is "only" $0.04$. So if you manage to identify the remaining cases as "correct predictions" by some intermediate/meta (stacking) model, you end up with quite a good result. Essentially you would "boost" the expected accuracy of the model(s) from $0.8$ (single model) to $0.64+0.16+0.16=0.96$ (ensemble) here.
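A tiny check of that arithmetic, assuming two independent models that are each correct with probability 0.8:

p = 0.8                                   # each model is correct with probability 0.8

both_correct = p * p                      # 0.64
both_wrong = (1 - p) * (1 - p)            # 0.04 (up to floating-point noise)
exactly_one_correct = 2 * p * (1 - p)     # 0.32

print(both_correct, both_wrong, exactly_one_correct)
# If a meta-model could resolve every "one correct, one wrong" case,
# the ceiling would be 0.64 + 0.32 = 0.96 -- the figure quoted above.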

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange