Question

I have a dataset that contains the processing conditions (42 features) and one property (the target) of a class of material. To compare the performance of different machine learning models, I tested five models, training each with different numbers of features: linear regression (LR), Bayesian ridge (BR), k-nearest neighbors (NN), random forest (RF), and support vector machine (SVM) regression. The coefficient of determination ($R^2$) on the test dataset is used to represent the performance of the trained models.

[Figure: test-set $R^2$ of each model versus the number of top-ranked features used in training.]

As we can see, the maximum accuracy of these models is in the order RF > BR ~ LR ~ SVM > NN. RF needs only the top 8 features to reach good accuracy, and afterwards its accuracy is almost independent of the number of top-ranked features. The performance of BR, LR, and SVM improves continuously as more top-ranked features are added, reaching a maximum with the top 26 features. NN shows a different trend: it already reaches its best performance with the top 8-12 features but then gets progressively worse as more features are added.

I am wondering what the possible reasons for these results are. Could you give some directions or hints? For example:

  1. Why is the accuracy in the order RF > BR ~ LR ~ SVM > NN?
  2. Why does the RF model reach good accuracy with only a few features and then stay almost constant as more are added?
  3. Why do the linear models BR and LR perform very similarly to the SVM model?
  4. Why does the NN model reach good accuracy with a few features, and why does its accuracy drop as the number of features increases?

I understand that the reasons vary from case to case, but what are the general explanations, or what directions should I look in to find the answers?

Solution

First, it would be beneficial if you could mention whether the $R^2$ presented in the figure is for the test or the training dataset.

Let's assume that it is for the test dataset.

The answers to your questions are all somewhat subjective and depend on the type of data you are working with, but let me answer them briefly.

Because random forest (RF) is an ensemble method that combines many decision trees, it is expected to perform very well, and it is robust to overfitting. We can see that when the number of features increases there is no drop in the performance of the RF model, whereas for other models a large number of features can cause a drop in performance on the test dataset. RF is not prone to overfitting because it averages over many trees, and because each split selects the most informative feature available, low-ranked features are largely ignored; this is why the test $R^2$ plateaus after the top 8 features instead of degrading.
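As a rough check of this explanation, here is a minimal sketch, assuming `X` is your (n_samples, 42) matrix of processing conditions and `y` the measured property; the synthetic data and hyperparameters below are placeholder assumptions, not your pipeline. It ranks features by RF importance and tracks the cross-validated test $R^2$ as lower-ranked features are added:

```python
# Minimal sketch: X, y and all hyperparameters are placeholder assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 42))                                   # placeholder features
y = X[:, :8] @ rng.normal(size=8) + 0.1 * rng.normal(size=200)   # placeholder target

# Rank features once by RF importance (for a strict estimate, rank inside
# each CV fold instead to avoid information leakage).
ranking = np.argsort(
    RandomForestRegressor(n_estimators=200, random_state=0)
    .fit(X, y)
    .feature_importances_
)[::-1]

# Test R^2 should plateau once the informative features are included.
for k in (4, 8, 16, 26, 42):
    r2 = cross_val_score(
        RandomForestRegressor(n_estimators=200, random_state=0),
        X[:, ranking[:k]], y, cv=5, scoring="r2",
    ).mean()
    print(f"top {k:2d} features: mean CV R^2 = {r2:.3f}")
```

If the curve for your real data looks like your figure, that plateau is the forest's built-in feature selection at work.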

Linear regression and SVM probably perform the same because you might have used a linear kernel for the SVM. If that is not the case, it means that transforming the data into a new feature space (which is what a non-linear kernel does implicitly) is not useful for your dataset; in other words, the relationship between your features and the target is probably close to linear.
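A quick way to test the kernel explanation is to score a linear-kernel SVR, an RBF-kernel SVR, and plain linear regression side by side. The sketch below uses the same placeholder `X`/`y` as above and an assumed, untuned `C=1.0`; if the linear kernel matches linear regression and the RBF kernel adds nothing, the relationship is probably close to linear:

```python
# Minimal sketch with placeholder data and untuned hyperparameters.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 42))
y = X[:, :8] @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

models = {
    "linear regression ": LinearRegression(),
    "SVR, linear kernel": make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0)),
    "SVR, RBF kernel   ": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```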

It seems that you did not use regularization for the linear regression. If that is true, the model can overfit, especially when the number of features is large relative to the sample size. If you add regularization (which is effectively what ridge, or the Bayesian ridge you already use, does), the performance of the linear model typically does not decrease as the number of features increases.
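As an illustration of the regularization point, the sketch below compares ordinary least squares with ridge regression as more features are included; the deliberately small placeholder sample and the penalty `alpha=1.0` are assumptions, not values taken from your study:

```python
# Minimal sketch: small placeholder sample, assumed ridge penalty alpha=1.0.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 42))                          # small sample on purpose
y = X[:, :8] @ rng.normal(size=8) + 0.5 * rng.normal(size=120)

for k in (8, 26, 42):
    for name, model in (
        ("OLS  ", LinearRegression()),
        ("ridge", make_pipeline(StandardScaler(), Ridge(alpha=1.0))),
    ):
        r2 = cross_val_score(model, X[:, :k], y, cv=5, scoring="r2").mean()
        print(f"{name} with {k:2d} features: mean CV R^2 = {r2:.3f}")
```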

k-nearest neighbors (NN) is a distance-based method, so it is the model most sensitive to which features are included and how they are scaled. With only the top 8-12 features, the distances between samples are dominated by informative processing conditions and the model performs well. As lower-ranked features are added, they mostly contribute noise to the distance computation, the nearest neighbours become less meaningful (the curse of dimensionality), and the test $R^2$ drops. Unlike RF, k-NN has no built-in mechanism for ignoring irrelevant features, which is why its performance keeps degrading as more features are used. Scaling the features and tuning the number of neighbours can soften, but usually not remove, this effect.
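To see how extra features dilute the distance metric, the sketch below scores k-NN with only the informative features and then with all 42; the placeholder data and the assumed `n_neighbors=5` are again only for illustration:

```python
# Minimal sketch: placeholder data, assumed n_neighbors=5.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 42))
y = X[:, :8] @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

for k in (8, 12, 26, 42):
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
    r2 = cross_val_score(model, X[:, :k], y, cv=5, scoring="r2").mean()
    print(f"k-NN with first {k:2d} features: mean CV R^2 = {r2:.3f}")
```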

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange