Question

The original dataset consists of ~17K compound structures, almost equally split between "yes" and "no" labels. After heavy use of mol2vec and RDKit, I have created ~300 data points.

Using boosted trees on the same shuffled train/test split gives 98% train accuracy and 89% test accuracy, but a simple neural network gives 100% accuracy on both train and test.

I have checked the code again to make sure no target leakage is occurring, and I have also coded it from scratch twice to rule out mistakes, yet I still do not believe I should be getting 100% accuracy on both train and test.

Does this mean that the model is actually accurate because there are so many data points?

[Training log screenshot]


Solution

Yes, it is possible for a neural network to reach 100% accuracy where a tree-based model does not. Neural networks can learn smooth non-linear relationships through their activation functions, whereas tree-based models are restricted to piecewise-constant decision surfaces built from axis-aligned splits.
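As a rough illustration of that comparison (a minimal sketch, not the asker's actual pipeline): scikit-learn's HistGradientBoostingClassifier and MLPClassifier are assumed here, and synthetic data stands in for the ~300 mol2vec/RDKit features.

```python
# Minimal sketch: boosted trees vs. a small neural network on the same
# shuffled split. Synthetic data stands in for the real feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=300, n_informative=50,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)

gbt = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
).fit(X_train, y_train)

for name, model in [("boosted trees", gbt), ("neural network", mlp)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```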

OTHER TIPS

Doing feature selection on the full dataset can lead to exactly this scenario, especially when so many features are available. This thread from the Stats Stack Exchange has more information on how this happens, but the upshot is to make sure feature selection is done only on the training subset ("training set"); a sketch of this follows the link below.

https://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation/
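One common way to enforce this (a hedged sketch under the assumption that scikit-learn is in use; the specific selector and classifier are placeholders, not the asker's code) is to put the feature-selection step inside a Pipeline, so it is re-fit on each training fold and never sees the held-out data.

```python
# Feature selection inside a Pipeline: SelectKBest is fit only on the
# training portion of each cross-validation fold, so the test fold never
# influences which features are kept.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=300, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=50)),  # fit on training folds only
    ("clf", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=0)),
])

# cross_val_score refits the whole pipeline (including the selector) per fold.
scores = cross_val_score(pipe, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```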

Please give this a try and speak up if you're still running into issues--just update the question with the relevant additional detail. It's certainly possible to hit 100%, but you would want to make sure there was no information leakage.
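One quick sanity check for leakage (an addition here, not part of the original answer) is a label-shuffling test: rerun the exact same pipeline with randomly permuted labels. On a balanced binary problem, test accuracy should fall to roughly 50%; if it stays near 100%, some step is leaking information from the test set. The sketch below assumes scikit-learn and uses synthetic data as a stand-in.

```python
# Label-shuffling leakage check: with random labels, held-out accuracy
# should be close to chance (~0.5 for a balanced binary problem).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=300, random_state=0)
y_shuffled = np.random.default_rng(0).permutation(y)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_shuffled, test_size=0.2, stratify=y_shuffled, random_state=0)

acc = MLPClassifier(max_iter=500, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print("test accuracy with shuffled labels:", acc)  # expect ~0.5
```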

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange