Question

I am performing phishing URL classification, and I am comparing several ML classifiers on a balanced two-class dataset (legitimate URLs vs. phishing URLs).

The ensemble and boosting classifiers such as Random Forest, AdaBoost, and Extra Trees, as well as k-NN, achieve an accuracy of about 90% and a false positive rate of about 11-12% (see figure).

On the other hand, classifiers such as SVM, Logistic Regression, Multinomial NB, and Bernoulli NB seem to perform poorly, with accuracies fluctuating between 70% and 80% and higher false positive rates.

Here is the thing: I also tried Gaussian NB, and although it yields by far the worst accuracy (58.84%), it has an incredibly low false positive rate of 2.14% (and thus a high false negative rate).

  1. I have no idea why this is happening. Any ideas?
  2. Why do some classifiers perform so poorly while others do not?

I tuned them all with grid search, they are evaluated on the same dataset (about 30k records per class), and I perform 3-fold cross-validation. It doesn't make any sense to me, especially for SVM. Finally, I use about 20 features.
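For reference, a minimal sketch of the tuning setup described here (grid search with 3-fold cross-validation), using a synthetic stand-in for the phishing dataset and illustrative parameter grids, since the actual features and grids are not shown in the question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the balanced ~20-feature phishing dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Illustrative parameter grids; the question does not list the real ones.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
}

best_scores = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validated grid search, as in the question.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    best_scores[name] = search.best_score_
    print(name, round(search.best_score_, 3))
```

The same loop extends to the other classifiers mentioned (SVM, the NB variants, etc.) by adding entries to `candidates`.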

P.S.: I use Python's scikit-learn library.


Metrics


Solution

  1. I find the easiest way for people to understand this is to think of the confusion matrix. Accuracy is just one measure derived from a confusion matrix, namely all the correct classifications over all predictions: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$ Your false negative rate is calculated by: $$\text{FNR} = \frac{FN}{FN + TP}$$ One model may turn out to have a worse accuracy but a better false negative rate. For example, the model with worse accuracy may in fact have many false positives but few false negatives, leading to a lower false negative rate. You need to choose the model that produces the most value for your specific use case.

  2. Why do some classifiers perform poorly? While an experienced practitioner might surmise what could be a good modeling approach for a dataset, the truth is that for all datasets, there is no free lunch, also known as "The Lack of A Priori Distinctions Between Learning Algorithms". You don't know ahead of time whether the best approach will be deep learning, gradient boosting, linear approaches, or any of the other models you could build.
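The accuracy-vs-FNR trade-off in point 1 can be made concrete with a tiny worked example. The labels below are illustrative, not real results: model B flags almost everything as positive, so it loses accuracy to false positives but rarely misses a true positive.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Model A: more correct overall, but misses two of the four positives.
y_pred_a = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
# Model B: over-predicts the positive class, so it misses no positives.
y_pred_b = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

for name, y_pred in [("A", y_pred_a), ("B", y_pred_b)]:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fnr = fn / (fn + tp)
    print(name, accuracy, fnr)  # A: 0.8 accuracy, 0.5 FNR; B: 0.7 accuracy, 0.0 FNR
```

Model B has the worse accuracy (0.7 vs 0.8) yet the better false negative rate (0.0 vs 0.5), which mirrors the Gaussian NB result in the question: a skewed decision boundary can trade overall accuracy for an extremely low error rate on one class.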

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange