What you are doing here is nothing more than a fancy range check. If you are not willing to use `X` to find outliers in `Y` (even though you really should), it would be a lot simpler and better to just check the distribution of `Y` directly instead of this improvised SVM solution (for example, remove the upper and lower 0.5 percentiles of `Y`).
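That percentile filter is a one-liner. A minimal sketch with NumPy, using synthetic data as a stand-in for your `Y` values (the variable names are mine, not yours):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)  # synthetic stand-in for the Y values

# Keep everything between the 0.5th and 99.5th percentiles of Y.
lo, hi = np.percentile(y, [0.5, 99.5])
inliers = y[(y >= lo) & (y <= hi)]

# Roughly 1% of the points get dropped, by construction.
print(len(y) - len(inliers))
```

No kernels, no hyperparameters to guess, and the rejection rate is exactly what you asked for.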
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting `Y` values as outliers without considering any context (e.g. `X`). Why are you using an RBF kernel, and how did you come up with that specific value for `gamma`? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (`Y`). A kitten dies every time this happens. A one-class SVM attempts to build a model which recognizes the training data; it should not be evaluated on the same data it was built with. Please, think of the kittens.
Additionally, note that the `nu` parameter of a one-class SVM controls the fraction of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4):

> It is proved that `nu` is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.

In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace *can* by *should*.
So when you say that the resulting model "does yield performance somewhat in line with what I would expect" ... of course it does, by definition. Since you set `nu=0.01`, 1% of the data is rejected by the model and thus flagged as an outlier.