Handling a regularly increasing feature set

https://datascience.stackexchange.com/questions/634

16-10-2019
|

Question

I'm working on a fraud detection system. In this field, new frauds appear regularly, so that new features have to be added to the model on ongoing basis.

I wonder what is the best way to handle it (from the development process perspective)? Just adding a new feature into the feature vector and re-training the classifier seems to be a naive approach, because too much time will be spent for re-learning of the old features.

I'm thinking along the way of training a classifier for each feature (or a couple of related features), and then combining the results of those classifiers with an overall classifier. Are there any drawbacks of this approach? How can I choose an algorithm for the overall classifier?

Solution

In an ideal world, you retain all of your historical data, and do indeed run a new model with the new feature extracted retroactively from historical data. I'd argue that the computing resource spent on this is quite useful actually. Is it really a problem?

Yes, it's a widely accepted technique to build an ensemble of classifiers and combine their results. You can build a new model in parallel just on new features and average in its prediction. This should add value, but, you will never capture interaction between the new and old features this way, since they will never appear together in a classifier.

OTHER TIPS

Here's an idea that just popped out of the blue – what if you make use of Random Subspace Sampling (as in fact Sean Owen already suggested) to train a bunch of new classifiers every time a new feature appears (using a random feature subset, including the new set of features). You could train those models on a subset of samples as well to save some training time.

This way you can have new classifiers possibly taking on both new and old features, and at the same time keeping your old classifiers. You might even, perhaps using a cross validation technique to measure each classifier's performance, be able to kill-off the worst performing ones after a while, to avoid a bloated model.

What you describe falls in the category of concept drift in machine learning. You might find interesting and actionable ideas in this summary paper and you'll find a taxonomy of the possible approaches in these slides.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange