Scalable Outlier/Anomaly Detection
16-10-2019
Question
I am trying to set up a big data infrastructure using Hadoop, Hive, Elasticsearch (amongst others), and I would like to run some algorithms over certain datasets. I would like the algorithms themselves to be scalable, which excludes tools such as Weka, R, or even RHadoop. The Apache Mahout library seems to be a good option, and it features algorithms for regression and clustering tasks.
What I am struggling to find is a solution for anomaly or outlier detection.
Since Mahout features Hidden Markov Models and a variety of clustering techniques (including K-Means), I was wondering whether it would be possible to build a model to detect outliers in time series using any of these. I would be grateful if somebody experienced in this area could advise me on
- whether it is possible and, if so,
- how to do it, plus
- an estimate of the effort involved and
- the accuracy and problems of this approach.
Solution
I would take a look at the t-digest algorithm. It has been merged into Mahout and is also part of some other libraries for big data streaming. You can learn more about this algorithm in particular, and about big data anomaly detection in general, from the following resources:
- The book Practical Machine Learning: A New Look at Anomaly Detection.
- Webinar: Anomaly Detection When You Don't Know What You Need to Find
- Anomaly Detection in Elasticsearch.
- Beating Billion Dollar Fraud Using Anomaly Detection: A Signal Processing Approach using Argyle Data on the Hortonworks Data Platform with Accumulo
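To give a rough idea of how a quantile estimator such as t-digest can be used for anomaly detection: t-digest estimates quantiles of a stream in bounded memory, so one simple approach is to flag any value (or anomaly score) above a high quantile of what has been seen so far. The sketch below is only an illustration of that thresholding idea. It uses an exact sorted list as a stand-in for t-digest, and the class name, the 0.99 quantile, and the warm-up size are all assumptions of mine, not anything taken from Mahout.

```python
import bisect
import random


class QuantileThresholdDetector:
    """Flags values above a high quantile of the values seen so far.

    A sorted list gives exact quantiles here. At scale, an approximate
    quantile sketch such as t-digest would take its place so that memory
    stays bounded. (Hypothetical helper, for illustration only.)
    """

    def __init__(self, quantile=0.99, warmup=100):
        self.quantile = quantile  # assumed threshold quantile
        self.warmup = warmup      # observations required before flagging
        self._sorted = []

    def threshold(self):
        # Index of the requested quantile in the sorted history.
        n = len(self._sorted)
        idx = min(n - 1, int(self.quantile * n))
        return self._sorted[idx]

    def observe(self, value):
        """Return True if value exceeds the current threshold, then record it."""
        is_anomaly = len(self._sorted) >= self.warmup and value > self.threshold()
        bisect.insort(self._sorted, value)
        return is_anomaly


def demo():
    rng = random.Random(0)
    detector = QuantileThresholdDetector(quantile=0.99, warmup=100)
    # Synthetic "normal" readings around 10, with one injected spike.
    stream = [rng.gauss(10.0, 1.0) for _ in range(1000)]
    stream[500] = 50.0
    # Indices of the readings that were flagged as anomalous.
    return [i for i, x in enumerate(stream) if detector.observe(x)]


if __name__ == "__main__":
    print(demo())
```

For time series you would normally feed the detector a score rather than the raw value, for example the error of a forecast or the distance to the nearest K-Means centroid, and then let the quantile threshold decide what counts as unusual.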
OTHER TIPS
You can refer to my answer on Stack Exchange about the H2O anomaly detection method for R or Python, since that is scalable too.