Median absolute deviation for time series outlier detection in Amazon Redshift

https://dba.stackexchange.com/questions/142661

02-10-2020

Problem

Context

I am tasked with detecting outliers in time series data in an Amazon Redshift (PostgreSQL-based) system; around the office it is known as the public holiday detector. The method I have been using computes a windowed average and a windowed standard deviation over the previous N data points, then applies the following test to the current data point:

(x0 - avg(x1:xN)) / stddev(x1:xN) > threshold
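For reference, the test above could be written in Redshift roughly as follows. This is only a sketch: the table `my_table`, the columns `ts` and `val`, the 30-row window, and the threshold of 3 are all assumed for illustration. The cast is there because Redshift's AVG returns an integer type for integer input.

```sql
-- Sketch only: my_table, ts, val, the window of 30 and the threshold of 3
-- are placeholders, not values from the real system.
WITH stats AS (
  SELECT
    ts,
    val,
    -- Statistics over the N = 30 rows before the current one (x1:xN).
    AVG(CAST(val AS DOUBLE PRECISION)) OVER (
      ORDER BY ts ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
    ) AS prior_avg,
    STDDEV_SAMP(CAST(val AS DOUBLE PRECISION)) OVER (
      ORDER BY ts ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
    ) AS prior_stddev
  FROM my_table
)
SELECT
  ts,
  val,
  (val - prior_avg) / NULLIF(prior_stddev, 0) AS z_score
FROM stats
-- NULLIF avoids division by zero; rows with a NULL score are not flagged.
WHERE (val - prior_avg) / NULLIF(prior_stddev, 0) > 3
ORDER BY ts;
```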

Tuning the window length and the threshold has been sufficient so far, but the method is not robust: after significant growth following an advertising campaign, the series strays far from the running average and standard deviation, and every point is flagged as an outlier.

Reducing the window length lets it adapt to these changes faster, but it then gives a poorer long-term model.

Increasing the threshold to accommodate this sort of growth would mean that outliers we previously detected would no longer be detected.

These related questions provide suggestions in R, but the answers quite often mention Median Absolute Deviation as a robust method:

https://en.wikipedia.org/wiki/Median_absolute_deviation

tl;dr

How do I implement Median Absolute Deviation on a time series data set in Amazon Redshift?

I'm not sure whether I'm missing something fundamental about the method, but I'd like it to work only on a window rather than on the entire data set. However, the MEDIAN window function does not allow a frame clause.

If not this method, a pointer toward more sophisticated outlier detection queries in Amazon Redshift would be appreciated.

Solution

I believe this is very doable with a CTE or subquery!

MAD can be calculated by composing several Redshift functions:

the median of a list of values
absolute value of the difference between each value and that median
the median of those values
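Written as a single expression, the three steps above amount to the usual definition of MAD (the symbols here are just notation chosen for this note):

```latex
\mathrm{MAD} = \operatorname{median}_i \left( \left| x_i - \operatorname{median}_j (x_j) \right| \right)
```

As a small illustration with made-up values 1, 2, 3, 4, 100: the median is 3, the absolute deviations are 2, 1, 0, 1, 97, and their median is 1, so MAD = 1. The single extreme value barely affects the result, which is why the method is considered robust.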

I wrote this in the form of:

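-- my_table and val are placeholder names; substitute your own table and column.
WITH
medians AS (
  SELECT
    t.val,
    -- Median of all values, repeated on every row.
    MEDIAN(t.val) OVER () AS median_value,
    -- Distance of each value from that median.
    ABS(t.val - MEDIAN(t.val) OVER ()) AS absolute_deviation
  FROM my_table AS t
  -- No GROUP BY here: collapsing duplicate values would change the medians.
)
SELECT
  -- Aggregate form, so the result is a single row.
  MEDIAN(absolute_deviation) AS median_absolute_deviation
FROM medians;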
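The query above covers the whole data set. Since MEDIAN does not accept a frame clause, one way to restrict it to the N prior points is a self-join on row numbers, sketched below. Everything named here is an assumption for illustration: the table `my_table`, the columns `ts` (assumed unique) and `val`, the 30-row window, and the threshold of 3. The factor 1.4826 is the conventional constant that makes MAD comparable to a standard deviation for normally distributed data.

```sql
-- Sketch only: names, window size and threshold are placeholders.
WITH numbered AS (
  SELECT ts, val, ROW_NUMBER() OVER (ORDER BY ts) AS rn
  FROM my_table
),
-- Pair every row with the (up to) 30 rows that precede it.
windowed AS (
  SELECT cur.rn, prev.val AS prior_val
  FROM numbered AS cur
  JOIN numbered AS prev
    ON prev.rn BETWEEN cur.rn - 30 AND cur.rn - 1
),
-- Median of the prior points, one row per current point.
window_median AS (
  SELECT rn, MEDIAN(prior_val) AS prior_median
  FROM windowed
  GROUP BY rn
),
-- Median of the absolute deviations from that median (the windowed MAD).
window_mad AS (
  SELECT
    w.rn,
    m.prior_median,
    MEDIAN(ABS(w.prior_val - m.prior_median)) AS prior_mad
  FROM windowed AS w
  JOIN window_median AS m ON m.rn = w.rn
  GROUP BY w.rn, m.prior_median
)
SELECT n.ts, n.val, d.prior_median, d.prior_mad
FROM numbered AS n
JOIN window_mad AS d ON d.rn = n.rn
WHERE d.prior_mad > 0  -- skip windows with no spread
  AND (n.val - d.prior_median) / (1.4826 * d.prior_mad) > 3
ORDER BY n.ts;
```

This keeps the one-sided comparison from the question; wrapping the numerator in ABS would flag low outliers as well. The self-join produces roughly N rows per data point, so it may be slow on large tables.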

Hope this helps

License: CC-BY-SA with attribution

Not affiliated with dba.stackexchange