statistics or robust statistics for identifying multivariate outliers

https://datascience.stackexchange.com/questions/13018

16-10-2019

Question

For univariate data sets, we can use straightforward methods such as a box plot or the [5%, 95%] quantile range to identify outliers. For multivariate data sets, are there any statistics that can be used to identify outliers?

Solution

Multivariate outlier detection can be quite tricky, and even 2D data can be difficult to decipher visually at times. You are spot-on in looking for robust statistical treatments analogous to 95% quantiles.

Because the squared Mahalanobis distances of normally distributed data approximately follow a chi-square distribution with n degrees of freedom, the gold standard in n dimensions is to compute Mahalanobis distances (ideally from a robust estimate of location and covariance) and then eliminate data beyond the 95% or 99% quantile in Mahalanobis space.
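
As a minimal sketch of that recipe (not taken from the original answer): the snippet below uses NumPy and SciPy, estimates the mean and covariance with the plain sample estimators, and compares squared Mahalanobis distances against a chi-square quantile. The function name `mahalanobis_outliers`, the 97.5% default, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(X, quantile=0.975):
    """Return squared Mahalanobis distances and a boolean outlier mask.

    Hypothetical helper: uses the classical (non-robust) sample mean and
    covariance, and a chi-square cutoff with df = number of columns.
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - center
    # Solve cov @ z = diff.T rather than inverting cov explicitly.
    d2 = np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > cutoff


# Synthetic, correlated 2D data with three planted outliers (illustrative only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[:3] = [[4.0, -4.0], [5.0, -5.0], [-4.5, 4.5]]

d2, is_outlier = mahalanobis_outliers(X)
print(np.flatnonzero(is_outlier))  # indices of the flagged rows
```

Keep in mind that the sample mean and covariance are themselves pulled around by outliers, so with heavier contamination a robust estimator is the safer choice.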

Plug-and-play implementations are available in scikit-learn and in R.
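
The answer does not say which scikit-learn estimator it has in mind. One candidate is `MinCovDet` (the Minimum Covariance Determinant estimator) from `sklearn.covariance`, whose `mahalanobis` method returns squared distances based on a robust fit of location and covariance. A hedged sketch, with synthetic data and a 97.5% chi-square cutoff chosen purely for illustration:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Synthetic, correlated 2D data with a planted cluster of 10 outliers
# (illustrative only, not from the original answer).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[:10] = rng.normal(loc=[6.0, -6.0], scale=0.3, size=(10, 2))

# Robust estimate of location and covariance.
mcd = MinCovDet(random_state=0).fit(X)

# MinCovDet.mahalanobis returns *squared* Mahalanobis distances.
robust_d2 = mcd.mahalanobis(X)

# Assumed cutoff: chi-square quantile with df = number of features.
cutoff = chi2.ppf(0.975, df=X.shape[1])
robust_outliers = robust_d2 > cutoff

print(np.flatnonzero(robust_outliers))  # indices of the flagged rows
```

Because the estimate is fitted on the most concentrated subset of the data, a tight cluster of outliers like the one planted here has much less influence on the distances than it would with the plain sample covariance.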

Here is an excellent theoretical and practical treatment of the methodology:

And here is a big picture viewpoint with some heuristics.

Additionally, there is a very sophisticated treatment called PCOUT that instead relies on principal component decomposition for outlier detection. There is a corresponding R package, but the theoretical treatment is behind a paywall:

P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694-1711, 2008.

Hope this helps!

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange