Question

For univariate data sets, we can use straightforward methods, such as a box plot or the [5%, 95%] quantile range, to identify outliers. For multivariate data sets, are there any statistics that can be used to identify outliers?

Solution

Multivariate outlier detection can be quite tricky and even 2D data can be difficult to visually decipher at times. You are spot-on in looking for robust statistical treatments analogous to 95% quantiles.

Because the squared Mahalanobis distances of multivariate normal data follow a chi-square distribution (with n degrees of freedom in n dimensions), the standard approach is to compute Mahalanobis distances, ideally from robust estimates of the mean and covariance, and then flag or eliminate data beyond the 95% or 99% quantile of that distribution.
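
As a rough illustration of the idea above, here is a minimal sketch using NumPy and SciPy. The function name `mahalanobis_outliers`, the synthetic data, and the planted point are all made up for this example. It also uses the plain sample mean and covariance, which outliers can distort, so treat it as the classical (non-robust) version of the technique.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(X, quantile=0.99):
    """Return squared Mahalanobis distances and a boolean outlier mask.

    Hypothetical helper: uses the classical sample mean and covariance,
    not a robust estimator.
    """
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Solve cov * z = diff^T instead of forming the inverse explicitly.
    d2 = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    # Chi-square cutoff with one degree of freedom per dimension.
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > cutoff


# Synthetic, correlated 2D data plus one planted point that lies far off
# the correlation axis (illustrative values only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
X = np.vstack([X, [[4.0, -4.0]]])

d2, is_outlier = mahalanobis_outliers(X, quantile=0.99)
print("flagged rows:", np.flatnonzero(is_outlier))
```

Note that the planted point is not extreme in either coordinate on its own; it stands out only once the correlation between the two variables is taken into account, which is what the Mahalanobis distance does.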

Plug and play capabilities are available in scikit-learn and in R.
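
For the scikit-learn route, one option is `EllipticEnvelope` from `sklearn.covariance`, which fits a robust covariance estimate and labels points by their Mahalanobis distance. This is only a sketch: the data is synthetic, and the `contamination` value of 0.02 is an assumption you would need to tune for your own data.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic inliers plus two planted outliers (illustrative values only).
rng = np.random.default_rng(42)
inliers = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=300)
planted = np.array([[6.0, -6.0], [-7.0, 7.0]])
X = np.vstack([inliers, planted])

# contamination is the assumed fraction of outliers; 0.02 is a guess here.
detector = EllipticEnvelope(contamination=0.02, random_state=0)
labels = detector.fit_predict(X)      # +1 for inliers, -1 for outliers
robust_d2 = detector.mahalanobis(X)   # squared distances under the robust fit

print("flagged rows:", np.flatnonzero(labels == -1))
```

Unlike a fixed chi-square cutoff, `contamination` makes the detector flag roughly that fraction of the points, so some ordinary points will be labeled as outliers if the true outlier rate is lower.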

Here is an excellent theoretical and practical treatment of the methodology.

And here is a big picture viewpoint with some heuristics.

Additionally, there is a very sophisticated treatment called PCOUT for outlier detection that instead relies on principal component decomposition. There is a corresponding R package, but the theoretical treatment is behind a paywall:

P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions, Computational Statistics and Data Analysis, 52, 1694-1711, 2008

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange