Outlier detection in a probability/frequency distribution

https://stackoverflow.com/questions/20413188

29-08-2022
Question

I have the following two-dimensional dataset. Both X and Y are continuous random variables.

Z = (X, Y) = {(1, 7), (2, 15), (3, 24), (4, 25), (5, 29), (6, 32), (7, 34), (8, 35), (9, 27), (10, 39)}

I want to detect outliers with respect to the values of the y variable. The normal range for y is 10-35, so the first and last pairs in the dataset above are outliers and the others are normal pairs. I want to transform the variable z = (x, y) into a probability/frequency distribution in which the outlier values (the first and last pairs) lie outside one standard deviation of the mean. Can anyone help me solve this problem?

PS: I have tried different distances, such as the Euclidean and Mahalanobis distances, but they didn't work.

Solution

I'm not exactly sure what your end goal is, but I'm going to assume you store your x, y variables in an n-by-2 matrix, so z = [x, y], where x and y are both n-by-1 vectors.

So what you are asking for is a way to separate out the data points where y is outside the 10-35 range? For that you can use a logical condition that marks the rows where y is inside the range, and then index with it (or with its negation):

index = z(:,2) <= 35 & z(:,2) >= 10;  % n-by-1 logical vector: 1 where y is in [10, 35], 0 otherwise
z_inliers = z(index,:);      % [x, y] matrix containing only the inlier data points
z_outliers = z(~index,:);    % [x, y] matrix containing only the outlier data points
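
As a quick check, here is a minimal sketch that applies the same range test to the dataset from the question. It assumes the pairs are typed in by hand as a 10-by-2 matrix named z, with x in column 1 and y in column 2.

```matlab
% Dataset from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];

% Logical mask: true where y lies inside the normal range [10, 35]
index = z(:,2) >= 10 & z(:,2) <= 35;

z_inliers  = z(index,:);    % the 8 pairs whose y is inside the range
z_outliers = z(~index,:);   % the 2 pairs whose y is outside the range

disp(z_outliers)            % shows the rows (1, 7) and (10, 39)
```

With the fixed 10-35 range, only the first and last pairs end up in z_outliers, which matches what the question expects.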

If you want to do this according to the standard deviation instead, replace 10 and 35 with the mean of y minus and plus one standard deviation:

low_range = mean(z(:,2)) - std(z(:,2));
high_range = mean(z(:,2)) + std(z(:,2));
index = z(:,2) <= high_range & z(:,2) >= low_range;
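
To see what the one-standard-deviation rule does on the data from the question, here is a small sketch. It standardizes y (subtract the mean, divide by the standard deviation) and flags the points whose standardized value is larger than 1 in magnitude, which is the same test as the low_range/high_range comparison. The names y, mu, sigma, and score are only illustrative, and std is used with its default normalization (n-1).

```matlab
% Dataset from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];
y = z(:,2);

mu    = mean(y);            % sample mean of y
sigma = std(y);             % sample standard deviation (normalized by n-1)

% Standardized y: how many standard deviations each value is from the mean
score = (y - mu) / sigma;

% Inliers are within one standard deviation of the mean
index = abs(score) <= 1;

z_inliers  = z(index,:);
z_outliers = z(~index,:);

disp(z_outliers)
```

On this dataset the mean of y is 26.7 and the standard deviation is about 9.67, so the band runs from roughly 17.0 to 36.4. That flags (2, 15) in addition to (1, 7) and (10, 39), so a one-standard-deviation band does not reproduce the 10-35 range exactly. If only the first and last pairs should count as outliers, the fixed range is the closer fit.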

Then you can plot your PDFs, or do whatever else you need, using only those points.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow