After fitting data to a Gaussian model, how to check whether a new data point falls within a variance distance of the model

StackOverflow https://stackoverflow.com/questions/20274787

  •  06-08-2022

Question

Suppose we have a data set that we have already fitted with a 3D Gaussian model, so we now have the mean, the covariance matrix and the pdf. Given a new data point, I want to check whether it lies within the covariance (inside the model). For some reasons, I need the answer to be expressed in terms of the variance. How can I do that in MATLAB, or even just logically?

Solution

Similar to what roybatty wrote, but more concretely: If you want to check the distance of a data point to the distribution in terms of standard deviations, you have the problem that the standard deviation is different in different directions. The standard way to take care of this is to compute the Mahalanobis distance between the distribution mean and the data point:

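Mahalanobis distance: $D_M(x_n) = \sqrt{(x_n - m)^\top S^{-1} (x_n - m)}$, with $x_n$ and $m$ written as column vectors, $m$ the distribution mean and $S$ its covariance matrix.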

If you estimate the distribution parameters from a set of data points x like this

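m = mean(x);   % x holds one observation per row; m is a row vector of column means
S = cov(x);    % covariance matrix of the columns of x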

then for a new data point xn you obtain the Mahalanobis distance like this:

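d = xn - m;               % xn must be a row vector, because mean(x) returns a row vector
DM = sqrt(d / S * d');    % d / S computes d * inv(S) without forming the inverse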

DM is the distance of xn from the center of the distribution m in units of standard deviations, and you can apply the usual outlier criteria, e.g. DM > 3.
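As a rough end-to-end sketch of the steps above, the check could look like the following. The synthetic data, the mixing matrix A, the variable names and the cutoff k = 3 are all assumptions made for illustration and are not part of the original answer. The sketch uses only base MATLAB functions and the backslash and slash operators rather than inv(S).

```matlab
% Hypothetical training data: 500 samples from a correlated 3D Gaussian.
A = [2 0 0; 1 1 0; 0.5 0.2 0.5];              % assumed mixing matrix (illustration only)
x = randn(500, 3) * A + repmat([1 2 3], 500, 1);   % one observation per row

m = mean(x);                  % 1-by-3 row vector of column means
S = cov(x);                   % 3-by-3 covariance matrix

xn = [1.5 2.5 3.5];           % new point, same row layout as x
d  = xn - m;                  % deviation from the mean
DM = sqrt(d * (S \ d'));      % Mahalanobis distance without an explicit inverse

k = 3;                        % assumed cutoff in units of standard deviations
isInside = DM <= k;           % true if xn lies within k sigma of the model
fprintf('DM = %.3f, inside = %d\n', DM, isInside);

% Several query points at once: one point per row of Xn.
Xn  = [xn; m; m + 100];
D   = Xn - repmat(m, size(Xn, 1), 1);
DMs = sqrt(sum((D / S) .* D, 2));   % row-wise d * inv(S) * d'
```

One caveat on the cutoff: for 3D Gaussian data the squared Mahalanobis distance follows a chi-square distribution with 3 degrees of freedom, so DM <= 3 covers roughly 97% of the probability mass, not the 99.7% familiar from the one-dimensional 3-sigma rule.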

OTHER TIPS

MATLAB has a covariance function: cov(x) returns the variance of x when x is a vector. The variable x can also be a matrix whose rows are observations and whose columns are variables; in that case cov(x) returns the full covariance matrix, with the variance of each column on its diagonal. Check the help files to see if this is what you want.
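To see what cov returns for a vector versus a matrix, here is a small sketch with made-up numbers (the values of v and X are purely illustrative). For a matrix input the per-column variances sit on the diagonal of the result.

```matlab
v = [2 4 4 4 5 5 7 9];                  % vector input
cv = cov(v);                            % scalar, equal to var(v)

X = [1 2 0; 2 1 1; 3 5 0; 4 3 2];       % 4 observations (rows), 3 variables (columns)
C = cov(X);                             % 3-by-3 covariance matrix
colVar = diag(C)';                      % per-column variances, same values as var(X)
sigma  = sqrt(colVar);                  % per-column standard deviations
```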

Generally speaking, when you talk about comparing the value of a data point to a statistical model, this usually means checking whether the data value lies within some multiple of the "standard deviation" of the mean. For brevity in the discussion, the standard deviation is denoted by the Greek letter sigma:

sigma = sqrt(variance);

The "outlier" data points are usually defined as data values that lie more than n*sigma away from the mean, where n is chosen based on the application. For example, to keep all data values within 2 sigma, we would take the values x with abs(x - mu) < 2*sigma, where mu is the mean.

In your case it sounds like you are interested in data points within 1 sigma? Either way, since you already have the variance, calculate sigma using the above relation and then apply a check based on this value.
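A minimal sketch of that check, assuming the mean m and covariance matrix S of the fitted model are already available. The numeric values and the choice n = 1 below are made up for illustration. Note that this tests each coordinate against its own sigma, so it ignores correlations between the dimensions.

```matlab
m  = [1 2 3];                       % assumed model mean (row vector)
S  = [4 1 0; 1 2 0; 0 0 1];         % assumed covariance matrix
xn = [2.5 2.5 4.5];                 % new data point to test

sigma = sqrt(diag(S))';             % per-dimension standard deviations
n = 1;                              % assumed multiple of sigma
withinAxis = abs(xn - m) <= n * sigma;   % logical result for each dimension
allWithin  = all(withinAxis);            % true only if every coordinate passes
```

With these numbers the first two coordinates pass and the third does not, so allWithin is false.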

Again, I'm not exactly sure this is what you want but the way you described your problem makes me think this is what you are after. Google around a bit on standard deviation, variance, and outliers, to see if this explanation fits your application.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow