Question

I have a number of small data sets, each containing 10 XY coordinates. I am using Matlab (R2012a) and k-means to obtain a centroid. In some of the clusters (see figure below) I can see some extreme points; because my data sets are so small, a single outlier ruins the value of my centroid. Is there an easy way to exclude these points? Supposedly Matlab has an 'exclude outliers' function, but I can't see it anywhere in the tool menu. Thank you for your help! (And yes, I am new to this :-)

[Figure: clusters containing extreme points]

Solution

k-means can be quite sensitive to outliers in your data set. The reason is simply that k-means minimizes the sum of squared distances between the points and their cluster centroids, so a large deviation (such as that of an outlier) gets a lot of weight.
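
As a rough illustration of that sensitivity, here is a minimal sketch with made-up data (not the asker's): nine points near the origin plus one extreme point. For a single cluster, the k-means centroid is just the mean, so the sketch compares the mean with and without the extreme point, and with the coordinate-wise median for reference. The variable names are only illustrative.

```matlab
% Made-up data: nine points near the origin plus one extreme point
pts = [ 0.1  0.2; -0.3  0.1;  0.2 -0.2;  0.0  0.3; -0.1 -0.1; ...
        0.3  0.0; -0.2  0.2;  0.1 -0.3;  0.0  0.0; 10.0 10.0];

centroidAll   = mean(pts, 1);          % mean of all ten points (single-cluster k-means centroid)
centroidClean = mean(pts(1:9, :), 1);  % mean without the extreme point
centroidMed   = median(pts, 1);        % coordinate-wise median, for comparison

disp(centroidAll);    % roughly [1.01 1.02]: pulled far towards the extreme point
disp(centroidClean);  % close to the origin
disp(centroidMed);    % [0.05 0.05]: barely affected by the extreme point
```

With only ten points, the single extreme point moves the mean by about one unit in each coordinate, while the median hardly changes.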

If you have a noisy data set with outliers, you might be better off using an algorithm that has specialized noise handling, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Note the "N" in the acronym: Noise. Unlike k-means and many other clustering algorithms, DBSCAN can decide not to cluster objects that lie in regions of low density.
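
A minimal sketch of the DBSCAN route, assuming a Matlab release that ships the dbscan function (Statistics and Machine Learning Toolbox, R2019a or later; it is not available in R2012a, where you would need a third-party implementation). The data and the values of epsilon and minpts are made up for illustration and would need tuning to the scale of your own coordinates.

```matlab
% Made-up data: nine tightly grouped points plus one extreme point
X = [ 0.1  0.2; -0.3  0.1;  0.2 -0.2;  0.0  0.3; -0.1 -0.1; ...
      0.3  0.0; -0.2  0.2;  0.1 -0.3;  0.0  0.0; 10.0 10.0];

epsilon = 1;  % neighborhood radius (assumed value; depends on your data scale)
minpts  = 3;  % neighbors required for a core point (assumed value)

idx = dbscan(X, epsilon, minpts);     % cluster labels; noise points are labelled -1
isNoise  = (idx == -1);
centroid = mean(X(~isNoise, :), 1);   % centroid of the points not flagged as noise
```

Here all non-noise points fall into one cluster, so a single mean is enough; with several clusters you would compute the mean per label in idx.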

Other tips

You're looking for something like "outlier removal", and, as others have linked to above, "there is no rigorous mathematical definition of what constitutes an outlier" (http://en.wikipedia.org/wiki/Outlier#Identifying_outliers).

Outlier detection is even more difficult when you're doing unsupervised clustering, since you're trying to learn both what the clusters are and which data points belong to no cluster at all.

One simple definition is to treat any data point that is "far" from every other data point as an outlier. E.g., you might consider removing the point whose distance to its nearest neighbor is the largest:

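x = randn(100,2);    % 100 random 2-D points
x(101,:) = [10 10];  % a clear outlier
nSamples = size(x,1);

% pairwise Euclidean distances (pdist and squareform need the Statistics Toolbox)
pointToPointDistVec = pdist(x);
pointToPointDist = squareform(pointToPointDistVec);  % nSamples-by-nSamples matrix
pointToPointDist = pointToPointDist + diag(inf(nSamples,1));  % set self-distances to inf so they are ignored

smallestDist = min(pointToPointDist,[],2);          % distance from each point to its nearest neighbor
[maxSmallestDist,outlierInd] = max(smallestDist);   % outlierInd is the row index of the most isolated point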

You can repeat the above a few times to remove several points one at a time. Note that this will not remove outliers that happen to have at least one nearby neighbor. If you read the Wikipedia page and see an algorithm that might be more helpful, try to implement it and ask about that specific approach.
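
The repeated removal could be sketched roughly as follows, using made-up data of the size described in the question. The count nRemove is an assumption for illustration; pdist and squareform again require the Statistics Toolbox.

```matlab
% Made-up data: nine points near the origin plus one extreme point
x = [ 0.1  0.2; -0.3  0.1;  0.2 -0.2;  0.0  0.3; -0.1 -0.1; ...
      0.3  0.0; -0.2  0.2;  0.1 -0.3;  0.0  0.0; 10.0 10.0];

nRemove = 1;  % how many points to drop (assumed; choose per data set)

for k = 1:nRemove
    d = squareform(pdist(x));             % pairwise Euclidean distances
    d(logical(eye(size(d)))) = inf;       % ignore self-distances
    nearestDist = min(d, [], 2);          % distance to each point's nearest neighbor
    [~, outlierInd] = max(nearestDist);   % index of the most isolated point
    x(outlierInd, :) = [];                % remove that point
end

centroid = mean(x, 1);  % centroid of the remaining points
```

Note that this loop always removes nRemove points, whether or not they are really outliers, so stopping once the largest nearest-neighbor distance falls below some threshold may be a safer rule.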

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow