Removing outliers from a k-means cluster

Question 1

k-means can be quite sensitive to outliers in your data set. The reason is that k-means minimizes the within-cluster sum of squared distances, so a large deviation (such as that of an outlier) gets a disproportionate amount of weight and pulls the cluster center toward it.
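To make the effect concrete, here is a minimal one-dimensional sketch. The data values are made up for illustration, and a single centroid (the mean) stands in for one k-means cluster center; it is not a full k-means run.

```matlab
% Five points near the origin plus one hypothetical outlier at 100.
x = [-2 -1 0 1 2 100]';

centroidAll   = mean(x);        % centroid of all six points (about 16.67)
centroidClean = mean(x(1:5));   % centroid without the outlier (0)

% Squared deviation of each point from the centroid of all six points.
sqDev = (x - centroidAll).^2;

% Fraction of the total sum of squares contributed by the outlier alone
% (roughly 0.83 for these values).
outlierShare = sqDev(end) / sum(sqDev);
```

A single point accounts for most of the objective here, which is why the centroid moves so far from the bulk of the data.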

If you have a noisy data set with outliers, you might be better off using an algorithm with specialized noise handling, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Note the "N" in the acronym: Noise. Unlike k-means and many other clustering algorithms, DBSCAN can decide not to cluster objects that lie in regions of low density.
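As a rough sketch of what that looks like in MATLAB, assuming you have the Statistics and Machine Learning Toolbox in a release that ships `dbscan` (R2019a or later): the synthetic data, the seed, and the `epsilon` and `minPts` values below are assumptions chosen for illustration and would need tuning on real data.

```matlab
% Requires the Statistics and Machine Learning Toolbox (dbscan).
rng(0);                                  % fixed seed, for repeatability only
x = [0.3*randn(50,2); ...                % dense blob around (0,0)
     0.3*randn(50,2) + 5; ...            % dense blob around (5,5)
     20 20];                             % one isolated point (row 101)

epsilon = 1;   % neighborhood radius (assumed value; tune for your data)
minPts  = 5;   % minimum number of neighbors for a core point (assumed value)

idx = dbscan(x, epsilon, minPts);        % idx == -1 marks points labeled as noise
noiseInd = find(idx == -1);              % row indices of the noise points
xClean   = x(idx ~= -1, :);              % keep only the points assigned to a cluster
```

The isolated point has no neighbors within `epsilon`, so it is labeled as noise rather than being forced into a cluster.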

Question 2

You're looking for something like "outlier removal" and, as others have linked to above, "there is no rigorous mathematical definition of what constitutes an outlier" - http://en.wikipedia.org/wiki/Outlier#Identifying_outliers.

Outlier detection is even more difficult when you're doing unsupervised clustering, since you're trying to learn both what the clusters are and which data points belong to no cluster at all.

One simple definition is to treat any data point that is "far" from every other data point as an outlier. E.g., you might consider removing the point whose distance to its nearest neighbor is the largest:

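x = randn(100,2);        % 100 points drawn from a standard normal
x(101,:) = [10 10];      % a clear outlier
nSamples = size(x,1);

% Pairwise Euclidean distances as a full nSamples-by-nSamples matrix
pointToPointDistVec = pdist(x);
pointToPointDist = squareform(pointToPointDistVec);
pointToPointDist = pointToPointDist + diag(inf(nSamples,1)); % set self-distances to inf so min ignores them

% Nearest-neighbor distance of each point; the largest one marks the outlier candidate
smallestDist = min(pointToPointDist,[],2);
[maxSmallestDist,outlierInd] = max(smallestDist);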

You can repeat the above a few times to remove points one at a time. Note that this will not remove outliers that happen to have at least one nearby neighbor. If you read the Wikipedia page and see an algorithm that might be more helpful, try to implement it and ask about that specific approach.
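One way the iteration could look, as a rough sketch: the second outlier, the seed, and the number of passes (`nToRemove`) are assumptions added for illustration. In practice you would have to choose when to stop yourself, for example with a threshold on the nearest-neighbor distance.

```matlab
% Requires the Statistics and Machine Learning Toolbox (pdist, squareform).
rng(0);                 % fixed seed, for repeatability only
x = randn(100,2);
x(101,:) = [10 10];     % a clear outlier
x(102,:) = [-12 9];     % a second, hypothetical outlier
nToRemove = 2;          % assumed number of passes

for k = 1:nToRemove
    d = squareform(pdist(x));            % full pairwise distance matrix
    d(logical(eye(size(d)))) = inf;      % ignore self-distances
    [~, outlierInd] = max(min(d, [], 2)); % point with the largest nearest-neighbor distance
    x(outlierInd,:) = [];                % drop that point and recompute on the next pass
end
```

Recomputing the distances after each removal matters, because removing one point changes the nearest-neighbor distance of any point that had it as its closest neighbor.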