What is the optimum way to select the most dissimilar individuals from a population?

Question 1

If k-means is the most time-consuming part, you can run it on a random subset of your population instead. As long as the subset is still large compared with the number of centroids you choose, the results should be mostly the same. Alternatively, you can run k-means on several subsets and merge the results.
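A minimal sketch of the subsetting idea, assuming a toy numeric matrix; the names `population`, `k` and `n_sub` and their values are made up for illustration and are not from the question. It fits k-means on a random subset and then picks, for each centroid, the closest individual from the full population.

```r
set.seed(1)  # only to make this sketch reproducible

# Toy population: 10,000 individuals, 3 variables (assumed shape)
population <- matrix(rnorm(3e4), ncol = 3)
k <- 20          # assumed number of clusters
n_sub <- 2000    # assumed subset size, much larger than k

# Fit k-means on a random subset of the rows only
idx <- sample(nrow(population), n_sub)
km <- kmeans(population[idx, ], centers = k, iter.max = 100)

# For each centroid, find the closest individual in the full population
closest <- apply(km$centers, 1, function(center) {
    which.min(colSums((t(population) - center)^2))
})

selected <- population[closest, , drop = FALSE]
```

How small the subset can get before the centroids start to drift depends on the data, so it is worth comparing a couple of subset sizes before committing to one.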

Another option is to try the k-medoids algorithm. Its cluster centres (medoids) are themselves members of the population, so the second step, finding the member of each cluster closest to its centroid, is no longer needed. It might be slower than k-means, though.

Question 2

You can try something like the code below, although I think the slowest part of your code is actually kmeans. For large datasets you may consider, depending on the shape of the data, reducing the nstart parameter or subsetting.

library(plyr)

markers <- data.frame(x=rnorm(1e6), y=rnorm(1e6), z=rnorm(1e6))

mostdiff <- function(markers, iter.max=1e5) {
    ncols <- ncol(markers)

    km <- kmeans(markers, 100, iter.max=iter.max)

    markers$cluster <- km$cluster
    # Squared Euclidean distance from each point to the centre of its own cluster
    markers$d <- rowSums(
        (as.matrix(markers[, 1:ncols]) - km$centers[markers$cluster, ])^2
    )

    # Keep, for each cluster, the row(s) with the minimum distance to the centre
    result <- subset(
        merge(
            ddply(markers, ~cluster, summarise, d=min(d)),
            markers,
            all.x=TRUE, all.y=FALSE
        ),
        select=-c(d, cluster)
    )

    return(result)
}

mostdiff(markers, 100)

Question 3

If you're looking for outliers in your population and not necessarily "markers" with which to identify them, I'd suggest using the Mahalanobis distance. It is usually the go-to first-line tool for outlier identification.

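k <- 1000 # Number of outliers we want from the population
# x is assumed to be a numeric matrix or data frame with one row per individual
ma.dist <- mahalanobis(x, colMeans(x), cov(x))
ix <- order(ma.dist, decreasing=TRUE) # Row indices, largest distance first
mdf <- x[ix[1:k], ] # The k individuals farthest from the centre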

Question 4

In case anybody else is trying to do the same thing, here is the answer based on damienfrancois's recommendation. Besides working on the raw data, pam (k-medoids) also lets us supply our own distance matrix, which is very important in cases where the marker data have many missing values.

library(BLR)

data(wheat)

library(cluster)

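# Cluster the markers (columns of X) into 100 groups
pam_out <- pam(t(X), 100)

# The medoid of each cluster is a selected marker
selec.markers <- as.data.frame(colnames(X)[pam_out$id.med])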
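The distance-matrix route mentioned above could look something like this. It is only a sketch on a made-up genotype matrix (`geno`, its size, the share of missing values and `k = 10` are all assumptions, not the wheat data). It relies on `dist()` tolerating `NA` values by dropping the affected pairs and rescaling, and on `pam()` treating a `dist` object as a dissimilarity matrix.

```r
library(cluster)

set.seed(1)  # only to make this sketch reproducible

# Toy genotype matrix: 50 individuals (rows) x 200 markers (columns), assumed shape
geno <- matrix(sample(c(0, 1), 50 * 200, replace = TRUE), nrow = 50)
colnames(geno) <- paste0("m", seq_len(ncol(geno)))
geno[sample(length(geno), 500)] <- NA   # inject about 5% missing values

# Distance between markers; dist() skips missing pairs and rescales the sum
d <- dist(t(geno))

# pam() recognises a "dist" object and clusters on the dissimilarities
pam_out <- pam(d, k = 10)

# Medoid indices map back to marker names
selected_markers <- colnames(geno)[pam_out$id.med]
```

Any other dissimilarity that copes better with your kind of missing data can be swapped in for `dist()`, as long as the result is a `dist` object or a full dissimilarity matrix passed with `diss = TRUE`.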