Question

I have tried to use k-means clustering to select the most diverse markers in my population, for example, if we want to select 100 lines I cluster the whole population to 100 clusters then select the closest marker to the centroid from each cluster.

The problem with my solution is it takes too much time (probably my function needs optimization), especially when the number of markers exceeds 100000.

So, I will appreciate it so much if anyone can show me a new way to select markers that maximize diversity in my population and/or help me optimize my function to make it work faster.

Thank you

# example:

library(BLR)
data(wheat)
dim(X)
mdf<-mostdiff(t(X), 100,1,nstart=1000)

Here is the mostdiff function that i used:

mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
    transposedMarkers <- as.array(markers)
    mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
    save(mrkClust, file="markerCluster.Rdata")

    # within clusters, pick the markers that are closest to the cluster centroid
    # turn the vector of which markers belong to which clusters into a list nClust long
    # each element of the list is a vector of the markers in that cluster

    clustersToList <- function(nClust, clusters) {
        vecOfCluster <- function(whichClust, clusters) {
            return(which(whichClust == clusters))
        }
        return(apply(as.array(1:nClust), 1, vecOfCluster, clusters))
    }

    pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
        clustSize <- length(vecOfCluster)
        # if there are fewer than three markers, the center is equally distant from all so don't bother
        if (clustSize < 3) return(vecOfCluster[1:min(pickHowMany, clustSize)])

        # figure out the distance (squared) between each marker in the cluster and the cluster center
        distToCenter <- function(marker, center){
            diff <- center - marker    
            return(sum(diff*diff))
        }

        dists <- apply(transposedMarkers[vecOfCluster,], 1, distToCenter, center=centers[whichClust,])
        return(vecOfCluster[order(dists)[1:min(pickHowMany, clustSize)]]) 
    }
}
Was it helpful?

Solution

If the kmeans is the most consuming part, you can apply the k-means algorithm to a random subset of your population. If the size of the random subset is still large compared with the number of centroids you choose, you will get mostly the same results. Alternatively, you can run several kmeans on several subsets and merge the results.

Another option is to try the k-medoid algorithm, which will give centroids which are part of the population so the second part of finding the member of each cluster closest to its centroid will not be needed. It might be slower than the k-means though.

OTHER TIPS

You can try something like below, although I think that the slowest part of your code is actually kmeans. For large datasets you may consider, depending on shape of the data, reducing nstart parameter or subsetting.

library(plyr)

markers <- data.frame(x=rnorm(1e6), y=rnorm(1e6), z=rnorm(1e6))

mostdiff <- function(markers, iter.max=1e5) {
    ncols <- ncol(markers)

    km <- kmeans(markers, 100, iter.max=iter.max)

    markers$cluster <- km$cluster
    markers$d <- rowSums(apply(
        markers[,1:ncols] - km$centers[markers$cluster], 2, function(x) x * x
    ))

    result <- subset(
        merge(
            ddply(markers, ~cluster, summarise, d=min(d)),
            markers,
            x.all=T, y.all=F
        ),
        select=-c(d, cluster)
    )

    return(result)
}

mostdiff(markers, 100)

If you're looking for outliers in your population and not necessarily "markers" with which to identify them, I'd suggest using mahalanobis distance. It is usually the go-to first line tool for outlier identification.

k <- 1000 # Number of outliers from the population we want
n <- length(x)
ma.dist <- mahalanobis(x, colMeans(x), cov(x))
ix <- order(ma.dist)
mdf <- x[ix >= n - k]

In case any body else is trying to do the same thing. here is the answer based on damienfrancois recommendation: Beside using the raw data, pam k-medriod allow us to use our own distance matrix , which is very important in some cases where we have so many missing values in the marker data.

library(BLR)

data(wheat)

library(cluster)

pam_out<-pam(t(X),100)

selec.markers<-as.data.frame(colnames(X)[pam_out$id.med])
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top