Question

I have 18997 docs to cluster (the total can exceed 100K), split into chunks of 5000 docs each. I called partial_fit on each chunk with MiniBatchKMeans (MBKmeans), then tried to select the docs of each cluster with each_chunk[labels == e], but no docs are selected. How can I select docs by cluster?

[update #1] How should I configure MBKmeans for each batch? (total docs=100K, each batch=5000 docs)

[update #2] Is there an example of selecting cluster members after MBKmeans partial_fit?

Thanks in advance.

Here is my code:

    def selectDocsByCluster(self, chunks):
        centroids = self.kmeans.cluster_centers_
        labels = self.kmeans.labels_

        if self.verbose:
            print 'labels:', len(labels.tolist())

        checker_docs_count = 0

        collected_data = dict()

        for e, centroid in enumerate(centroids):
            members = labels == e

            for each_chunk in chunks:
                docs_by_cluster = each_chunk[members]

                if e in collected_data.keys():
                    collected_data[e].extend(docs_by_cluster.tolist())
                else:
                    collected_data.update({e:docs_by_cluster.tolist()})

            if self.verbose:
                print 'Members:', len(members.tolist()), type(members)

            total_selected = len(collected_data.get(e) or [])
            print 'clusterID:', e, "Total Docs:", total_selected

            checker_docs_count += total_selected

        print 'check docs count:', checker_docs_count

The detailed code is here.


Solution

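I finally found the answer by reading the scikit-learn API documentation and experimenting further. After partial_fit, labels_ only holds the labels of the chunk that was just fitted, so it cannot be used as a mask for the other chunks. Calling predict on each chunk gives the labels for that chunk: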

    from collections import Counter

    def partialFitChunks(self, chunks):
        """Partially fit MiniBatchKMeans on each vectorized chunk and predict its labels."""

        for e, each_chunk in enumerate(chunks):
            if self.verbose:
                print 'current chunkID:', e
                print each_chunk
                print each_chunk.tolist()[:10]

            self.kmeans.partial_fit(each_chunk)
            if self.verbose:
                print 'no. of label:', len(self.kmeans.labels_.tolist() or [])
                print 'clustered docs:', self.kmeans.counts_
                print 'total docs processed:', sum(self.kmeans.counts_.tolist())

            # predict() returns one cluster label per row of this chunk
            predicted = self.kmeans.predict(each_chunk)
            if self.verbose:
                predicted_tolist = predicted.tolist()
                print 'Total predicted docs:', len(predicted_tolist)
                counter = Counter(predicted_tolist)
                # sortByIndex is my own helper (not shown here)
                print 'By Cluster:', sortByIndex(counter.items(), 0, True)
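
The snippet above only counts the predictions per cluster. To actually select the members, one option is a two-pass approach: run partial_fit over every chunk first, then call predict on each chunk with the final centroids and mask that chunk with its own labels. The following is a minimal sketch in Python 3 syntax, not the original code. The function name `cluster_chunks` and the `(chunk_id, row)` index pairs are assumptions made for illustration, and it assumes each chunk is a 2-D array (or sparse matrix) with at least `n_clusters` rows.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def cluster_chunks(chunks, n_clusters, random_state=0):
    """Fit on every chunk, then label every chunk with the final centroids.

    Hypothetical helper: returns the fitted model and a dict that maps each
    cluster id to a list of (chunk_id, row_index) pairs.
    """
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=random_state)

    # Pass 1: update the centroids one chunk at a time.
    for chunk in chunks:
        kmeans.partial_fit(chunk)

    # Pass 2: predict() returns one label per row of the chunk it is given,
    # so the mask (labels == c) always matches the length of that chunk.
    members = {c: [] for c in range(n_clusters)}
    for chunk_id, chunk in enumerate(chunks):
        labels = kmeans.predict(chunk)
        for c in range(n_clusters):
            rows = np.flatnonzero(labels == c)
            members[c].extend((chunk_id, int(r)) for r in rows)
    return kmeans, members
```

With the index pairs, the docs of cluster `c` can be looked up as `chunks[chunk_id][row]` for each pair in `members[c]`, and the lengths of all lists should add up to the total number of docs. Regarding update #1: as far as I understand the API, each partial_fit call treats the chunk it receives as a single mini-batch, so with 5000-doc chunks the chunk size itself plays the role of the batch size.
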
Licensed under: CC-BY-SA with attribution