Question

I want to cluster 1.5 million chemical compounds. This means having a 1.5 million x 1.5 million distance matrix...

I think I can generate such a big table using PyTables, but now, having such a table, how will I cluster it?
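For illustration, here is a minimal sketch of what generating such an on-disk table with PyTables could look like. The file name distances.h5, the float32 element type, and the random placeholder block are all assumptions, not part of the original post:

```python
import numpy as np
import tables

N = 1_500_000  # number of compounds

# Create a chunked, compressed on-disk N x N array; blocks are written
# (and later read) piecewise, so the full matrix never sits in RAM.
with tables.open_file("distances.h5", mode="w") as h5:
    dist = h5.create_carray(
        h5.root, "dist",
        atom=tables.Float32Atom(),          # assumed element type
        shape=(N, N),
        filters=tables.Filters(complevel=5, complib="blosc"),
    )
    # Fill row blocks as distances are computed, e.g.:
    block = np.random.rand(10, N).astype(np.float32)  # placeholder distances
    dist[0:10, :] = block
```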

I guess I can't just pass a PyTables object to one of scikit-learn's clustering methods...

Are there any Python-based frameworks that would take my huge table and do something useful (like clustering) with it? Perhaps in a distributed manner?


Solution

Maybe you should look at algorithms that don't need a full distance matrix.

I know that it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operations (and slow at other things). But there is a whole ton of methods that don't require O(n^2) memory...
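To make that concrete, here is a hedged sketch with random stand-in data and a hypothetical 1024-bit fingerprint representation: if each compound is a fixed-length feature vector rather than a row of pairwise distances, a mini-batch method such as scikit-learn's MiniBatchKMeans clusters the data directly, without ever materializing the O(n^2) matrix:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in data: 10,000 compounds as random 1024-bit fingerprints
# (in practice these would come from a cheminformatics toolkit).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 1024)).astype(np.float32)

# Mini-batch k-means streams the data in small batches, so memory
# stays O(n * d) instead of the O(n^2) of a full distance matrix.
km = MiniBatchKMeans(n_clusters=100, batch_size=1_000, random_state=0)
labels = km.fit_predict(X)
print(labels[:10])
```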

Other tips

I think the main problem is memory: 1.5 million x 1.5 million elements x 10 B per element is more than 20 TB. You can use a big-data store like PyTables, or Hadoop http://en.wikipedia.org/wiki/Apache_Hadoop with the MapReduce algorithm.
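A quick back-of-the-envelope check of that estimate (the 10 B element size is the assumption made above):

```python
n = 1_500_000           # number of compounds
bytes_per_element = 10  # element size assumed above
total_bytes = n * n * bytes_per_element
print(total_bytes / 1e12, "TB")  # 22.5 TB, i.e. more than 20 TB
```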

Here are some guides: http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html

Or use the Google App Engine Datastore with MapReduce https://developers.google.com/appengine/docs/python/dataprocessing/ - but it is not a production version yet.

Licensed under: CC-BY-SA with attribution