Question

I'm looking for a Java matrix library to perform data analysis and implement clustering algorithms (like k-means or DBSCAN).

I found Colt and Parallel Colt (the best performing with both large and small data sets), but apparently they do not support String matrices. Matrix entries are expected to be double values only.

Are there any suggestions?

Thank you in advance for your help.


Solution

Have a look at ELKI. It supports arbitrary distance functions and already includes cosine distance, so it should be able to run these algorithms on text data.

Note that for most applications you will want to convert your string data to TF-IDF vectors first, since cosine distance is defined on numerical vectors. These vectors are usually sparse, so optimized handling of sparse vectors pays off.
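To make the TF-IDF step concrete, here is a minimal sketch in plain Java. It is not ELKI's API: the class name `SparseTfIdf`, the whitespace/punctuation tokenizer, and the `ln(N / df)` IDF weighting are all assumptions chosen for illustration. Each document becomes a sparse map from term to weight, and the cosine distance only iterates over the smaller map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of ELKI or Colt.
public class SparseTfIdf {

    /** Builds one sparse TF-IDF vector (term -> weight) per document. */
    public static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            // Assumption: naive tokenizer, lowercase and split on non-word characters.
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    tf.merge(token, 1, Integer::sum);
                }
            }
            // Count each term once per document for the document frequency.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termCounts.add(tf);
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : termCounts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumption: unsmoothed IDF, ln(N / df).
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double weight = e.getValue() * idf;
                if (weight != 0.0) {
                    vec.put(e.getKey(), weight); // store only non-zero entries
                }
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine distance, 1 - cos(a, b), on sparse vectors. */
    public static double cosineDistance(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // convention chosen here for empty vectors
        }
        return 1.0 - dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<Map<String, Double>> v =
                tfIdf(Arrays.asList("apple banana", "banana apple", "car engine"));
        System.out.println(cosineDistance(v.get(0), v.get(1))); // same terms: close to 0
        System.out.println(cosineDistance(v.get(0), v.get(2))); // no shared terms
    }
}
```

A distance like this can then be plugged into any clustering implementation that accepts a custom distance function. For real data sets, a library's own sparse vector type will likely be faster than hash maps.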

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow