Mahout is a library for distributed, scalable machine learning algorithms. So if your data is smaller than 500 GB and you do not expect to use more than one machine, Carrot2, Weka, or Python scikit-learn + NLTK is the right choice. Otherwise, use Mahout. A second point is that Mahout can work with Solr vectors "out of the box".
Term relations and scores from Solr
18-06-2023
Question
I have the following candidates and their skills already indexed in Solr:
Candidate, Skills
-----------------
1, Java, JSP, Servlet, Spring, Hibernate
2, Java, JSP, JDBC
3, Java, JDBC, RMI
4, JDBC, SQL
5, .Net, C#
From the above I would like to build term-relationship data for each skill, with a score for how strongly skills are related, so that this information can later be used to improve candidate search for any requirement, and so that new skills can be properly associated with existing ones.
From my research, it seems I need to cluster my term vectors, perhaps with Mahout or Carrot2, but I am not sure how this can be done.
I believe Carrot2 does in-memory clustering, so scaling could be an issue; my preferred option is therefore Mahout.
Solution
OTHER TIPS
Carrot2 is suited to clustering natural text (such as web pages or news articles), while your data is really a set of symbols, so Carrot2 will not help you much with this task. Mahout does have a number of clustering algorithms suitable for your data. You can also try Weka, which comes with a comprehensive set of machine learning tools and a UI.
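Before reaching for a clustering library, it may help to see what a "skill relatedness score" could look like on the sample data. The following is only a minimal in-memory sketch in plain Python, not Mahout or Solr code. It assumes Jaccard similarity over the sets of candidates holding each skill as the relatedness measure (one possible choice among many), and the names `candidates` and `skill_relatedness` are made up for illustration. Candidate 2 is read as "Java, JSP, JDBC".

```python
from collections import defaultdict
from itertools import combinations

# Candidate id -> skills, copied from the question
# (candidate 2 is assumed to mean "Java, JSP, JDBC").
candidates = {
    1: ["Java", "JSP", "Servlet", "Spring", "Hibernate"],
    2: ["Java", "JSP", "JDBC"],
    3: ["Java", "JDBC", "RMI"],
    4: ["JDBC", "SQL"],
    5: [".Net", "C#"],
}


def skill_relatedness(candidates):
    """Return {frozenset({skill_a, skill_b}): Jaccard score} for co-occurring pairs.

    Jaccard score = (candidates having both skills) / (candidates having either).
    This measure is an illustrative assumption, not something Solr or Mahout prescribes.
    """
    holders = defaultdict(set)  # skill -> ids of candidates listing that skill
    for candidate_id, skills in candidates.items():
        for skill in skills:
            holders[skill].add(candidate_id)

    scores = {}
    for a, b in combinations(holders, 2):
        shared = holders[a] & holders[b]
        if shared:  # keep only pairs that appear together at least once
            scores[frozenset((a, b))] = len(shared) / len(holders[a] | holders[b])
    return scores


scores = skill_relatedness(candidates)

# Print the pairs from most to least related.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(sorted(pair), round(score, 2))
```

On this data, Java and JSP score 2/3 (two shared candidates out of three holding either skill), while Java and .Net never co-occur and so get no score at all. The resulting pair scores can serve as a similarity matrix for a clustering step, or be used directly to expand a search for one skill with its closest neighbours. Everything here is held in memory, so for data too large for one machine the same co-occurrence counting would need a distributed implementation.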