Question

I have the following candidates and their skills already indexed in Solr:

Candidate, Skills
-----------------
1, Java, JSP, Servlet, Spring, Hibernate
2, Java, JSP, JDBC
3, Java, JDBC, RMI
4, JDBC, SQL
5, .Net, C#

From this data I would like to build term-relationship data: for each skill, which other skills it is related to and how strongly. This information could later be used to improve candidate search for a given requirement, and to associate new skills properly with existing ones.
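One simple way to put a number on "how much they are related" is to compare the sets of candidates that list each skill. The sketch below is only an illustration, not something Solr or Mahout provides in this form: it uses Jaccard similarity as an assumed relatedness measure, the function names are hypothetical, and row 2 of the sample data is read as "Java, JSP, JDBC".

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the question (row 2 assumed to be "Java, JSP, JDBC").
candidates = {
    1: ["Java", "JSP", "Servlet", "Spring", "Hibernate"],
    2: ["Java", "JSP", "JDBC"],
    3: ["Java", "JDBC", "RMI"],
    4: ["JDBC", "SQL"],
    5: [".Net", "C#"],
}


def skill_to_candidates(data):
    """Invert the data: map each skill to the set of candidate ids listing it."""
    index = defaultdict(set)
    for candidate_id, skills in data.items():
        for skill in skills:
            index[skill].add(candidate_id)
    return dict(index)


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def skill_similarities(data):
    """Return {(skill_a, skill_b): score} for every pair that co-occurs."""
    index = skill_to_candidates(data)
    sims = {}
    for s1, s2 in combinations(sorted(index), 2):
        score = jaccard(index[s1], index[s2])
        if score > 0:
            sims[(s1, s2)] = score
    return sims


def related(skill, sims):
    """List (other_skill, score) pairs for one skill, strongest first."""
    out = []
    for (a, b), score in sims.items():
        if a == skill:
            out.append((b, score))
        elif b == skill:
            out.append((a, score))
    return sorted(out, key=lambda pair: (-pair[1], pair[0]))


sims = skill_similarities(candidates)
for other, score in related("Java", sims):
    print(f"Java ~ {other}: {score:.2f}")
```

With only five candidates the scores are coarse, but the same idea carries over to larger data, where the pairwise step is the part that would need a distributed tool.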

From my research, it seems I need to cluster my term vectors, perhaps with Mahout or Carrot2, but I am not sure how this can be done.

I believe Carrot2 does in-memory clustering, so scaling could be an issue; for that reason Mahout is my preferred option.

Solution

Mahout is a library of distributed, scalable machine learning algorithms. So if your data size is less than 500 GB and you do not expect to use more than one machine, Carrot2, Weka, or Python scikit-learn + NLTK is the right choice. Otherwise, use Mahout. A second point is that Mahout can work with Solr vectors "out of the box".

OTHER TIPS

Carrot2 is suited to clustering natural-language text (such as web pages or news articles), whereas your data is really a set of symbols. Therefore Carrot2 will not help you much with this task. Mahout does have a number of clustering algorithms suitable for your data; you can also try Weka, which comes with a comprehensive set of machine learning tools and a UI.
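To make "a set of symbols" concrete, each skill can be treated as a binary vector over candidates and grouped by vector similarity. The following is a rough single-machine sketch, not one of Mahout's or Weka's algorithms: it links skills whose cosine similarity reaches a threshold and takes the connected groups as clusters. The function names and the 0.5 threshold are assumptions chosen for illustration, and row 2 of the sample data is read as "Java, JSP, JDBC".

```python
from itertools import combinations
from math import sqrt

# Sample data from the question (row 2 assumed to be "Java, JSP, JDBC").
candidates = {
    1: ["Java", "JSP", "Servlet", "Spring", "Hibernate"],
    2: ["Java", "JSP", "JDBC"],
    3: ["Java", "JDBC", "RMI"],
    4: ["JDBC", "SQL"],
    5: [".Net", "C#"],
}


def skill_vectors(data):
    """Map each skill to a 0/1 vector with one position per candidate."""
    ids = sorted(data)
    skills = sorted({s for listed in data.values() for s in listed})
    return {s: [1 if s in data[c] else 0 for c in ids] for s in skills}


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_skills(vectors, threshold):
    """Group skills by linking every pair with cosine >= threshold.

    Uses a small union-find, so a cluster is a connected component of
    the thresholded similarity graph (single-link style grouping).
    """
    parent = {s: s for s in vectors}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in vectors:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


clusters = cluster_skills(skill_vectors(candidates), threshold=0.5)
for group in clusters:
    print(group)
```

Raising the threshold splits the groups further, so it acts as the knob for how strict "related" should be. The pairwise loop is quadratic in the number of skills, which is where a distributed library becomes relevant for large vocabularies.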

Licensed under: CC-BY-SA with attribution