Algorithms for text clustering

https://datascience.stackexchange.com/questions/979

16-10-2019

Question

I need to cluster a huge number of sentences into groups by their meaning.

Which algorithms are suggested for this? I don't know the number of clusters in advance (and as more data comes in, the clusters can change as well). What features are normally used to represent each sentence?

For now I'm trying the simplest features, just the list of words in each sentence, with the distance between sentences defined as:

(A and B are corresponding sets of words in sentence A and B)
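The distance formula itself is not reproduced in this post. As an illustration only, here is a minimal sketch of one common distance between word sets, the Jaccard distance 1 - |A ∩ B| / |A ∪ B|. This is an assumption, not necessarily the definition the asker used, and the helper names `word_set` and `jaccard_distance` are hypothetical.

```python
def word_set(sentence):
    """Lowercase a sentence and split it on whitespace into a set of words."""
    return set(sentence.lower().split())


def jaccard_distance(a, b):
    """Jaccard distance between two word sets: 1 - |A & B| / |A | B|.

    Assumed stand-in for the distance that is missing from the post.
    """
    union = a | b
    if not union:
        return 0.0  # two empty sentences are treated as identical
    return 1.0 - len(a & b) / len(union)


A = word_set("the cat sat on the mat")
B = word_set("the cat lay on the rug")
d = jaccard_distance(A, B)  # 3 shared words out of 7 distinct ones
```

With this choice, identical sentences are at distance 0 and sentences sharing no words are at distance 1. Word order and synonyms are ignored, so it captures lexical overlap rather than meaning.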

Does it make sense at all?

I'm trying to apply the Mean-Shift algorithm from the scikit-learn library to this distance, as it does not require the number of clusters in advance.
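One caveat: scikit-learn's `MeanShift` works on feature vectors and does not accept a precomputed pairwise distance matrix. The sketch below therefore swaps in a different, simpler technique that also avoids fixing the number of clusters: it links any two sentences whose set distance falls under a threshold and takes the connected components as clusters. The Jaccard distance, the `max_distance=0.6` threshold, and the function names are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Assumed set distance: 1 - |A & B| / |A | B| (0.0 for two empty sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def threshold_clusters(sentences, max_distance=0.6):
    """Group sentences connected by chains of pairs with distance <= max_distance."""
    word_sets = [set(s.lower().split()) for s in sentences]
    parent = list(range(len(sentences)))  # union-find forest, one node per sentence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of sufficiently close pairs.
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard_distance(word_sets[i], word_sets[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return list(groups.values())


sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
clusters = threshold_clusters(sentences)
```

The number of clusters falls out of the threshold rather than being set up front. The all-pairs loop is quadratic in the number of sentences, so for a huge corpus this only illustrates the idea.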

If anyone can suggest better methods or approaches for this problem, it would be very much appreciated, as I'm still new to the topic.

Solution

Check the Stanford NLP Group's open source software (http://www-nlp.stanford.edu/software), in particular the Stanford Classifier (http://www-nlp.stanford.edu/software/classifier.shtml). The software is written in Java, which will likely delight you, but it also has bindings for some other languages. Note the licensing: if you plan to use their code in commercial products, you have to acquire a commercial license.

Another interesting set of open source libraries, IMHO suitable for this task and much more, is GraphLab, a parallel framework for machine learning (http://select.cs.cmu.edu/code/graphlab). It includes a clustering library that implements various clustering algorithms (http://select.cs.cmu.edu/code/graphlab/clustering.html). It is especially suitable for very large volumes of data (like yours), as it implements the MapReduce model and thus supports multicore and multiprocessor parallel processing.

You are most likely aware of the following, but I will mention it just in case. The Natural Language Toolkit (NLTK) for Python (http://www.nltk.org) contains modules for clustering, classifying, and categorizing text. Check the relevant chapter in the NLTK Book: http://www.nltk.org/book/ch06.html.

UPDATE:

Speaking of algorithms, it seems that you've tried most of the ones from scikit-learn, such as those illustrated in this topic extraction example: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html. However, you may find other libraries useful, which implement a wide variety of clustering algorithms, including Non-Negative Matrix Factorization (NMF). One such library is Python Matrix Factorization (PyMF), with its home page at https://code.google.com/p/pymf and source code at https://github.com/nils-werner/pymf. Another, even more interesting, Python-based library is NIMFA, which implements various NMF algorithms: http://nimfa.biolab.si. Here's a research paper describing NIMFA: http://jmlr.org/papers/volume13/zitnik12a/zitnik12a.pdf. And here's an example from its documentation, which presents a solution to a very similar text processing problem, topic clustering: http://nimfa.biolab.si/nimfa.examples.documents.html.
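To make the NMF idea concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `NMF` rather than PyMF or NIMFA. The toy sentences, the choice of two components, and the rule of assigning each sentence to its largest component are assumptions made for illustration. Note that NMF needs the number of components up front, which the question wanted to avoid.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data, for illustration only).
sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]

# Represent each sentence as a non-negative TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Factorize X ~ W @ H, where W holds per-sentence topic weights
# and H holds per-topic word weights.
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(X)
H = model.components_

# Treat the strongest topic of each sentence as its cluster label.
labels = W.argmax(axis=1)
```

The rows of `H` can be inspected together with `vectorizer.get_feature_names_out()` to see which words drive each topic.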

Licensed under: CC-BY-SA with attribution
