Question

Let's assume we have a collection of documents and wish to perform some unsupervised topic segmentation.

As always, we will perform some preprocessing (tokenization, accent removal, lowercasing, lemmatization, and so on) and transform the lists of tokens into either raw count vectors or TF-IDF vectors. We'll call the resulting document-term matrix M.
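For concreteness, a minimal sketch of that preprocessing step using scikit-learn's `TfidfVectorizer` (an assumption on my part; the toy documents and parameter choices are illustrative only — lemmatization would need an external tool such as spaCy plugged in via the `tokenizer` argument):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus purely for illustration
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "Stock markets fell sharply today.",
]

# Lowercasing and accent stripping are built in; tokenization is the
# default word-level regex. The result is a sparse document-term matrix M.
vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode")
M = vectorizer.fit_transform(docs)  # shape: (n_documents, n_terms)
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the raw-counts variant of M instead.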

Now we have several possible approaches to simple bag-of-words topic segmentation:

  • Apply a matrix decomposition method (LSI, LDA, NMF) directly to M and use the resulting components as the topics.
  • Embed each row of M into a semantic space (LSI, word2vec) and then apply a matrix decomposition method in that space.
  • Apply a clustering method (k-means, DBSCAN, mean-shift, Gaussian mixture models) directly to M.
  • Embed each row of M into a semantic space and then apply a clustering method in that space.
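To make the contrast concrete, here is a hedged sketch of the first approach (decomposition directly on M) and the last one (embed, then cluster), again using scikit-learn as an assumed toolkit; component and cluster counts are arbitrary for this toy corpus:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "Stock markets fell sharply today.",
]
M = TfidfVectorizer().fit_transform(docs)

# Approach 1: NMF directly on M; each component is a topic,
# and doc_topic[i, k] is document i's weight on topic k.
nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(M)

# Approach 4: embed via truncated SVD (i.e. LSI), then cluster the
# embedded documents; each cluster is treated as a topic.
embedding = TruncatedSVD(n_components=2, random_state=0).fit_transform(M)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embedding)
```

Conceptually, the decomposition route yields soft topic mixtures per document, while the clustering route assigns each document to exactly one topic (except with GMMs, which are also soft).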

I have two questions:

  1. Are there any other alternatives to bag-of-words topic segmentation that I have not considered yet?
  2. What are the conceptual differences between the methods described above and which one(s) are recommended?

Thanks in advance!

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange