Question

Let's assume we have a collection of documents and wish to perform some unsupervised topic segmentation.

As always, we will perform some preprocessing (tokenization, accent removal, lowercasing, lemmatization, and so on) and transform the lists of tokens into either raw count vectors or TF-IDF vectors. We'll call the resulting document-term matrix M.
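For concreteness, a minimal sketch of that preprocessing step using scikit-learn's `TfidfVectorizer` (an assumption on my part; the toy documents and parameter choices are illustrative only — lemmatization would need an external tool such as spaCy plugged in via the `tokenizer` argument):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus purely for illustration
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "Stock markets fell sharply today.",
]

# Lowercasing and accent stripping are built in; tokenization is the
# default word-level regex. The result is a sparse document-term matrix M.
vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode")
M = vectorizer.fit_transform(docs)  # shape: (n_documents, n_terms)
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the raw-counts variant of M instead.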

Now we have several possible approaches to simple bag-of-words topic segmentation:

  • Apply a matrix decomposition method (LSI, LDA, NMF) directly to M and use the resulting components as the topics.
  • Embed each row of M into a semantic space (LSI, word2vec) and then apply a matrix decomposition method in that space.
  • Apply a clustering method (k-means, DBSCAN, mean-shift, Gaussian mixture models) directly to M.
  • Embed each row of M into a semantic space and then apply a clustering method in that space.
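To make the contrast concrete, here is a hedged sketch of the first approach (decomposition directly on M) and the last one (embed, then cluster), again using scikit-learn as an assumed toolkit; component and cluster counts are arbitrary for this toy corpus:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "Stock markets fell sharply today.",
]
M = TfidfVectorizer().fit_transform(docs)

# Approach 1: NMF directly on M; each component is a topic,
# and doc_topic[i, k] is document i's weight on topic k.
nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(M)

# Approach 4: embed via truncated SVD (i.e. LSI), then cluster the
# embedded documents; each cluster is treated as a topic.
embedding = TruncatedSVD(n_components=2, random_state=0).fit_transform(M)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embedding)
```

Conceptually, the decomposition route yields soft topic mixtures per document, while the clustering route assigns each document to exactly one topic (except with GMMs, which are also soft).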

I have two questions:

  1. Are there any other alternatives to bag-of-words topic segmentation that I have not considered yet?
  2. What are the conceptual differences between the methods described above and which one(s) are recommended?

Thanks in advance!

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange