Question

I'm new to text mining, and I have a very large text file in which every line is a review of an item (a sentence).

I would like to find both the groups and the topics that exist within the reviews. So my question is what are the features, groups and topics of my data? Could the occurrence frequency of each word be used as a feature? Do we have to treat every line (review) as a document in itself and then cluster the reviews? I'm also wondering if the number of groups or topics should be known in advance, since in any unsupervised algorithm the number of clusters is supposed to be a known parameter.

My second question is: how can I edit this k-means clustering code to find the groups, and the NMF code to find the topics, using my reviews.txt file?

Solution

Firstly, as suggested in the comments, you can pick up the basics from a good book on text mining or information retrieval. My suggestion is: Introduction to Information Retrieval.

Now, to briefly answer your questions:

//my question is what are the features// - As in most text mining problems, the features in your case could be the terms (words) in every sentence. You can compute the term frequencies and use the TF-IDF representation, a very popular way of representing documents.
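
To make the TF-IDF idea concrete, here is a minimal sketch using only the Python standard library. It assumes whitespace tokenization and the plain tf * log(N / df) weighting; real libraries use smoothed variants, so the numbers will differ. The function name `tfidf` and the sample reviews are made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency (relative count) times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up sample data: one review per line, as in the question.
reviews = [
    "battery life is great",
    "battery died quickly",
    "screen is bright",
]
weights = tfidf(reviews)
```

A word that occurs in only one review (such as "great") ends up with a higher weight than a word shared by several reviews (such as "battery"), which is the point of the IDF factor.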

//groups// - Since every sentence represents an individual review, you can think of every sentence as a tiny document and use document clustering to identify the groups.

//topics of my data?// - Yes, there is something called topic modelling, which helps you identify the topics in a collection of documents. However, I'm not sure whether it applies to your problem.

//Do we have to treat every line (review) as a document in itself and then cluster the reviews?// - Yes.

//I'm also wondering if the number of groups or topics should be known in advance, since in any unsupervised algorithm the number of clusters is supposed to be a known parameter.// - This is not really the case. Many clustering algorithms, such as hierarchical clustering and affinity propagation, do not require the number of clusters to be known in advance. Even for algorithms that do require it, there are a number of ways to estimate it.
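
One common way to estimate the number of clusters (only one of several, and not necessarily the best for your data) is to run k-means for a range of values and keep the one with the highest silhouette score. A minimal sketch, assuming scikit-learn; the sample reviews and the candidate range 2 to 4 are placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up stand-in data; in practice, read one review per line from reviews.txt.
reviews = [
    "battery life is great",
    "battery drains too quickly",
    "the battery lasts all day",
    "screen is bright and sharp",
    "the screen cracked after a week",
    "lovely screen colours",
]

X = TfidfVectorizer(stop_words="english").fit_transform(reviews)

scores = {}
for k in range(2, 5):  # candidate cluster counts (assumption: 2 to 4)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    if len(set(labels)) < 2:
        continue  # the silhouette score needs at least two clusters
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On a real corpus the candidate range would be wider, and it is worth inspecting the resulting clusters by hand instead of trusting the score alone.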

Licensed under: CC-BY-SA with attribution