Question

Problem statement:

We have documents, each containing a list of words. The documents are classified into two groups (say, good quality vs. bad quality).

docs -

doc1 = [w1,w2,w3,w4]
doc2 = [w4,w3,w3,w4]
doc3 = [w2,w4,w8,w1]
doc4 = [w5,w4,w0,w9]

doc group -

good_grp = [doc2, doc1]
bad_grp = [doc3, doc4]

Now we need to find out which words actually matter in making a document good vs. bad.

Idea 1: Merge all words from the documents in one group into a single document (the good-quality doc), do the same for the other group (the bad-quality doc), and calculate a tf-idf score per merged doc. In this case, however, we lose the document-level word information and only see word importance at the document-group level.

doc1 = [w1,w2,w3,w4]
doc2 = [w4,w3,w3,w4]
doc3 = [w2,w4,w8,w1]
doc4 = [w5,w4,w0,w9]

good_grp = [w1,w2,w3,w4,w4,w3,w3,w4]
bad_grp = [w2,w4,w8,w1,w5,w4,w0,w9]
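
For illustration, Idea 1 could be sketched as follows. This is only a minimal sketch in plain Python, assuming the classic unsmoothed formula tf * log(N / df); the helper name `tfidf` and the variable names are hypothetical, not from any library. With only two merged group documents, any word that occurs in both groups gets a score of exactly 0, which shows how coarse the group-level view is.

```python
import math
from collections import Counter

docs = {
    "doc1": ["w1", "w2", "w3", "w4"],
    "doc2": ["w4", "w3", "w3", "w4"],
    "doc3": ["w2", "w4", "w8", "w1"],
    "doc4": ["w5", "w4", "w0", "w9"],
}
groups = {"good": ["doc2", "doc1"], "bad": ["doc3", "doc4"]}

# Merge every document of a group into one long "group document".
merged = {g: [w for d in names for w in docs[d]] for g, names in groups.items()}


def tfidf(merged_docs):
    """Return {group: {word: tf-idf}} using tf * log(N / df) (assumed variant)."""
    n = len(merged_docs)
    # Document frequency: in how many group documents does each word occur?
    df = Counter(w for words in merged_docs.values() for w in set(words))
    scores = {}
    for name, words in merged_docs.items():
        tf = Counter(words)
        scores[name] = {
            w: (count / len(words)) * math.log(n / df[w])
            for w, count in tf.items()
        }
    return scores


group_scores = tfidf(merged)
for group, scores in group_scores.items():
    top = sorted(scores.items(), key=lambda kv: -kv[1])[:3]
    print(group, top)
```

On this toy data, only w3 gets a non-zero score in the good group, while w1, w2 and w4 score 0 in both groups because they appear in both merged documents.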

Can someone point me to a better approach, using tf-idf or any other technique, to solve this problem?


Solution

I think you should keep the actual per-document tf-idf and build the corpus from it. Assuming you already have labels for the documents, you can run a classifier over it.

The best classifier for this problem, I would expect, is Naive Bayes.
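
A rough sketch of what this could look like, written in plain Python so it runs without extra packages. It is an assumption-laden illustration, not a reference implementation: the tf-idf variant (smoothed idf), the Laplace smoothing constant `alpha`, and the helper names `tfidf_vector` and `log_likelihood` are all my own choices. The tf-idf weights are treated as fractional counts for a multinomial Naive Bayes model, and words are then ranked by the log-ratio of their likelihood under the two classes.

```python
import math
from collections import Counter, defaultdict

docs = {
    "doc1": ["w1", "w2", "w3", "w4"],
    "doc2": ["w4", "w3", "w3", "w4"],
    "doc3": ["w2", "w4", "w8", "w1"],
    "doc4": ["w5", "w4", "w0", "w9"],
}
labels = {"doc1": "good", "doc2": "good", "doc3": "bad", "doc4": "bad"}

n_docs = len(docs)
# Document frequency computed over the individual documents (not merged groups).
df = Counter(w for words in docs.values() for w in set(words))
vocab = sorted(df)


def tfidf_vector(words):
    """Per-document tf-idf with a smoothed idf (assumed variant)."""
    tf = Counter(words)
    return {
        w: (count / len(words)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
        for w, count in tf.items()
    }


# Sum the tf-idf weights of each word per class (used as fractional counts).
class_weights = defaultdict(Counter)
for name, words in docs.items():
    for w, value in tfidf_vector(words).items():
        class_weights[labels[name]][w] += value


def log_likelihood(cls, word, alpha=1.0):
    """log P(word | cls) for multinomial Naive Bayes with Laplace smoothing."""
    total = sum(class_weights[cls].values())
    return math.log((class_weights[cls][word] + alpha) / (total + alpha * len(vocab)))


# Positive log-ratio: word points towards "good"; negative: towards "bad".
log_ratio = {w: log_likelihood("good", w) - log_likelihood("bad", w) for w in vocab}
ranked = sorted(vocab, key=log_ratio.get, reverse=True)
for w in ranked:
    print(w, round(log_ratio[w], 3))
```

Unlike the merged-group approach, the document-level statistics are kept here: a word such as w4, which appears in every document, ends up close to neutral, while w3 leans towards the good class and w0, w5, w8, w9 lean towards the bad class.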

Other tips

A direct way to find the words which are the most representative of a class is to calculate the probability of the class given a word:

$$p(c|w)=\frac{\#\{\ d\ |\ label(d)=c\ \land w\in d\}}{\#\{\ d\ |\ w\in d\ \}}$$

Ranking the words according to their probability $p(c|w)$ gives:

  • highest values: the most correlated words for the class
  • lowest values: the least correlated words for the class

Remark: with this method it's safer to filter out the least frequent words (e.g. remove the words with frequency lower than 3), because their association with a class is likely due to chance, so they are not really representative.
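
As a minimal sketch, the probability above can be computed directly from document counts. The function name `class_given_word` and the `min_df` parameter are hypothetical, and "frequency" is interpreted here as document frequency, which is an assumption. The toy corpus from the question has only four documents, so the threshold defaults to 1; on real data a value such as 3 would match the remark.

```python
from collections import Counter

docs = {
    "doc1": ["w1", "w2", "w3", "w4"],
    "doc2": ["w4", "w3", "w3", "w4"],
    "doc3": ["w2", "w4", "w8", "w1"],
    "doc4": ["w5", "w4", "w0", "w9"],
}
labels = {"doc1": "good", "doc2": "good", "doc3": "bad", "doc4": "bad"}


def class_given_word(docs, labels, cls, min_df=1):
    """Estimate p(cls | w) as (#docs of cls containing w) / (#docs containing w)."""
    df = Counter()      # number of documents containing w
    df_cls = Counter()  # number of documents of class cls containing w
    for name, words in docs.items():
        for w in set(words):  # count each word once per document
            df[w] += 1
            if labels[name] == cls:
                df_cls[w] += 1
    # Drop words seen in fewer than min_df documents (too rare to trust).
    return {w: df_cls[w] / n for w, n in df.items() if n >= min_df}


p_good = class_given_word(docs, labels, "good")
for w, p in sorted(p_good.items(), key=lambda kv: (-kv[1], kv[0])):
    print(w, p)
```

On this toy data, w3 gets p(good|w) = 1.0, the words w1, w2 and w4 get 0.5, and w0, w5, w8, w9 get 0.0. Calling `class_given_word(docs, labels, "good", min_df=2)` would drop the four words that occur in a single document.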

Update:

What worked best with my data was converting the words into tf-idf vectors per document and applying Naive Bayes to them to predict the probability per document or word.

Licensed under: CC-BY-SA with attribution