Question

Tf-idf discounts words that appear in many documents of the corpus. I am building an anomaly-detection text classification algorithm that is trained only on valid documents; later I use a One-class SVM to detect outliers. Interestingly, tf-idf performs worse than a simple count vectorizer. At first I was confused, but it later made sense: tf-idf discounts exactly the attributes that are most indicative of a valid document. I was therefore thinking of a new approach that would give more weight to words that always appear in documents, or rather assign a negative weight to the absence of such words. I have a preset dictionary of words, so there is no worry that irrelevant words (such as "is" or "that") will be weighted.

Do you have any ideas for such a representation? The only thing I can think of is subtracting the document frequency from the attributes that are zero in a given document.
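
To make this concrete, here is a rough sketch of what I mean, assuming a scikit-learn `CountVectorizer` over my preset dictionary; the vocabulary and documents below are only placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder preset dictionary and training documents, just for illustration.
vocabulary = ["invoice", "total", "amount", "date"]
train_docs = ["invoice total amount date", "invoice amount date", "total amount"]

vectorizer = CountVectorizer(vocabulary=vocabulary)
X = vectorizer.fit_transform(train_docs).toarray()

# Document frequency of each word: fraction of training documents containing it.
df = (X > 0).mean(axis=0)

# Where a word is absent from a document, replace the zero count with -df,
# so that missing an "always present" word pushes the document away from
# the region the One-class SVM would learn as normal.
X_weighted = np.where(X > 0, X, -df)
print(X_weighted)
```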


Solution

I'm not aware of any standard representation which increases the importance of document-frequent words, but the IDF can simply be reversed: instead of the usual

$$idf(w,D)=\log\left(\frac{N}{|\{d\in D \mid w \in d\}|}\right)$$

you could use the following:

$$revidf(w,D)=\log\left(\frac{N}{|\{d\in D \mid w \notin d\}|}\right)$$
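
A minimal sketch of this reversed IDF, assuming scikit-learn's `CountVectorizer` for the document counts; the add-one smoothing is my own choice, to avoid dividing by zero for words that appear in every document:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def reversed_idf(docs, vocabulary=None):
    """Reversed-IDF weight for each vocabulary word, as in the formula above."""
    counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs).toarray()
    N = counts.shape[0]
    # |{d in D : w not in d}| for every word w
    absent = N - (counts > 0).sum(axis=0)
    # Add-one smoothing: words present in every document would otherwise divide
    # by zero; with smoothing they simply receive the largest weight.
    return np.log((N + 1) / (absent + 1))

docs = ["invoice total amount", "invoice amount", "invoice total"]
print(reversed_idf(docs))
```

These weights can then multiply the raw term counts, in the same way IDF multiplies term frequency in tf-idf.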

However, for the task you describe I would be tempted to try some more advanced feature engineering, typically features that represent how close the distribution of words in the current document is to the average distribution over the corpus.
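
One possible realisation of this idea is a per-document divergence from the corpus-average word distribution; the choice of (smoothed) KL divergence below is my own assumption, and cosine distance or a chi-squared statistic would work just as well:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder training documents, just for illustration.
docs = ["invoice total amount date", "invoice amount date", "total amount date"]
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

doc_dist = counts / counts.sum(axis=1, keepdims=True)  # per-document word distribution
avg_dist = doc_dist.mean(axis=0)                       # average distribution over the corpus

# Smoothed KL divergence of each document from the average distribution,
# usable as a single feature (or appended to the count features) for the One-class SVM.
eps = 1e-9
kl = (doc_dist * np.log((doc_dist + eps) / (avg_dist + eps))).sum(axis=1)
print(kl)
```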

Licensed under: CC-BY-SA with attribution