Question

I'm using scikit-learn for finding the Tf-idf weight of a document and then using the Naive
Bayesian classifier to classify the text. But the Tf-idf weight of all words in a documents are negative except a few. But as far as I know, negative values means unimportant terms. So is it necessary to pass the whole Tf-idf values to the bayesian classifier? If we only need to pass only a few of those, how can we do it? Also how better or worse is a bayesian classifier compared to a linearSVC? Is there a better way to find tags in a text other than using Tf-idf ?

Thanks


Solution

You have a lot of questions there, but I'll try to help.

As far as I remember, TF-IDF should not be a negative value. TF is the term frequency (how often a term appears in a particular document), and IDF is the inverse document frequency (the number of documents in the corpus divided by the number of documents that include the term), which is then usually log-weighted. We often add one to the denominator as well to avoid division by zero. Hence, the only time you would get a negative tf-idf is when the term appears in every single document of the corpus (which, as you mentioned, is not very helpful to search on since it adds no information). I would double-check your algorithm.

given term t, document d, corpus c:

tfidf = term_freq * log(document_count / (document_frequency + 1))
tfidf = [# of t in d] * log([# of d in c] / ([# of d with t in c] + 1))
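
As a sanity check, here is a minimal sketch of that formula in Python (the function and variable names are mine, and documents are assumed to be plain lists of tokens):

from math import log

def tfidf(term, doc, corpus):
    # term frequency: number of times the term appears in this document
    tf = doc.count(term)
    # document frequency: number of documents containing the term
    df = sum(1 for d in corpus if term in d)
    # the +1 avoids division by zero; the result only goes negative
    # when the term appears in every document (df == len(corpus))
    return tf * log(float(len(corpus)) / (df + 1))

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
print(tfidf("dog", corpus[1], corpus))  # positive: "dog" is in 1 of 3 docs
print(tfidf("the", corpus[0], corpus))  # negative: "the" is in every doc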

In machine learning, naive Bayes and SVMs are both good tools; their quality will vary depending on the application, and I've done projects where their accuracy turned out to be comparable. Naive Bayes is usually pretty easy to hack together by hand, so I'd give that a shot first before venturing into SVM libraries.
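
If you do want to compare the two in scikit-learn rather than by hand, a rough sketch might look like this (the toy texts and labels are placeholders; substitute your own data). Both classifiers plug into the same tf-idf features:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# placeholder data; substitute your own documents and labels
train_texts = ["good movie", "bad film", "great plot", "awful acting"]
train_labels = ["pos", "neg", "pos", "neg"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)  # sparse, non-negative tf-idf matrix

for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, train_labels)
    print(type(clf).__name__, clf.predict(vectorizer.transform(["great film"])))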

I might be missing something, as I'm not quite confident I know exactly what you're looking for. Happy to modify my answer.

OTHER TIPS

This bug has been fixed in the master branch. Be aware that the text vectorizer API has also changed a bit, to make it easier to customize the tokenization.

I am interested in this topic too. When I use Bayes classification (this Russian article about the Bayes algorithm may help you: http://habrahabr.ru/blogs/python/120194/), I use only the top 20 words of each document. I tried many values; in my experiments, the top 20 gave the best results. I also changed the usual tf-idf to this:

from math import log10

def f(word):
    # word.df is assumed to be the fraction of corpus documents containing
    # the word, so idf is positive only for words in fewer than half of them
    idf = log10(0.5 / word.df)
    if idf < 0:
        idf = 0
    return word.tf * idf

This way, the weight of "bad words" equals 0.
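
A small sketch of how the top-20 cut could then be applied with the function above (assuming, as in that snippet, word objects carrying tf and df attributes):

def top_words(words, k=20):
    # keep only the k highest-weighted words of a document;
    # words clamped to weight 0 naturally fall to the bottom
    return sorted(words, key=f, reverse=True)[:k]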
