Question

I was hoping to get a brief explanation of how TF-IDF produces features that can be used for machine learning. What are the differences between bag of words and TF-IDF? I understand how TF-IDF works, but not how features are made with it or how those features are used in classification/regression.

I am using scikit-learn; what does the following code actually do, theoretically and in practice? I have commented it with my understanding and some questions; any help would be really appreciated:

  import numpy as np
  import pandas as p
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn import linear_model as lm

  traindata = list(np.array(p.read_table('data/train.tsv'))[:, 2])  # taking in data for TF-IDF, I get this
  testdata = list(np.array(p.read_table('data/test.tsv'))[:, 2])    # taking in data for TF-IDF, I get this
  y = np.array(p.read_table('data/train.tsv'))[:, -1]               # labels for our data

  tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                        analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                        use_idf=True, smooth_idf=True,
                        sublinear_tf=True)  # making TF-IDF object with params to dictate how it should behave

  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                             C=1, fit_intercept=True, intercept_scaling=1.0,
                             class_weight=None, random_state=None)

  X_all = traindata + testdata  # concatenating the two lists of documents
  lentrain = len(traindata)  # what is this?
  tfv.fit(X_all)  # is this where features are created? Are all words used as features? What happens here?
  X_all = tfv.transform(X_all)  # transforms our list of documents into a TF-IDF (sparse) matrix
  X = X_all[:lentrain]
  X_test = X_all[lentrain:]
  rd.fit(X, y)  # train LR on newly made feature set with a feature for each word?

Solution

I guess idf is what confuses you here, since bag of words is just the term frequency (tf) of each word in a document. So why idf? idf is a way to estimate how important a word is. Document frequency (df), the number of documents a word appears in, is a good signal for classification: a word that appears in fewer documents (for example, "nba" will almost always appear in documents about sports) discriminates better between classes. Since idf is the inverse of df, it is positively correlated with a word's importance.
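To make the effect concrete, here is a minimal sketch (the three-document toy corpus and variable names are illustrative, not from the question). scikit-learn's TfidfVectorizer exposes the learned idf weights through its idf_ attribute, so you can see that a word occurring in every document ("the") gets a lower idf than a word confined to one topic ("nba"):

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = [
      "the nba finals were exciting",   # sports
      "the nba draft is tonight",       # sports
      "the election results are in",    # politics
  ]

  tfv = TfidfVectorizer()
  tfv.fit(docs)  # learns the vocabulary and one idf weight per term

  # "the" appears in all three documents, so its idf is lower than that of
  # "nba" (two documents) or "election" (one document).
  for word in ("the", "nba", "election"):
      print(word, tfv.idf_[tfv.vocabulary_[word]])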

OTHER TIPS

Tf-idf is the most common vector representation for documents. It takes into account the frequency of each word in a text and also across the whole document corpus. Note that the method is a heuristic rather than something derived from a mathematical proof: it simply works well in practice in a bunch of contexts, such as document similarity using cosine distance or other metrics.
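As an illustration of the document-similarity use case, here is a minimal sketch (the toy documents are my own). fit_transform produces one tf-idf row vector per document, and cosine_similarity compares those rows; the two cat sentences score far closer to each other than to the finance sentence:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  docs = [
      "the cat sat on the mat",
      "a cat lay on a mat",
      "stock prices fell sharply today",
  ]

  X = TfidfVectorizer().fit_transform(docs)  # one sparse tf-idf row per document
  print(cosine_similarity(X))                # 3x3 matrix of pairwise similarities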

Licensed under: CC-BY-SA with attribution