Question

I am trying to use scikit-learn for Naive Bayes classification. I have a couple of questions (I am also new to scikit-learn):

1) Scikit-learn algorithms expect the input as a NumPy array and the labels as an array. For text classification, should I map each word to a numeric ID by maintaining a hash of the vocabulary words with a unique ID for each? Is this the standard practice in scikit-learn?

2) How should I proceed when the same text is assigned to more than one class? One obvious way is to replicate each training example, once per associated label. Does a better representation exist?

3) Similarly, for the test data, how will I get more than one class associated with a test example?

I am using http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html as my base.


Solution

1) Yes. Use DictVectorizer or HashingVectorizer from the feature_extraction module; you don't need to maintain the word-to-ID mapping yourself.

2) This is a multilabel problem. Use OneVsRestClassifier from the multiclass module. It will train a separate classifier for each class.

3) Using a multilabel classifier / one classifier per class will do that.
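Putting the three answers together, here is a minimal sketch (the sample documents and label names are made up for illustration): CountVectorizer from feature_extraction.text handles the word-to-ID mapping, MultiLabelBinarizer turns the label sets into a binary indicator matrix, and OneVsRestClassifier wraps MultinomialNB to train one classifier per class:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: each document carries a *set* of labels,
# so a document can belong to more than one class.
docs = [
    "the match ended in a draw",
    "new phone released with a faster chip",
    "phone app tracks the football match",
]
labels = [{"sports"}, {"tech"}, {"sports", "tech"}]

vectorizer = CountVectorizer()      # builds the word -> column-index vocabulary for you
X = vectorizer.fit_transform(docs)  # sparse document-term count matrix

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)       # binary indicator matrix, one column per class

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X, Y)

# Prediction returns an indicator matrix; inverse_transform recovers the
# label sets, so each test document can get zero, one, or several classes.
pred = clf.predict(vectorizer.transform(["football match on my phone"]))
print(mlb.inverse_transform(pred))
```

Replicating examples once per label (as suggested in the question) also works with a plain single-label classifier, but the indicator-matrix representation above is the form scikit-learn's multilabel tools expect.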

Take a look at http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html and http://scikit-learn.org/dev/auto_examples/plot_multilabel.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow