Question

Almost all of the examples are based on numbers. In text documents, I have words instead of numbers.

So can you show me simple examples of how to use these algorithms for text document classification?

I don't need a code example, just the logic.

Pseudocode would help greatly.


Solution

The common approach is to use a bag-of-words model (http://en.wikipedia.org/wiki/Bag_of_words_model), in which the classifier learns from the presence of words in a text. It is simple but works surprisingly well.

Also, there is a similar question here: Prepare data for text classification using Scikit Learn SVM

Other tips

You represent each document as a vector in which each index position holds the "weight" of one term. For instance, take the document "hello world", associate position 0 with the importance of "hello" and position 1 with the importance of "world", and measure importance as the number of times the term appears. The document is then seen as d = (1, 1).

Likewise, a document saying only "hello" would be (1, 0).
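
To make the vector idea concrete, here is a minimal sketch in plain Python. The helper names build_vocabulary and count_vector are made up for illustration, and splitting on whitespace is an assumption; real text usually needs better tokenization.

```python
from collections import Counter

def build_vocabulary(documents):
    """Give each distinct term a fixed index, in order of first appearance."""
    vocabulary = {}
    for text in documents:
        for term in text.lower().split():
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)
    return vocabulary

def count_vector(text, vocabulary):
    """Return the document as a list of term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        vector[index] = counts[term]  # Counter returns 0 for absent terms
    return vector

docs = ["hello world", "hello"]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]

print(vocab)    # {'hello': 0, 'world': 1}
print(vectors)  # [[1, 1], [1, 0]]
```

Once every document is a vector of numbers like this, it can be fed to the same algorithms used in the numeric examples.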

This representation can be based on any measure of the importance of terms in documents, with term frequency (as suggested by @Pedrom) being the simplest option. The most common technique, which is still simple enough, is TF-IDF, which combines how common a term is in the document with how rare it is in the collection.
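
As a rough illustration of TF-IDF, the sketch below assumes one particular variant: tf is the term count divided by the document length, and idf is log(N / df) with no smoothing. Libraries often use smoothed variants, so their numbers will differ. The function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [text.lower().split() for text in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(["hello world", "hello"])
print(weights[0]["hello"])  # 0.0, because "hello" occurs in every document
print(weights[0]["world"])  # 0.5 * log(2), about 0.3466
```

With this variant, a term that appears in every document gets weight zero, which is the "rare in the collection" part of the idea: such a term does not help tell documents apart.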

I hope this helps.

In the bag-of-words model, you can use the term frequencies and assign weights to them according to their occurrence in the new document and the training documents. After that, you can use a similarity function to calculate the similarity between the training and test documents.
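
One possible reading of this, sketched under assumptions: use raw term counts as the weights, cosine similarity as the similarity function, and give the new document the label of its most similar training document (a 1-nearest-neighbour rule). The answer does not name a specific similarity function, so these choices, the helper names, and the toy training data are all illustrative.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag of words as a {term: count} mapping (whitespace tokenization assumed)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

def classify(text, training):
    """Return the label of the training document most similar to text."""
    query = term_counts(text)
    best_label, best_score = None, -1.0
    for train_text, label in training:
        score = cosine_similarity(query, term_counts(train_text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set, invented for this example.
training = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
]

print(classify("buy cheap pills", training))  # spam
```

The same loop works with TF-IDF weights in place of raw counts; only term_counts would change.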

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow