Pergunta

This may sound very naive but i just wanted to be sure that when talking in Machine Learning terminology, features in Document Clustering is words which are chosen from a document, if some are discarded after stemming or as stop-words.

I am trying to use LibSvm library and it says that there are different approaches for different types of { no_of_instances, no_of_features }.

Like if no_of_instances is much lower than no_of_features, linear kernels would do. if both are large, linear would be fast. However, if no_of_features is small, non-linear kernels is better.

So for my document clustering/classification, i have small number of documents like 100 each may have words around 2000. So i fall in small no_of_instances and large no_of_features category depending upon what i consider a feature is.

I would like to use tf-idf for the document.

So Is no_of_features is the size of the vector that i get from tf-idf ?

Foi útil?

Solução

What you are talking about here is just one of possibilities, in fact the most trivial way of defining features for documents. In machine learning terminology feature is any mapping from the input space (in this particular example - from space of documents) into some abstract space, which is suited for a particular machine learning model. Most of the ML models (like neural networks, support vector machines, etc.) work on the numerical vectors, so features has to be mappings from documents to (constant size) vectors of numbers. This is a reason for sometimes choosing a representation of bag of owrds, where we have a words' count vector as a document representation. This limitation can be overcomed by using specific models, like for example Naive Bayes (or a custom kernel for SVM, which enables them to work with nonnumeric data), which can work on any objects, as long as we can define perticular conditional probabilities - here, the most basic approach is treating document containing a particular word or not as a "feature". In general this is not the only possibility, there are dozens of methods that use statistical features, semantic features (based on some ontologies like wordnet) etc.

To sum up - this is only one, most simple representation of document for the machine learning model. Good to start with, good to understand the basics, but far from being a "feature definition".

EDIT

no_of_features is size of the vector you use for your documents' representation, so if you use tf-idf, then size of resulting vecor is a no_of_featuers.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top