Question

In the context of a large-scale data mining benchmarking study, I am comparing 15 algorithms over 9 data sets, for a total of 135 algorithm/data set combinations. The study is done using WEKA.

My last analysis is concerned with the influence of feature selection. I am aware that there is no such thing as the perfect feature selection algorithm; the optimal choice depends on both the algorithm to be deployed and the data set to which it will be applied.

Although the problem is too large to find the optimal feature selection algorithm for each combination, I am looking for ones that are considered to perform well in general, 'all-rounders' so to speak. So far I have found recommendations for CFS (correlation-based feature selection), ReliefF and consistency-based subset evaluation (Hall / Holmes 2002) as generally good choices, as well as a note from a survey that methods as simple as rankers (e.g. correlation coefficient) proved quite effective (Guyon / Elisseeff 2003).

Is there a good benchmark study or some other research indicating which methods to prefer, or which ones are commonly used in practice?

Solution

From a text classification point of view, there is an article by Yang et al. comparing different feature selection algorithms (chi-square, document frequency and information gain).

Although it focuses on text (i.e., document frequency won't apply to you at all), the others might, depending on the nature of your features (i.e., binary or not, always present, ...).
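To give a rough idea of how the two transferable criteria score a discrete feature, here is a minimal Python sketch using only the standard library. It is not WEKA's implementation and not the procedure from the article; the toy data and the helper names (`entropy`, `information_gain`, `chi_square`, `rank_features`) are made up for illustration, and it assumes features are already discrete (numeric features would need to be discretized first).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature/class contingency table."""
    total = len(labels)
    observed = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    class_counts = Counter(labels)
    stat = 0.0
    for value, n_value in feature_counts.items():
        for cls, n_cls in class_counts.items():
            expected = n_value * n_cls / total
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


def rank_features(columns, labels, score):
    """Return feature names sorted from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name], labels), reverse=True)


# Toy data (hypothetical): 8 instances, binary class.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "perfect": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the class
    "partial": [1, 1, 1, 0, 1, 0, 0, 0],  # agrees with the class on 6 of 8 instances
    "noise":   [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class
}

print(rank_features(columns, labels, information_gain))  # ['perfect', 'partial', 'noise']
print(rank_features(columns, labels, chi_square))        # ['perfect', 'partial', 'noise']
```

Both criteria score each feature independently of the others, so they behave like the simple rankers mentioned in the question: cheap to compute over many algorithm/data set combinations, but blind to redundancy between features.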

I hope this helps.

Licensed under: CC-BY-SA with attribution