Question

I want to implement the "Probability of Error and Average Correlation Coefficient" algorithm (more info: page 143). It is an algorithm for selecting not-yet-used features from a set of features. As far as I know, it is not limited to boolean-valued features, but I don't know how to use it for continuous features.

This is the only example I could find of this algorithm:

[image: worked example of the PoE calculation]

Here, X is the feature to be predicted and C is any feature. To calculate the Probability of Error (PoE) of C, they count the values that do not match the green pieces. Thus the PoE of C is (9/16)(1 - 7/9) + (7/16)(1 - 6/7) = 3/16 = 0.1875.

My question is thus: How can we use a continuous feature instead of a boolean feature, like in this example, to calculate PoE? Or is it not possible?


Solution

The algorithm that you describe is a feature selection algorithm, similar to the forward selection technique. At each step, we find a new feature Fi that minimizes this criterion :

weight_1 * ErrorProbability(Fi) + weight_2 * ACC(Fi)

ACC(Fi) represents the mean correlation between the feature Fi and the features already selected. You want to minimize it so that the selected features are as uncorrelated as possible, which gives a well-conditioned problem.
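A minimal sketch of one selection step under this criterion could look like the following. The names `acc`, `next_feature`, `poe`, `w1` and `w2` are hypothetical, the error probabilities are assumed to be already computed for every feature, and using the absolute value of the correlation is an assumption of this sketch rather than something stated above.

```r
# Hypothetical helper: mean absolute correlation between a candidate
# column and the columns already selected (0 if nothing is selected yet).
acc <- function(candidate, selected) {
  if (ncol(selected) == 0) return(0)
  mean(abs(cor(candidate, selected)))
}

# One greedy step: return the name of the unselected feature that
# minimizes w1 * ErrorProbability + w2 * ACC.
# `poe` is a named numeric vector with one error probability per feature.
next_feature <- function(features, selected_names, poe, w1 = 0.5, w2 = 0.5) {
  candidates <- setdiff(colnames(features), selected_names)
  scores <- sapply(candidates, function(f) {
    w1 * poe[[f]] +
      w2 * acc(features[, f], features[, selected_names, drop = FALSE])
  })
  names(which.min(scores))
}

# Illustrative data: b is perfectly correlated with a, c is not.
features <- data.frame(a = 1:10, b = 2 * (1:10) + 1, c = rep(c(1, -1), 5))
poe <- c(a = 0.10, b = 0.11, c = 0.20)
chosen <- next_feature(features, selected_names = "a", poe = poe)
```

Here `b` has a slightly lower error probability than `c`, but it is penalized for being redundant with the already selected `a`.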

ErrorProbability(Fi) measures how well the feature describes the variable you want to predict (the lower, the better). For example, let's say you want to predict whether tomorrow will be rainy based on temperature (a continuous feature).

The Bayes error rate (http://en.wikipedia.org/wiki/Bayes_error_rate) is one minus the probability of predicting the correct class:

P = 1 - Sum_Ci { Integral_{x in Hi} { P(x|Ci)*P(Ci) dx } }

In our example

  • Ci belong to {rainy ; not rainy}

  • x are instances of temperatures

  • Hi represent all temperatures that would lead to a Ci prediction.

What is interesting is that you can use any predictor you like to define the regions Hi.
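For the two-class rainy / not rainy example, the error probability can be written out explicitly. This is a sketch in LaTeX notation: the first line holds for any predictor, with H_rainy and H_not rainy the temperature regions it assigns to each class. The second line assumes the Bayes-optimal predictor, which picks the class with the larger P(x|Ci)*P(Ci) at each x.

```latex
\begin{aligned}
P_{\text{error}}
  &= \int_{H_{\text{not rainy}}} p(x \mid \text{rainy})\, P(\text{rainy})\, dx
   + \int_{H_{\text{rainy}}} p(x \mid \text{not rainy})\, P(\text{not rainy})\, dx \\
P_{\text{error}}^{*}
  &= \int \min\bigl\{\, p(x \mid \text{rainy})\, P(\text{rainy}),\;
                       p(x \mid \text{not rainy})\, P(\text{not rainy}) \,\bigr\}\, dx
\end{aligned}
```

Each term is the probability mass of one class that falls in the region assigned to the other class.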

Now, suppose you have all temperatures in one vector, all states rainy/not rainy in another vector :

In order to have P(x|Rainy), consider the following values :

temperaturesWhenRainy <- temperatures[which(state=='rainy')]

What you should do next is plot a histogram of these values and then try to fit a distribution to it. This gives you a parametric formula for P(x|Rainy).

If your distribution is Gaussian, this is simple:

m <- mean(temperaturesWhenRainy)
s <- sd(temperaturesWhenRainy)

Given some x value, the probability density P(x|Rainy) is:

p <- dnorm(x, mean = m, sd = s)
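To check whether the Gaussian assumption is reasonable, the fitted density can be overlaid on the histogram. The sketch below uses synthetic values in place of real measurements, so the numbers themselves are purely illustrative.

```r
# Hypothetical data standing in for temperatures observed on rainy days
set.seed(1)
temperaturesWhenRainy <- rnorm(200, mean = 12, sd = 3)

m <- mean(temperaturesWhenRainy)
s <- sd(temperaturesWhenRainy)

# Histogram on the density scale so it is comparable to dnorm()
h <- hist(temperaturesWhenRainy, freq = FALSE,
          main = "Temperature when rainy", xlab = "temperature")

# Overlay the fitted Gaussian density
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```

If the curve follows the bars poorly, another distribution family may describe the feature better.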

You can follow the same procedure for P(x|Not Rainy). P(Rainy) and P(Not Rainy) are then easy to compute: they are the fractions of rainy and non-rainy observations.
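One simple way to estimate the class probabilities is to use the observed frequencies. In this sketch the `state` vector is a small made-up example, and `pRainy` and `pNotRainy` are hypothetical names.

```r
# Hypothetical labels, one per observation
state <- c("rainy", "not rainy", "rainy", "not rainy", "not rainy")

# Empirical class frequencies used as estimates of P(Rainy) and P(Not Rainy)
pRainy <- mean(state == "rainy")
pNotRainy <- 1 - pRainy
```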

Once you have all of this, you can plug it into the Bayes error rate formula, which yields your ErrorProbability for a continuous feature.
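Putting the pieces together, a rough end-to-end sketch might look like this. It assumes Gaussian class-conditional densities, synthetic data, and the Bayes-optimal predictor (at each temperature, predict the class with the larger P(x|Ci)*P(Ci)), so the error mass at each x is the smaller of the two terms. All variable names beyond `temperatures` and `state` are hypothetical.

```r
# Synthetic data, purely illustrative (not from the original post)
set.seed(42)
temperatures <- c(rnorm(200, mean = 12, sd = 3), rnorm(300, mean = 20, sd = 4))
state <- c(rep("rainy", 200), rep("not rainy", 300))

rainy <- temperatures[state == "rainy"]
notRainy <- temperatures[state != "rainy"]

# Class priors estimated as empirical frequencies
pRainy <- mean(state == "rainy")
pNotRainy <- 1 - pRainy

# P(x | class) * P(class), assuming Gaussian class-conditional densities
jointRainy <- function(x) dnorm(x, mean(rainy), sd(rainy)) * pRainy
jointNotRainy <- function(x) dnorm(x, mean(notRainy), sd(notRainy)) * pNotRainy

# The Bayes-optimal predictor picks the larger term at each x,
# so the error contribution at x is the smaller of the two.
errorDensity <- function(x) pmin(jointRainy(x), jointNotRainy(x))

# Integrate numerically over a finite range that comfortably covers the data
lower <- min(temperatures) - 10
upper <- max(temperatures) + 10
errorProbability <- integrate(errorDensity, lower, upper)$value
errorProbability
```

The resulting `errorProbability` is what would be used as ErrorProbability(Fi) for the temperature feature in the selection criterion.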

Cheers

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow