Question

I am trying to better understand how the values of my feature vector may influence the result. For example, let's say I have the following vector with the final value being the result (this is a classification problem using an SVC, for example):

0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682, -1.918, 0.251, 0.376, 0.025291666666667, -200, 9, 1

You'll notice that most of the values center around 0; however, there is one value, -200, whose magnitude is orders of magnitude larger than the rest.

I'm concerned that this value is skewing the prediction and is being weighted unfairly heavily compared to the rest, simply because its scale is so different.

Is this something to be concerned about when creating a feature vector? Or will the statistical test I use to evaluate my vector control for this large (or small) value based on the training set I provide it with? Are there methods available in scikit-learn specifically that you would recommend for normalizing the vector?

Thank you for your help!


Solution

Yes, it is something you should be concerned about. SVMs are sensitive to differences in feature scale, so you need a preprocessing step to counteract this. The most popular options are:

  1. Linearly rescale each feature dimension to the [0,1] or [-1,1] interval
  2. Normalize each feature dimension so it has mean=0 and variance=1
  3. Decorrelate (whiten) the features with the transformation sigma^(-1/2) * X, where sigma = cov(X) is the data covariance matrix

Each can be performed easily with scikit-learn (although the third also needs SciPy for the matrix square root and inversion).
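A minimal sketch of all three options, assuming X is an (n_samples, n_features) NumPy array; the toy data and the variable names (X_minmax, X_standard, X_white) are made up for illustration:

import numpy as np
from scipy.linalg import inv, sqrtm
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data: 20 samples, 3 features, one feature on a ~200x larger scale
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
X[:, 2] *= 200.0

# 1. rescale each feature to [0, 1] (use feature_range=(-1, 1) for [-1, 1])
X_minmax = MinMaxScaler().fit_transform(X)

# 2. standardize each feature to mean 0, variance 1
X_standard = StandardScaler().fit_transform(X)

# 3. decorrelate (whiten): multiply the centered data by sigma^(-1/2)
Xc = X - X.mean(axis=0)
sigma = np.cov(Xc, rowvar=False)            # feature covariance matrix
X_white = Xc @ np.real(inv(sqrtm(sigma)))   # sqrtm and inv come from SciPy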

Other tips

I am trying to better understand how the values of my feature vector may influence the result.

Then here's the math for you. Let's take the linear kernel as a simple example. It takes a sample x and a support vector sv, and computes the dot product between them. A naive Python implementation of a dot product would be

def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))

Now if one of the features has a much more extreme range than all the others (either in x or in sv, or worse, in both), then the term corresponding to this feature will dominate the sum.
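To make that concrete, here is a quick (illustrative) check using the vector from the question, with the class label dropped, dotted with itself; the printed values are approximate:

x = [0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682,
     -1.918, 0.251, 0.376, 0.025291666666667, -200, 9]

terms = [x_i * x_i for x_i in x]
print(sum(terms))        # ~40097, dominated by the (-200)**2 = 40000 term
print(sum(terms[:12]))   # ~16.3, the total contribution of the twelve small features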

A similar situation arises with the polynomial and RBF kernels. The poly kernel is just a (scaled and shifted) power of the linear kernel:

def poly_kernel(x, sv, d, gamma, coef0):
    # scikit-learn/libsvm parameterization: (gamma * <x, sv> + coef0) ** degree
    return (gamma * dot(x, sv) + coef0) ** d

and the RBF kernel is the exponential of the negative squared distance between x and sv, scaled by gamma:

from math import exp

def rbf_kernel(x, sv, gamma):
    diff = [x_i - sv_i for x_i, sv_i in zip(x, sv)]
    return exp(-gamma * dot(diff, diff))

In each of these cases, if one feature has an extreme range, it will dominate the result and the other features will effectively be ignored, except to break ties.

scikit-learn tools to deal with this live in the sklearn.preprocessing module: MinMaxScaler, StandardScaler, Normalizer.
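In practice you would usually wrap the scaler and the SVC in a Pipeline, so that the scaler is fit on the training data only and then reused for prediction. A minimal sketch, assuming X_train, y_train, and X_test already exist as arrays:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC())  # scaler is fit on the training data only
clf.fit(X_train, y_train)
print(clf.predict(X_test))   # X_test is scaled with the training-set statistics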
