Linear regression: Substituting the non-numerical discrete domain of a predictor with a numerical one

StackOverflow https://stackoverflow.com/questions/19512863

Question

So I have a training set and the domain of one of the attributes is the following:

A = {Type1, Type2, Type3, ... ,Type5}

If the domain remains in that form I can't apply linear regression, because the mathematical hypothesis can't possibly work, e.g.:

H = TxA + T1xB + T2xC + ...

(that is, if we assume that all of the attributes are numerical except for the A attribute, then you cannot multiply a real-valued parameter by a type)

Can I substitute the domain with equivalent numerical, discrete values so that I can do linear regression for this problem?

A = {1, 2, 3, ..., 5}

Is this the best practice? If not, can you please give me an alternative for these situations?


Solution

Best practice is to do a one-hot (one-of-K) encoding: for each value that A can take on, define a separate indicator feature. So with five "types", A = type1 would be

[1, 0, 0, 0, 0]

and A = type3 is

[0, 0, 1, 0, 0]

Then concatenate these vectors with your other features so that your hypothesis becomes

H = w[Atype1] * [A=type1] + ... + w[Atype5] * [A=type5] + w[B] * B + ...

using [] to denote indicator functions.

This avoids the main problem with your approach, which is that you're introducing a number of (probably incorrect) biases, e.g. that type5 = type2 + type3. For further intuition why this is better than your encoding, see this answer of mine.
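As a rough illustration of the encoding described in this answer, here is a minimal sketch using NumPy. The helper names `one_hot` and `build_features`, the second numeric attribute B, and all of the data values are made up for the example; they are not from the question. The model has no separate intercept term, matching the hypothesis H written above.

```python
import numpy as np

# The five values attribute A can take (names follow the answer's notation).
TYPES = ["type1", "type2", "type3", "type4", "type5"]

def one_hot(value, categories=TYPES):
    """Return the indicator vector for `value`, e.g. type3 -> [0, 0, 1, 0, 0]."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def build_features(a_values, b_values):
    """Concatenate the one-hot encoding of A with the numeric attribute B."""
    rows = [np.concatenate([one_hot(a), [b]]) for a, b in zip(a_values, b_values)]
    return np.array(rows)

# Hypothetical training data: categorical A and numeric B.
a_col = ["type1", "type3", "type5", "type2", "type4", "type3", "type1"]
b_col = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
X = build_features(a_col, b_col)

# Hypothetical "true" weights, used only to generate targets for the demo:
# one weight per type, plus one weight for B.
true_w = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 0.5])
y = X @ true_w

# Ordinary least squares on the encoded features.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```

Each type gets its own weight, so the fit does not have to assume any ordering or arithmetic relationship between the types.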

Other tips

In general this won't work, because an average of nominal attributes usually doesn't make sense. For example, if you assign Apple = 1, Banana = 2, Orange = 3, then in the model Banana would appear as the average of an Apple and an Orange. For classification tasks, consider using a perceptron, a neural network (using the winner-take-all (WTA) paradigm eliminates the problem of averaging between nominal attributes), a decision tree, or some other tools I forgot to mention. As larsmans correctly pointed out, a typical model for your case is logistic regression.

Possibly you could also use the WTA paradigm for linear regression, building a regression model for each of the output vector dimensions.

Clarification: WTA is the same as one-hot in larsmans's answer.
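The averaging problem with integer codes can be checked numerically. The sketch below assumes made-up target values for the three fruits, chosen so that Banana does not lie between Apple and Orange, and compares a fit on the integer code (with an intercept) against a fit on one-hot indicators.

```python
import numpy as np

fruits = ["Apple", "Banana", "Orange"]
codes = {"Apple": 1, "Banana": 2, "Orange": 3}

# Hypothetical targets: Banana's value is NOT between Apple's and Orange's.
y = np.array([10.0, 50.0, 20.0])

# Integer coding: y ~ w * code + b (second column is the intercept).
X_int = np.array([[codes[f], 1.0] for f in fruits])
w_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)
pred_int = X_int @ w_int

# One-hot coding: one indicator column (and one weight) per fruit.
X_hot = np.eye(len(fruits))
w_hot, *_ = np.linalg.lstsq(X_hot, y, rcond=None)
pred_hot = X_hot @ w_hot

print("integer coding:", pred_int)
print("one-hot coding:", pred_hot)
```

With the integer code, the prediction for Banana is always the midpoint of the predictions for Apple and Orange, whatever the data say. The one-hot fit is free to give each fruit its own value.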

License: CC-BY-SA with attribution
Not affiliated with StackOverflow