Dummy variable on a vector with scikit onehotencoder

https://stackoverflow.com/questions/21634198

08-10-2022

Question

Let's say I have a vector of integers, where every integer corresponds to a category:

A = [1, 2, 2, 3, 3, 1, 2, 4, 4, 1]

I know how many categories I have. This vector is one of the columns of my X dataset, which will be fed into a logistic regression model.

Is it possible to use the scikit-learn OneHotEncoder to obtain something like:

0 0 0 1 (when 1)
0 0 1 0 (when 2)
0 1 0 0 (when 3)
1 0 0 0 (when whatever)

or even better

0 0 0
0 0 1
0 1 0
1 0 0

When I try to pass such a vector to OneHotEncoder, I get this error: need more than 1 value to unpack.

Furthermore, I suppose that if I have 'NULL' records I should first convert them to a number. Is there a fast way to do that, like A(find(A=='NULL'))=123?

Thank you for your help. Francesco

Solution

OneHotEncoder input needs to be 2-d, not 1-d (it expects a set of samples).

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> X = [[1, 2, 2, 3, 3, 1, 2, 4, 4, 1]]

Let's suppose that your categorical features can all take on four values:

>>> n_values = np.repeat(4, len(X[0]))
>>> n_values
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4])

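Then OneHotEncoder works fine (n_values is the parameter from older scikit-learn releases; it was removed in version 0.22 in favor of categories):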

>>> oh = OneHotEncoder(n_values=n_values)
>>> Xt = oh.fit_transform(X)
>>> Xt.toarray()
array([[ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
         1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
         0.]])
>>> Xt.shape
(1, 40)
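If the vector is meant to be a single categorical column (ten samples, one feature) rather than one sample with ten features, a rough sketch with a recent scikit-learn release could look like the following. It assumes the `categories` parameter (available since 0.20) and that the four categories are exactly 1 through 4; the names `X_col`, `encoder` and `dummies` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

A = [1, 2, 2, 3, 3, 1, 2, 4, 4, 1]

# Reshape the 1-D vector into a single column: 10 samples, 1 feature.
X_col = np.array(A).reshape(-1, 1)

# Assumption: the categories are exactly 1..4. Listing them explicitly
# fixes the column order even if a category is absent from the data.
encoder = OneHotEncoder(categories=[[1, 2, 3, 4]])

# fit_transform returns a sparse matrix; toarray() makes it dense.
dummies = encoder.fit_transform(X_col).toarray()

print(dummies.shape)
print(dummies[:3])
```

Each row then holds one dummy variable per category, with the columns in the order given to `categories`.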

It produces one dummy variable too many for each input variable, which is a bit wasteful. I've no idea what you mean by this NULL stuff since I don't know what your data looks like. You might want to open a separate question for that.
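To get the three-column encoding the question calls "even better", newer scikit-learn versions offer a `drop` parameter (added in 0.21, so not available when this answer was written). The sketch below assumes a single column with categories 1 through 4 and drops the first category, which then becomes the all-zeros row; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

A = [1, 2, 2, 3, 3, 1, 2, 4, 4, 1]
X_col = np.array(A).reshape(-1, 1)  # 10 samples, 1 feature

# Assumption: categories are exactly 1..4. drop="first" removes the
# column for category 1, leaving k - 1 = 3 dummy variables.
encoder = OneHotEncoder(categories=[[1, 2, 3, 4]], drop="first")
reduced = encoder.fit_transform(X_col).toarray()

print(reduced.shape)
print(reduced[:4])
```

Dropping one level is the usual way to avoid perfectly collinear dummy columns in an unregularized logistic regression with an intercept.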

Licensed under: CC-BY-SA with attribution