Python OneHotEncoder Using Many Dummy Variables or better practice?

https://datascience.stackexchange.com/questions/36645

31-10-2019

Question

I am building a neural network and have reached the point of applying OneHotEncoder to many independent (categorical) variables. I would like to know whether creating dummy variables is the right approach here, or whether there is a better way, given that every one of my variables needs them.

df
     UserName  Token                          ThreadID  ChildEXE
0    TAG       TokenElevationTypeDefault (1)  20788     splunk-MonitorNoHandle.exe
1    TAG       TokenElevationTypeDefault (1)  19088     splunk-optimize.exe
2    TAG       TokenElevationTypeDefault (1)  2840      net.exe
807  User      TokenElevationTypeFull (2)     18740     E2CheckFileSync.exe
808  User      TokenElevationTypeFull (2)     18740     E2check.exe
809  User      TokenElevationTypeFull (2)     18740     E2check.exe
811  Local     TokenElevationTypeFull (2)     18740     sc.exe

     ParentEXE            ChildFilePath                ParentFilePath               DependentVariable
0    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
1    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
2    dagent.exe           C:\Windows\System32          C:\Program Files\Dagent      0
807  wscript.exe          \Device\Mup\sysvol           C:\Windows                   1
808  E2CheckFileSync.exe  C:\Util                      \Device\Mup\sysvol\          1
809  cmd.exe              C:\Windows\SysWOW64          C:\Util\E2Check              1
811  cmd.exe              C:\Windows                   C:\Windows\SysWOW64          1

I import the data and apply LabelEncoder to the independent variables:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#IMPORT DATA
#Matrix x of features
X = df.iloc[:, 0:7].values
#Dependent variable
y = df.iloc[:, 7].values

#Encoding Independent Variable
#Need a label encoder for every categorical variable
#Converts categorical into number - set correct index of column
#Encode "UserName"
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Encode "Token"
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
#Encode "ChildEXE"
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
#Encode "ParentEXE"
labelencoder_X_4 = LabelEncoder()
X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
#Encode "ChildFilePath"
labelencoder_X_5 = LabelEncoder()
X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])
#Encode "ParentFilePath"
labelencoder_X_6 = LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])
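The six near-identical encoder blocks above could probably be collapsed into a loop. The following is only a minimal sketch, not the original code: `toy_df` is a made-up four-column stand-in for the real data, and `categorical_idx` and `encoders` are hypothetical names introduced for illustration. It assumes ThreadID (index 2) is left as a number, as in the code above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame (values are illustrative only)
toy_df = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "Token": ["Default", "Default", "Full", "Full"],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "E2check.exe", "sc.exe"],
})

# Object array of features; an explicit copy keeps it writable
X = toy_df.to_numpy(dtype=object, copy=True)

# Column positions holding categorical data (ThreadID at index 2 is skipped)
categorical_idx = [0, 1, 3]

# Keep each fitted encoder so the integer codes can be mapped back later
encoders = {}
for idx in categorical_idx:
    enc = LabelEncoder()
    X[:, idx] = enc.fit_transform(X[:, idx])
    encoders[idx] = enc
```

Keeping the fitted encoders in a dict means `encoders[0].inverse_transform(...)` can recover the original UserName strings if needed.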

This gives me the following array:

X
array([[2, 0, 20788, ..., 46, 31, 24],
       [2, 0, 19088, ..., 46, 31, 24],
       [2, 0, 2840, ..., 27, 42, 15],
       ...,
       [2, 0, 20148, ..., 17, 40, 32],
       [2, 0, 20148, ..., 47, 23, 0],
       [2, 0, 3176, ..., 48, 42, 32]], dtype=object)

Now for all of the independent variables I have to create dummy variables:

Should I use:

onehotencoder = OneHotEncoder(categorical_features = [0, 1, 2, 3, 4, 5, 6])
X = onehotencoder.fit_transform(X).toarray()

Which gives me:

X
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.]])

Or is there a better way to approach this?
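One possible alternative, offered as a sketch and not as an accepted answer: in scikit-learn 0.20 and later, OneHotEncoder accepts string columns directly, so the LabelEncoder step can be skipped, and ColumnTransformer selects which columns to encode. The `categorical_features` argument used above was deprecated in 0.20 and, as far as I know, removed in 0.22. The example assumes a made-up `toy_df` with only four of the columns, and it assumes ThreadID should be passed through as a number instead of being one-hot encoded; whether that suits the real data is a separate question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real DataFrame (values are illustrative only)
toy_df = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "Token": ["Default", "Default", "Full", "Full"],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "E2check.exe", "sc.exe"],
})
categorical_cols = ["UserName", "Token", "ChildEXE"]

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # ThreadID is appended unchanged after the dummies
    sparse_threshold=0,       # always return a dense array
)
X_encoded = preprocess.fit_transform(toy_df)

# A row containing an executable name that was not present during fit
new_row = pd.DataFrame({
    "UserName": ["TAG"],
    "Token": ["Full"],
    "ThreadID": [1],
    "ChildEXE": ["unseen.exe"],
})
X_new = preprocess.transform(new_row)
```

With many distinct executables and paths, the one-hot matrix gets wide quickly, so leaving `sparse_threshold` at its default and working with the sparse output may be preferable on the full data set.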

No correct solution

Licensed under: CC-BY-SA with attribution
