Python OneHotEncoder Using Many Dummy Variables or better practice?
31-10-2019
Question
I am building a neural network and am at the point of applying OneHotEncoder to many independent (categorical) variables. I would like to know whether creating dummy variables is the right approach or whether, since all of my variables require dummy variables, there is a better way.
df
     UserName  Token                          ThreadID  ChildEXE
0    TAG       TokenElevationTypeDefault (1)  20788     splunk-MonitorNoHandle.exe
1    TAG       TokenElevationTypeDefault (1)  19088     splunk-optimize.exe
2    TAG       TokenElevationTypeDefault (1)  2840      net.exe
807  User      TokenElevationTypeFull (2)     18740     E2CheckFileSync.exe
808  User      TokenElevationTypeFull (2)     18740     E2check.exe
809  User      TokenElevationTypeFull (2)     18740     E2check.exe
811  Local     TokenElevationTypeFull (2)     18740     sc.exe

     ParentEXE            ChildFilePath                ParentFilePath               DependentVariable
0    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
1    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
2    dagent.exe           C:\Windows\System32          C:\Program Files\Dagent      0
807  wscript.exe          \Device\Mup\sysvol           C:\Windows                   1
808  E2CheckFileSync.exe  C:\Util                      \Device\Mup\sysvol\          1
809  cmd.exe              C:\Windows\SysWOW64          C:\Util\E2Check              1
811  cmd.exe              C:\Windows                   C:\Windows\SysWOW64          1
I import the data and apply LabelEncoder to the independent variables:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# df is the DataFrame shown above
# Matrix X of features (first seven columns)
X = df.iloc[:, 0:7].values
# Dependent variable (eighth column)
y = df.iloc[:, 7].values

# Encode the independent variables: one LabelEncoder per categorical column,
# each converting categories to integers (set the correct column index)
# Encode "UserName"
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
# Encode "Token"
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
# Encode "ChildEXE"
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
# Encode "ParentEXE"
labelencoder_X_4 = LabelEncoder()
X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
# Encode "ChildFilePath"
labelencoder_X_5 = LabelEncoder()
X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])
# Encode "ParentFilePath"
labelencoder_X_6 = LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])
This gives me the following array:
X
array([[2, 0, 20788, ..., 46, 31, 24],
[2, 0, 19088, ..., 46, 31, 24],
[2, 0, 2840, ..., 27, 42, 15],
...,
[2, 0, 20148, ..., 17, 40, 32],
[2, 0, 20148, ..., 47, 23, 0],
[2, 0, 3176, ..., 48, 42, 32]], dtype=object)
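For comparison, the six per-column encoders above can be collapsed into a single step. The following is a minimal sketch, assuming scikit-learn 0.20 or later, where OrdinalEncoder is available for integer-coding feature columns (the scikit-learn documentation describes LabelEncoder as intended for the target y rather than for X). The small frame df_demo and the list categorical_cols are hypothetical stand-ins built from a few of the sample values, not the real data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for df, using a few values from the sample rows.
df_demo = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "ThreadID": [20788, 2840, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "E2check.exe", "sc.exe"],
})

# Columns to integer-code; ThreadID is already numeric and is left alone.
categorical_cols = ["UserName", "ChildEXE"]

# A single encoder handles every listed column; the learned categories
# are stored per column in ordinal.categories_ (sorted order).
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(df_demo[categorical_cols])
```

The integer codes produced this way imply an ordering between categories, which is usually not meaningful for names and paths, so they are best treated as an intermediate representation rather than as model input.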
Now I have to create dummy variables for all of the independent variables. Should I use the following?
onehotencoder = OneHotEncoder(categorical_features=[0, 1, 2, 3, 4, 5, 6])
X = onehotencoder.fit_transform(X).toarray()
Which gives me:
X
array([[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 1., 0., 0.]])
Or is there a better way to approach this?
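One commonly used alternative, shown here only as a hedged sketch, is to one-hot encode the string columns directly with a ColumnTransformer. Two facts about scikit-learn motivate it: since version 0.20 OneHotEncoder accepts string categories, so the LabelEncoder pass is not required first, and the categorical_features argument was deprecated in 0.20 and removed in 0.22. The names df_demo, categorical_cols, preprocess and new_row are hypothetical, and passing ThreadID through unchanged is an assumption made for the demo, not a modeling recommendation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for df, using a few values from the sample rows.
df_demo = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "ThreadID": [20788, 2840, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "E2check.exe", "sc.exe"],
    "DependentVariable": [0, 0, 1, 1],
})

features = df_demo.drop(columns="DependentVariable")
y = df_demo["DependentVariable"].to_numpy()

categorical_cols = ["UserName", "ChildEXE"]

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit
        # as all zeros instead of raising an error at transform time.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # assumption: keep ThreadID as-is
    sparse_threshold=0,       # always return a dense array for this demo
)

# One-hot columns come first, followed by the passed-through columns.
X_encoded = preprocess.fit_transform(features)

# A row containing an executable name that was not present during fit.
new_row = pd.DataFrame({
    "UserName": ["TAG"],
    "ThreadID": [1],
    "ChildEXE": ["unseen.exe"],
})
X_new = preprocess.transform(new_row)
```

With many distinct executables and paths the one-hot matrix becomes wide and mostly zeros, so on real data it may be preferable to leave sparse_threshold at its default and keep the sparse output where the downstream model accepts it.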
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange