How to use dummy variable to represent categorical data in python scikit-learn random forest

https://stackoverflow.com/questions/15821751

01-04-2022
|

Question

I'm generating feature vector for random forest classifier of scikit-learn . The feature vector represents the name of 9 protein amino acid residues. There are 20 possible residue names. So, I use 20 dummy variables to represent one residue name, for 9 residue, I have 180 dummy variables.

For example, if the 9 residues in the sliding window are: ARNDCQEGH (every one letter represent a name of a protein residue),my feature vector will be:

"True\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\n"

Also, I tried to use (1,0) to replace (True,False)

After training and testing Scikit's random forest classifier model, I found it totally did not work. But Scikit's random forest can work with my other numerical data.

Can Scikit's random forest deal with categorical variable or dummy variable? If so, could you provide an example showing how it works.

Here is how I set the random forest:

clf=RandomForestClassifier (n_estimators=800, criterion='gini', n_jobs=12, max_depth=None, compute_importances=True, max_features='auto', min_samples_split=1,  random_state=None)

Thanks a lot in advance!

Solution

Using boolean features encoded as 0 and 1 should work. If the predictive accuracy is bad even with a large number of decision trees in your forest it might be the case that your data is too noisy to get the learning algorithm to not pickup any think interesting.

Have you tried to fit a linear model (e.g. Logistic Regression) as a baseline on this data?

Edit: in practice using integer coding for categorical variables tends to work very well for many randomized decision trees models (such as RandomForest and ExtraTrees in scikit-learn).

OTHER TIPS

Scikits random forest classifier can work with dummified variables, but it can also use categorical variables directly, which is the preferred approach. Just map your strings into integers. Assume your features vector is ['a' ,'b', 'b', 'c']

vals = ['a','b','b','c']
#create a map from your variable names to unique integers:
intmap = dict([(val, i) for i, val in enumerate(set(vals))]) 
#make the new array hold corresponding integers instead of strings:
new_vals = [intmap[val] for val in vals]

new_vals now holds values [0, 2, 2, 1], and you can give it to RF directly, without doing the dummification

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow