Question

I have implemented a decision tree (DT) classifier with cross-validation (CV) in scikit-learn. However, I would also like to output the number of features that contributed to the classification. This is the code I have so far:

from collections import defaultdict

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier


lemma2feat = defaultdict(lambda: defaultdict(float))  # { lemma: {feat : weight}}
lemma2cat = dict()
features = set()


# Each line of input.csv: lemma feature weight target_class (whitespace-separated).
with open("input.csv", "rb") as infile:
    for line in infile:
        lemma, feature, weight, tClass = line.split()
        lemma2feat[lemma][feature] = float(weight)
        lemma2cat[lemma] = int(tClass)
        features.add(feature)

sorted_rows = sorted(lemma2feat.keys())
col2index = {col: colIdx for colIdx, col in enumerate(sorted(features))}

dMat = np.zeros((len(sorted_rows), len(col2index)), dtype=float)


# Populate the matrix: one row per lemma, one column per feature.
for rowIdx, lemma in enumerate(sorted_rows):
    for feature, weight in lemma2feat[lemma].items():
        dMat[rowIdx, col2index[feature]] = weight

# Class labels, aligned with the row order of dMat.
res = [lemma2cat[lem] for lem in sorted_rows]


clf = DecisionTreeClassifier(random_state=0)


print "Acc:"
print cross_val_score(clf, dMat, np.asarray(res), cv=10, scoring = "accuracy")

What can I add to output the number of features? I looked at RFE, for instance, as I asked in a different question, but it cannot easily be combined with a DT. I would therefore like to know whether there is a way to modify the code above so that it also outputs the number of features that contribute to the highest accuracy. The overall goal is then to plot this in an elbow plot against the output of other classifiers. Thank you.

Solution

You can inspect the relevant features using the feature_importances_ attribute once your tree is fit. It gives you an array of n_features float values such that feature_importances_[i] is high (relative to the other values) if the i-th feature was important in building the tree, and low (close to 0) if it was not.
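
As a minimal sketch of how this could look with the code from the question: fit one tree per fold using the old sklearn.cross_validation API and count the features with nonzero importance. Here dMat and res are assumed to be the matrix and labels built above, and names such as fold_clf and n_used are purely illustrative:

from sklearn.cross_validation import KFold

y = np.asarray(res)
for train_idx, test_idx in KFold(len(y), n_folds=10):
    fold_clf = DecisionTreeClassifier(random_state=0)
    fold_clf.fit(dMat[train_idx], y[train_idx])
    acc = fold_clf.score(dMat[test_idx], y[test_idx])
    # A feature with nonzero importance was used in at least one split of this fold's tree.
    n_used = np.count_nonzero(fold_clf.feature_importances_)
    print "Fold accuracy: %.3f, contributing features: %d" % (acc, n_used)

The per-fold counts (or a single count from a tree fit on all the data) can then be collected alongside the accuracies for the elbow plot you describe.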
