Following the same method as you describe, namely doing feature selection and classification with two distinct Random Forest classifiers grouped into a Pipeline, I ran into the same issue.
An instance of the RandomForestClassifier class does not have an attribute called threshold. You can indeed add one manually, either in the way you described or with
setattr(object, 'threshold', 'mean')
but the main problem seems to be the way the get_params method checks for valid attributes of any BaseEstimator subclass:
class BaseEstimator(object):
    """Base class for all estimators in scikit-learn

    Notes
    -----
    All estimators should specify all the parameters that can be set
    at the class level in their __init__ as explicit keyword
    arguments (no *args, **kwargs).
    """

    @classmethod
    def _get_param_names(cls):
        """Get parameter names for the estimator"""
        try:
            # fetch the constructor or the original constructor before
            # deprecation wrapping if any
            init = getattr(cls.__init__, 'deprecated_original', cls.__init__)
            # introspect the constructor arguments to find the model parameters
            # to represent
            args, varargs, kw, default = inspect.getargspec(init)
            if not varargs is None:
                raise RuntimeError("scikit-learn estimators should always "
                                   "specify their parameters in the signature"
                                   " of their __init__ (no varargs)."
                                   " %s doesn't follow this convention."
                                   % (cls, ))
            # Remove 'self'
            # XXX: This is going to fail if the init is a staticmethod, but
            # who would do this?
            args.pop(0)
        except TypeError:
            # No explicit __init__
            args = []
        args.sort()
        return args
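A quick check makes the consequence concrete (sketched here against scikit-learn as installed; RandomForestClassifier has no threshold parameter in its __init__): a manually attached attribute is visible on the instance, but get_params only reports __init__ keyword arguments, so Pipeline.set_params and GridSearchCV can never reach it.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
# Attach the attribute by hand, as described above
setattr(clf, 'threshold', 'mean')

# The attribute exists on the instance...
print(clf.threshold)                    # -> mean
# ...but get_params() only lists __init__ keyword arguments,
# so 'threshold' is invisible to set_params / grid search
print('threshold' in clf.get_params())  # -> False
```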
Indeed, as the docstring clearly states, all estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments.
So I tried to specify threshold as an argument in the __init__ function with a default value of 'mean' (which is its default value in the current implementation anyway):
def __init__(self,
             n_estimators=10,
             criterion="gini",
             max_depth=None,
             min_samples_split=2,
             min_samples_leaf=1,
             max_features="auto",
             bootstrap=True,
             oob_score=False,
             n_jobs=1,
             random_state=None,
             verbose=0,
             min_density=None,
             compute_importances=None,
             threshold="mean"):  # ADD THIS!
and then assign the value of this argument to an attribute of the same name:
self.threshold = threshold  # ADD THIS LINE SOMEWHERE IN __init__
Of course, this implies modifying the class RandomForestClassifier (in /python2.7/site-packages/sklearn/ensemble/forest.py), which might not be the best way... But it works for me! I am now able to grid-search (and cross-validate) over different values of the threshold argument, and thus over different numbers of selected features.
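A less invasive variant of the same idea is to subclass RandomForestClassifier instead of editing the installed forest.py. Since _get_param_names introspects the subclass's __init__, the new keyword shows up in get_params and is therefore reachable by set_params and GridSearchCV. This is only a sketch (the class name ThresholdRandomForest is mine, and only a couple of the usual constructor arguments are forwarded, to keep it short):

```python
from sklearn.ensemble import RandomForestClassifier


class ThresholdRandomForest(RandomForestClassifier):
    """RandomForestClassifier with an extra 'threshold' constructor
    parameter, so that get_params()/set_params() (and hence grid
    search) can see it."""

    def __init__(self, threshold='mean', n_estimators=10, random_state=None):
        # Forward only a subset of the usual arguments for brevity
        super(ThresholdRandomForest, self).__init__(
            n_estimators=n_estimators, random_state=random_state)
        self.threshold = threshold


clf = ThresholdRandomForest()
print('threshold' in clf.get_params())  # -> True
clf.set_params(threshold='median')
print(clf.threshold)                    # -> median
```

For what it's worth, newer scikit-learn versions cover this use case out of the box with sklearn.feature_selection.SelectFromModel, whose threshold is a regular constructor parameter and can therefore be grid-searched inside a Pipeline directly.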