Problem

I am training a binary classifier in Python to estimate the level of risk of credit applicants. I extracted a little over a thousand independent variables to model the observed behavior of four million people. My target is a binary column that tells me whether or not a person defaulted on a loan (1 for event, 0 for non-event).

I am asking this question because I feel overwhelmed by the dimensionality of the problem. I want to know some common and modern ways used to:

  1. drop features (dimensionality reduction)
  2. create new features based on combinations of other features (feature engineering)

So far, I have dropped features based on their Information Value (IV), keeping only the most relevant ones. From the remaining set of features, I calculated the correlation coefficient for each pair and, for every highly correlated pair, kept the one with the higher IV.
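For reference, here is a minimal sketch of that IV-plus-correlation pruning, assuming the features are numeric columns of a pandas DataFrame `X` and the target is a 0/1 Series `y`; the quantile binning, the smoothing constant, and both thresholds are illustrative choices rather than anything prescribed above:

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series, bins: int = 10) -> float:
    """IV of a single numeric feature against a binary (0/1) target.

    The feature is binned into quantiles, and each bin contributes
    (%non_event - %event) * ln(%non_event / %event) to the total.
    """
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, target)
    # A small constant avoids division by zero / log(0) in sparse bins.
    non_event = (counts[0] + 0.5) / (counts[0].sum() + 0.5)
    event = (counts[1] + 0.5) / (counts[1].sum() + 0.5)
    return float(((non_event - event) * np.log(non_event / event)).sum())


def prune_by_iv_and_correlation(X: pd.DataFrame, y: pd.Series,
                                iv_threshold: float = 0.02,
                                corr_threshold: float = 0.9) -> list:
    """Keep features whose IV clears a threshold, then drop the weaker
    member of every highly correlated pair."""
    ivs = {col: information_value(X[col], y) for col in X.columns}
    kept = [c for c in X.columns if ivs[c] >= iv_threshold]

    corr = X[kept].corr().abs()
    to_drop = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if corr.loc[a, b] >= corr_threshold:
                to_drop.add(a if ivs[a] < ivs[b] else b)  # drop the lower-IV feature
    return [c for c in kept if c not in to_drop]
```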

I now want to make new features based on this subset of remaining variables, such as ratios and products (e.g., the number of open accounts divided by the number of closed accounts). However, I think my elimination process can be improved. My current method is quite old-school, since I have mostly relied on univariate analysis (one variable against the target) to drop features.
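A small sketch of that kind of pairwise feature construction, again assuming a pandas DataFrame; the column names in the usage comment are hypothetical:

```python
import pandas as pd
from itertools import combinations


def pairwise_ratio_features(X: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Build ratio and product features for every pair of the given columns.

    A small epsilon in the denominator keeps divisions by zero
    (e.g. zero closed accounts) from producing infinities.
    """
    eps = 1e-9
    new_features = {}
    for a, b in combinations(columns, 2):
        new_features[f"{a}_div_{b}"] = X[a] / (X[b] + eps)
        new_features[f"{a}_times_{b}"] = X[a] * X[b]
    return pd.DataFrame(new_features, index=X.index)


# Hypothetical usage, mirroring the open/closed accounts example:
# engineered = pairwise_ratio_features(df, ["num_open_accounts", "num_closed_accounts"])
```

Because the number of pairs grows quadratically, this is normally applied only to a short list of candidate columns rather than the full feature set.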


Solution

I'm not really knowledgeable about the modern techniques, but I can tell you about the old ones ;)

First, there are two main approaches to dimensionality reduction: feature selection and feature extraction. You're using the former, which consists of discarding some of the original features. The latter consists of some kind of "merging" of similar variables; it can be worth trying, especially if you have redundant features.
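As one concrete illustration of the feature-extraction route (my example, not something the answer prescribes), principal component analysis merges correlated variables into a smaller set of components; the standardisation step and the 95% explained-variance cutoff below are assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise first so variables on large scales do not dominate the components,
# then keep as many components as needed to explain ~95% of the variance.
extractor = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = extractor.fit_transform(X)  # X: numeric feature matrix from the pruning step
```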

As you rightly noticed, feature selection based on individual features is rarely optimal. There are methods which can take the full set of features as a basis for selection, in particular genetic feature selection.
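For concreteness, a toy genetic feature-selection loop could look like the sketch below. The cross-validated AUC fitness, the logistic-regression scorer, and all of the GA hyperparameters are illustrative assumptions, and with four million rows you would run it on a sample of the data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)


def fitness(mask: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Score a candidate feature subset (boolean mask) by cross-validated AUC."""
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3, scoring="roc_auc").mean()


def genetic_feature_selection(X: np.ndarray, y: np.ndarray, pop_size: int = 20,
                              generations: int = 10,
                              mutation_rate: float = 0.05) -> np.ndarray:
    """Tiny genetic algorithm: each individual is a boolean mask over the features."""
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5  # random initial subsets
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        parents = population[np.argsort(scores)[-pop_size // 2:]]  # keep the top half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            crossover = rng.random(n_features) < 0.5               # uniform crossover
            child = np.where(crossover, a, b)
            child ^= rng.random(n_features) < mutation_rate        # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[scores.argmax()]  # best feature mask found
```

The returned boolean mask can then be used to slice the feature matrix, e.g. `X_selected = X[:, mask]`.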
