Question

Have access to a dataset with hundreds of variables and millions of cases (American Community Survey).

Need to identify a small, manageable set of Independent Variables (IVs) to use for Multiple Regression.

One way to do this, of course, would be to use applicable theories to identify the IVs.

Was wondering how I could use a data-driven (data-mining?) approach as follows:

  • Use a Decision Tree to identify impactful (candidate? relevant?) IVs?
  • And then use these as the IVs in the Multiple Regression?

(Seem to remember reading once, in passing, that this approach to variable reduction is permitted.)

Tried searching on Google for articles that clarify the above, but the search terms are such that I keep getting hits to articles that compare Decision Trees and Multiple Regression.

So, if you know of articles and research papers that describe how to do the above, please leave links below. Also, I would welcome your own original suggestions on how to proceed.

Solution

Decision trees are useful for determining nested/interactive relationships between combinations of IVs and a DV.

The model you specified, a multiple regression, presupposes a particular functional form for the relationship between the IVs and the DV (e.g. linear and additive).

As you are aware, these models are different. So using a decision tree coupled with some importance measure to find predictive variables won't necessarily provide you with an optimal set of IVs in a regression model.

That being said, it can be a helpful exercise to inform you of non-linear relationships or interaction terms that could be predictive, and which may not be captured by specifying a model such as a multiple regression.
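To make that concrete, here is a minimal sketch on synthetic data (not the ACS). It assumes scikit-learn and NumPy are available; the variable names, the data-generating formula and the tree settings are all made up for illustration. The DV depends on x0 linearly and on x1 only through a squared term, so a main-effects regression sees almost nothing in x1, while the tree's importance measure flags it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): four candidate IVs,
# of which x0 acts linearly, x1 acts through x1**2, and x2, x3 are noise.
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

# A shallow tree; impurity-based importances rank the candidate IVs.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=200, random_state=0)
tree.fit(X, y)
importances = tree.feature_importances_

# Main-effects regression: the coefficient on x1 is close to zero
# because the relationship is U-shaped, not linear.
main_only = LinearRegression().fit(X, y)
r2_main = main_only.score(X, y)

# Use what the tree hinted at: add the squared term as an extra IV.
X_aug = np.column_stack([X, X[:, 1] ** 2])
augmented = LinearRegression().fit(X_aug, y)
r2_aug = augmented.score(X_aug, y)

print("tree importances:", np.round(importances, 3))
print("x1 coefficient in main-effects model:", round(main_only.coef_[1], 3))
print("R^2 main effects:", round(r2_main, 3), "| with x1^2:", round(r2_aug, 3))
```

The point is the workflow rather than the numbers: the tree suggests which variables matter, and it is then up to you to work out the form (squared term, interaction, etc.) in which they should enter the regression.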

If I were you, I wouldn't rely solely on decision trees to determine a set of IVs for a regression model. I would investigate penalized regression methods such as LASSO, which can shrink some coefficients exactly to zero, to help take you from a reduced candidate set of IVs to your final IVs. Ridge regression also shrinks coefficients but does not drop variables, so it helps more with collinearity than with selection. In addition, you might want to explore associative metrics related to your model specification that might be useful in exploring the relationships in your data, such as information values, chi-square tests, correlations, etc.
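A rough sketch of the LASSO step, again on synthetic data and assuming scikit-learn is installed. The 20 candidate IVs, the three "true" ones and their coefficients are invented for the example. The penalty strength is chosen by cross-validation, the IVs with non-zero coefficients are kept, and an ordinary regression is refit on that subset.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 candidate IVs,
# only columns 0, 5 and 12 actually drive the DV.
rng = np.random.default_rng(42)
n, p = 2000, 20
X = rng.normal(size=(n, p))
true_idx = [0, 5, 12]
y = 1.5 * X[:, 0] - 2.0 * X[:, 5] + 1.0 * X[:, 12] + rng.normal(size=n)

# Standardize first so the penalty treats every IV on the same scale.
X_std = StandardScaler().fit_transform(X)

# LASSO with the penalty strength picked by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

# IVs whose coefficients were not shrunk to exactly zero.
selected = np.flatnonzero(lasso.coef_ != 0)

# Refit a plain multiple regression on the selected IVs only.
ols = LinearRegression().fit(X[:, selected], y)
r2_selected = ols.score(X[:, selected], y)

print("selected columns:", selected.tolist())
print("R^2 of refit model:", round(r2_selected, 3))
```

Two caveats: the cross-validated penalty tends to let a few noise variables through, so treat the result as a shortlist rather than a final answer, and standard errors or p-values from the refit model are optimistic because the same data were used to pick the variables.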

This may be helpful: https://stats.stackexchange.com/questions/47367/decision-tree-as-variable-selection-for-logistic-regression

Licensed under: CC-BY-SA with attribution