Question

Problem

For my machine learning task, I create a set of predictors. Predictors come in "bundles": multi-dimensional measurements (3- or 4-dimensional in my case).

The whole "bundle" makes sense only if it has been measured and is taken all together.

The problem is that each "bundle" of predictors can be measured only for a small part of the sample, and those parts don't necessarily overlap across different "bundles".

Because these parts are small, imputation leads to a considerable (in fact catastrophic) decrease in accuracy.

Possible solutions

I could create dummy variables that mark, for each variable, whether the measurement has taken place. The problem is that when a random forest draws random variables, it draws them individually.
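A minimal sketch of the dummy-variable idea, assuming the predictors sit in a NumPy array with NaN for unmeasured values. The column layout and the helper name `add_bundle_indicators` are hypothetical, and the sketch uses one indicator per bundle rather than one per variable, since a bundle is either measured as a whole or not at all.

```python
import numpy as np

# Hypothetical layout: columns 0-2 form bundle "A", columns 3-6 form bundle "B".
bundles = {"A": [0, 1, 2], "B": [3, 4, 5, 6]}

def add_bundle_indicators(X, bundles):
    """Append one 0/1 column per bundle: 1 if every value in the bundle was measured."""
    indicators = [
        (~np.isnan(X[:, cols]).any(axis=1)).astype(float)
        for cols in bundles.values()
    ]
    return np.column_stack([X] + indicators)

X = np.array([
    [1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan],  # only bundle A measured
    [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0],     # only bundle B measured
    [np.nan] * 7,                                      # nothing measured
])
X_flagged = add_bundle_indicators(X, bundles)  # two indicator columns appended
```

The indicator columns tell a tree whether a bundle is usable, but they do not change how the forest draws candidate variables.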

So there are two basic ways to solve this problem: 1) Combine each "bundle" into one predictor. That is possible, but it seems information would be lost. 2) Make the random forest draw variables not individually, but in obligatory "bundles".

Problem for random forest

Because a random forest draws variables at random, it picks features that are useless (or much less useful) without the others from their "bundle". I have a feeling that this leads to a loss of accuracy.

Example

For example, I have variables a, a_measure, b, b_measure. The variable a_measure makes sense only if variable a is present, and the same holds for b. So I either have to combine a and a_measure into one variable, or make the random forest draw both whenever at least one of them is drawn.
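To make "draw both whenever one is drawn" concrete, here is a sketch in plain Python of sampling at the bundle level. The grouping dictionary and the function `draw_features_by_bundle` are hypothetical illustrations, not part of any random forest library.

```python
import random

# Hypothetical grouping, using the variable names from the example above.
bundles = {
    "a": ["a", "a_measure"],
    "b": ["b", "b_measure"],
}

def draw_features_by_bundle(bundles, n_bundles, rng):
    """Sample whole bundles, then return every variable in the drawn bundles."""
    drawn = rng.sample(sorted(bundles), n_bundles)
    return [feature for name in drawn for feature in bundles[name]]

rng = random.Random(0)
features = draw_features_by_bundle(bundles, n_bundles=1, rng=rng)
# features is either ["a", "a_measure"] or ["b", "b_measure"], never a mix
```

A subset drawn this way could be used to pick the columns each tree is trained on, which is a per-tree approximation of the per-split sampling a random forest normally does.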

Question

What are the best-practice solutions for problems where different sets of predictors are measured for small parts of the overall population, and these sets of predictors come in obligatory "bundles"?

Thank you!

Solution

You may want to consider gradient boosted trees rather than random forests. They're also an ensemble tree-based method, but since this method doesn't sample dimensions, it won't run into the problem of not having a useful predictor available to split on at any particular time.

Different implementations of GBDT handle missing values in different ways, which will make a big difference in your case; I believe R does ternary splits, which is likely to work fine.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange