Question

So, from Algorithm 3 of https://arxiv.org/pdf/1603.02754v3.pdf, an optimum default direction is determined and the missing values go in that direction. However (or perhaps I have misunderstood/missed the explanation in the article), it doesn't say what exactly the input is.

For example, I have (parent) node A (with 50 inputs) splitting into node B and node C. Now, say of the 50 inputs there are 7 missing values.

The other 43 inputs are split into B and C accordingly. What I seem to understand is that it will allocate the remaining 7 into B and C and determine which one gives a higher gain score; that will be the optimal direction.

However, given that the 7 values are missing (which means I don't know what these 7 values are), how does allocating the missing values into either of the child nodes change the gain score, or rather minimize the loss function? This seems to suggest that XGBoost is imputing something for the missing values. I can't seem to find out what XGBoost imputes for these missing values. I hope this question isn't too vague/general.

Edit: I think "missing values" may be a vague term. What I meant here is (from Wikipedia): "In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation."

From the author himself (https://github.com/dmlc/xgboost/issues/21), commenting in Aug 2014:

"xgboost naturally accepts sparse feature format, you can directly feed data in as sparse matrix, and only contains non-missing value.

i.e. features that are not presented in the sparse feature matrix are treated as 'missing'. XGBoost will handle it internally and you do not need to do anything on it."

And:

"Internally, XGBoost will automatically learn what is the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learn" what is the best imputation value for missing values based on reduction on training loss."


Solution

The procedure is described in their paper, section 3.4: Sparsity-aware Split Finding.

Assume you're at your node with 50 observations and, for the sake of simplicity, that there's only one split point possible.

For example, you have only one binary feature $x$, and your data can be split into three groups:

  • Group $B$: 20 observations such that $x=B$,
  • Group $C$: 20 observations such that $x=C$,
  • Group $M$: 10 observations such that $x=?$

The algorithm will split based on $x$, but does not know where to send the group $M$. It will try both assignments,

$$(B, M), C \qquad \text{and} \qquad B, (C, M),$$

compute the value to be assigned to the prediction at each node using all the data, and choose the assignment that minimizes the loss.

For example, if the split $(B, M), C$ is chosen, the value of the left node will have been computed from all the $B$ and $M$ samples.
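
A minimal sketch of that search, assuming a squared-error loss and a node that predicts the mean of its targets; the helper names here are hypothetical, not XGBoost internals:

```python
def sse(targets):
    """Sum of squared errors when the node predicts the mean of its targets."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def best_default_direction(y_B, y_C, y_M):
    """Try sending the missing group M left (with B) or right (with C),
    and keep whichever assignment gives the smaller total loss."""
    loss_left = sse(y_B + y_M) + sse(y_C)    # assignment (B, M), C
    loss_right = sse(y_B) + sse(y_C + y_M)   # assignment B, (C, M)
    return "left" if loss_left <= loss_right else "right"
```

With the group means used in the worked example below ($B$ averaging $5$, $C$ averaging $10$, $M$ averaging $0$), this routine sends $M$ to the left, i.e. it picks the $(B, M), C$ split.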

This is what is meant by:

"Automatically 'learn' what is the best imputation value for missing values based on reduction on training loss."


Comment question:

How can group $M$ be calculated if I don't know what the values for it are?

The value we compute is not based on the feature $x$, which we don't know, but on the known label/target of the sample. Let's say we are doing a regression, trying to minimize the mean squared error, and that we have the following:

  • For group $B$, the mean of the target is $5$.
  • For group $C$, the mean of the target is $10$.
  • For the missing value group $M$, the mean of the target is $0$.

Note that even if the value of $x$ is missing, we know the value of the target of the sample; otherwise we could not use it to train our model.

In this case, the split would be made on $(B, M), C$: the value assigned to the right node containing the samples in $C$ would be $10$, and the value assigned to the left node containing the samples in $(B, M)$ would be the mean of the target over the whole combined group. In our example, this mean is

$$\frac{|M|}{|B| + |M|}\,\text{mean}(M) + \frac{|B|}{|B| + |M|}\,\text{mean}(B) = \frac{10}{30}\cdot 0 + \frac{20}{30}\cdot 5 = 3.\overline{3}.$$
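
A quick numeric check of that leaf value, plugging in the group sizes and target means from the example:

```python
n_B, n_M = 20, 10
mean_B, mean_M = 5.0, 0.0

# Weighted mean of the combined (B, M) targets, as in the formula above.
left_value = (n_M * mean_M + n_B * mean_B) / (n_B + n_M)
print(left_value)  # 3.333... -> the value assigned to the left node
```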

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange