Problem

I'm working with an unbalanced classification problem, in which the target variable contains:

np.bincount(y_train)
array([151953,  13273])

i.e. 151953 zeroes and 13273 ones.

To deal with this I'm using XGBoost's weight parameter when defining the DMatrix:

dtrain = xgb.DMatrix(data=x_train, 
                     label=y_train,
                     weight=w)

For the weights I've been using:

bc = np.bincount(y_train)
n_samples = bc.sum()
n_classes = len(bc)
weights = n_samples / (n_classes * bc)
w = weights[y_train.values]
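
For reference, the same "balanced" class weights can be reproduced with scikit-learn (a minimal sketch, assuming scikit-learn is available; it uses the same n_samples / (n_classes * bincount) formula):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" weights: n_samples / (n_classes * np.bincount(y_train))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_train)
w = weights[y_train.values]   # one weight per training instance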

Here weights is array([0.54367469, 6.22413923]), and with w = weights[y_train.values] I'm just indexing it using the binary values in y_train. This seems like the correct way to define the weights, since it reflects the actual ratio between the two classes. However, this seems to be favoring the minority class, which can be seen by inspecting the confusion matrix:

array([[18881, 19195],
       [  657,  2574]])

So just by trying out different weight values, I've realized that with a fairly close weight ratio, specifically array([1, 7]), the results seem much more reasonable:

array([[23020, 15056],
       [  837,  2394]])

So my questions are:

  • Why does using the actual class weights yield poor metrics?
  • What is the right way to set the weights for an unbalanced problem?

Solution

Depending on your choice of accuracy metric, you'll find that different balancing ratios give the optimum value of the metric. To see why this is true, consider optimizing precision alone vs. optimizing recall alone. Precision is optimized (= 1.0) when there are no false positives. Upweighting the negative data reduces the positive prediction rate, and therefore the false positive rate. So if you just want to optimize precision, give the positive data zero weight: you'll always predict negative labels and precision will be ideal. Likewise, to optimize only recall, give the negative data zero weight and you'll always get the ideal value of recall. These extreme cases are silly for real-world applications, but they do show that your "best" balancing ratio depends on your metric.
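
To make this concrete with the two confusion matrices from the question, here is a minimal sketch (it assumes the scikit-learn layout: rows are true classes, columns are predicted classes):

import numpy as np

def precision_recall(cm):
    # rows = true class, columns = predicted class (scikit-learn convention)
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fp), tp / (tp + fn)

cm_balanced = np.array([[18881, 19195], [657, 2574]])   # weights ~[0.54, 6.22]
cm_manual = np.array([[23020, 15056], [837, 2394]])     # weights [1, 7]

print(precision_recall(cm_balanced))   # ~(0.118, 0.797): higher recall, lower precision
print(precision_recall(cm_manual))     # ~(0.137, 0.741): lower recall, higher precision

The heavier "balanced" weighting buys recall at the cost of precision, which is exactly the trade-off described above.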

As you're probably aware, metrics like AUC and F1 try to compromise between precision and recall. In the absence of prior information, people often aim for an "equal balance" between precision and recall, as implemented in AUC, and since AUC is relatively insensitive to data balance, 1:1 data balancing is generally appropriate. In real life, however, you may care more about precision than recall, or vice versa, so you need to select your metric in advance, depending on the problem you're solving. Then keep your metric fixed, vary your data balance, and evaluate the trained model on realistic test datasets. That will show whether your model is making the optimum predictions from the point of view of your chosen metric and your real-world dataset.
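
A minimal sketch of that procedure, assuming a held-out validation set x_valid / y_valid exists, F1 is the chosen metric, and predictions use a 0.5 decision threshold; the candidate ratios are illustrative only:

import numpy as np
import xgboost as xgb
from sklearn.metrics import f1_score

for pos_weight in [1, 3, 5, 7, 9, 11.45]:   # ~11.45 is the "balanced" ratio from the question
    w = np.where(y_train == 1, pos_weight, 1.0)    # upweight positives, negatives stay at 1
    dtrain = xgb.DMatrix(data=x_train, label=y_train, weight=w)
    dvalid = xgb.DMatrix(data=x_valid, label=y_valid)
    bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)
    y_pred = (bst.predict(dvalid) > 0.5).astype(int)   # predict() returns probabilities here
    print(pos_weight, f1_score(y_valid, y_pred))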

Other tips

Instance Weight File

XGBoost supports assigning each instance a weight to differentiate the importance of instances. For example, we can provide an instance weight file for the "train.txt" file in the example as below:

train.txt.weight

1
0.5
0.5
1
0.5

This means that XGBoost will put more emphasis on the first and fourth instances, i.e. the positive instances, during training. The configuration is similar to configuring the group information: if the instance file name is "xxx", XGBoost will check whether there is a file named "xxx.weight" in the same directory and, if there is, will use those weights while training the model.
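
When using the Python API rather than the CLI, the same per-instance weights are passed directly to the DMatrix, as in the question above. A small sketch with a toy dataset mirroring the train.txt.weight example (the feature values are made up):

import numpy as np
import xgboost as xgb

X = np.random.rand(5, 3)                                  # five toy instances, three made-up features
y = np.array([1, 0, 0, 1, 0])                             # first and fourth instances are the positives
instance_weights = np.array([1.0, 0.5, 0.5, 1.0, 0.5])    # same values as train.txt.weight

dtrain = xgb.DMatrix(data=X, label=y, weight=instance_weights)
# equivalently, after construction: dtrain.set_weight(instance_weights)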

  1. Important does not always equate to balanced.
  2. Don't even set weights; just make sure the problem is balanced. There are a ton of resources on this.