Question

I know about the problems that an imbalanced dataset causes when working on classification problems, and I know the standard remedies, including undersampling and oversampling.

I have to work on a ranking problem (ranking hotels, evaluated with the NDCG@50 score, this link), and the dataset is extremely imbalanced. However, the examples I saw on the internet use the dataset as it is and pass it to train_test_split without oversampling/undersampling.

I am confused: is it true that in ranking problems imbalanced data does not matter, so we do not need to fix it before passing the data to the model?

And if that is the case why?

Thanks

Solution

You are completely right: label imbalance does have an impact on ranking problems, and people use techniques to counter it.

The example in your notebook applies list-wise gradient boosting. Since a pairwise ranking loss can be made list-wise by injecting the NDCG into its gradient, I will focus on the pairwise rank loss for the argument. I will base it on this paper (https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf).

$C = -\bar{P}_{ij}\log P_{ij} - (1 - \bar{P}_{ij})\log(1 - P_{ij})$

with $P_{ij}\equiv P(U_{i}\rhd U_{j})\equiv {1\over{1 + e^{-\sigma(s_{i} - s_{j})}}}$ and $\bar{P}_{ij} = {1\over2}(1 + S_{ij})$, where $S_{ij}$ is $1$ if document $i$ is more relevant than document $j$, $-1$ if it is less relevant, and $0$ if they are equally relevant.

This is in essence a binary classification problem on document pairs: the target is $0$ when document $i$ is less relevant than document $j$, and $1$ in the opposite case.
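To make the pairwise view concrete, here is a minimal NumPy sketch of that cross-entropy; the function name and the `sigma` default are my own choices, not from the paper:

```python
import numpy as np

def pairwise_rank_loss(s_i, s_j, S_ij, sigma=1.0):
    """Pairwise cross-entropy rank loss (RankNet-style).

    s_i, s_j : model scores for documents i and j
    S_ij     : 1 if i is more relevant than j, -1 if less, 0 if equal
    """
    # P(U_i > U_j): sigmoid of the score difference
    P_ij = 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))
    # target pair probability, mapping S_ij in {-1, 0, 1} to {0, 0.5, 1}
    P_bar = 0.5 * (1.0 + S_ij)
    eps = 1e-12  # numerical guard for log(0)
    return -P_bar * np.log(P_ij + eps) - (1.0 - P_bar) * np.log(1.0 - P_ij + eps)
```

With equal scores and an "equally relevant" label the loss is $\log 2$, and it shrinks as the score gap agrees with the label, which is exactly the binary cross-entropy behaviour described above.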

Imagine that you are now working with queries that have many matching documents, but only a couple of documents have been tagged as relevant. Often such sparse tagging does not mean that ONLY these documents were relevant; it is merely a consequence of how relevance estimates are collected (https://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf).

Hence, it is not uncommon to down-sample highly rated documents.
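A per-query down-sampling step could look like the sketch below. The schema (columns `query_id` and `relevance`) is hypothetical, and which label value to thin out, and by how much, is a design choice you would tune:

```python
import pandas as pd

def downsample_label(df, label_value, frac, label_col="relevance", seed=0):
    """Within each query, keep only a fraction of rows carrying `label_value`.

    Assumes df has a 'query_id' column and `label_col` (hypothetical schema).
    Sampling is done per query so that every query keeps both label kinds.
    """
    def _thin(group):
        hit = group[group[label_col] == label_value]
        rest = group[group[label_col] != label_value]
        return pd.concat([rest, hit.sample(frac=frac, random_state=seed)])

    return df.groupby("query_id", group_keys=False).apply(_thin)
```

Down-sampling inside each query (rather than globally) matters because ranking losses only compare documents that share a query.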

Another reason for applying imbalance techniques such as reweighting labels is, for example, bias due to position (see for example https://ciir-publications.cs.umass.edu/getpdf.php?id=1297). The loss is reweighted based on the position at which a document was shown when its relevance was tagged.
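A common way to implement this reweighting is inverse-propensity weighting: divide each example's loss by the probability that its relevance would have been observed at that rank. The $1/\text{rank}$ propensity model below is one simple assumption, not the only choice:

```python
import numpy as np

def position_weights(positions, propensities=None):
    """Inverse-propensity loss weights for labels observed at given ranks.

    positions    : 1-based ranks at which the documents were shown
    propensities : P(label observed | rank); defaults to the 1/rank model,
                   which is an illustrative assumption, not a fixed rule
    """
    positions = np.asarray(positions, dtype=float)
    if propensities is None:
        propensities = 1.0 / positions  # assume P(observed | rank k) = 1/k
    # weight = 1 / propensity, so rarely-seen positions count more
    return 1.0 / np.asarray(propensities, dtype=float)
```

These weights would then multiply the per-document (or per-pair) loss terms, so a relevant document that was buried at rank 4 counts four times as much as one shown at rank 1.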

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange