Question

I have historical data on consumers who have taken out a loan at some point in time. The task is to predict whether a consumer will default when they request a loan.

My issue is that for some customers in the data set, historical transactions are only available after the loan was issued. I believe using data recorded after the loan event for prediction will cause data leakage.

This is a subtle form of leakage because it does not involve using information that is unavailable at prediction time. My concern is more about behavioral change once the customer is indebted, which creates a shift in the underlying distribution.

To test my hypothesis, I was wondering whether a good approach would be to check if the two samples, before and after the loan is issued, come from the same distribution.

These are my questions:

  1. Is there really data leakage in the scenario I described?

  2. If yes, can I test it in any way?

  3. Can a two-sample test provide an answer? If so, which one? Note that the samples consist of multivariate data.

  4. Can I test this using a machine learning approach? I was thinking of using a mixture model, for instance.
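
To make question 3 concrete, here is a minimal sketch of one multivariate two-sample test: a permutation test on the energy distance. This is not the mixture model idea from question 4, just one candidate. It assumes each customer is summarized as a fixed-length numeric feature vector; the function names and the synthetic data are hypothetical and only the standard library is used.

```python
import math
import random


def energy_statistic(dist, idx_a, idx_b):
    """Energy distance between two groups, given a pooled distance matrix."""
    def mean_dist(rows, cols):
        total = sum(dist[i][j] for i in rows for j in cols)
        return total / (len(rows) * len(cols))

    return (2 * mean_dist(idx_a, idx_b)
            - mean_dist(idx_a, idx_a)
            - mean_dist(idx_b, idx_b))


def energy_permutation_test(before, after, n_perm=200, seed=0):
    """Return (statistic, p_value) for H0: both samples share one distribution."""
    pooled = list(before) + list(after)
    n_a = len(before)
    # Pairwise Euclidean distances are computed once and reused per permutation.
    dist = [[math.dist(p, q) for q in pooled] for p in pooled]
    idx = list(range(len(pooled)))
    observed = energy_statistic(dist, idx[:n_a], idx[n_a:])

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # random relabeling of "before" / "after"
        if energy_statistic(dist, idx[:n_a], idx[n_a:]) >= observed:
            hits += 1
    # The +1 terms keep the permutation p-value strictly positive.
    return observed, (1 + hits) / (1 + n_perm)


if __name__ == "__main__":
    # Synthetic stand-ins for per-customer feature vectors (assumption).
    gen = random.Random(1)
    before = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(30)]
    after = [(gen.gauss(2, 1), gen.gauss(2, 1)) for _ in range(30)]
    stat, p_value = energy_permutation_test(before, after)
    print(f"energy statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A small p-value would only indicate that the pre-loan and post-loan samples differ in distribution; it would not by itself show that the difference harms the default model. Features should probably be put on comparable scales first, since the Euclidean distance is scale sensitive.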

Any suggestions on how best to deal with this problem, other than what I suggested, will be appreciated.

Thanks

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange