How to deal with possible data leakage in time series data?

https://datascience.stackexchange.com/questions/45575

01-11-2019

Question

I have historical data on consumers who took out a loan at some point in time. The task is to predict whether a consumer will default at the time they request a loan.

My issue is that for some customers in the data set, historical transactions are only available after the loan was issued. I believe that using data from after the loan event for prediction will cause data leakage.

This is a subtle form of leakage because it does not involve using information that is unavailable at prediction time. My concern is rather that customers change their behavior once they are indebted, which creates a shift in the underlying distribution.

To test this hypothesis, I was wondering whether a good approach would be to check if the two samples, one from before and one from after the loan was issued, come from the same distribution.
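As a rough sketch of the comparison described above, each feature could be checked separately with a two-sample Kolmogorov-Smirnov test. The arrays `before` and `after`, the function name `per_feature_ks`, and the synthetic data are assumptions made for illustration, not part of my actual data set. This only looks at marginal distributions, so it would miss a shift that shows up purely in the dependence between features.

```python
import numpy as np
from scipy.stats import ks_2samp


def per_feature_ks(before, after, alpha=0.05):
    """Two-sample KS test on each column, with a Bonferroni-corrected threshold."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    n_features = before.shape[1]
    threshold = alpha / n_features  # Bonferroni correction for multiple tests
    results = []
    for j in range(n_features):
        stat, p_value = ks_2samp(before[:, j], after[:, j])
        results.append(
            {
                "feature": j,
                "statistic": float(stat),
                "p_value": float(p_value),
                "shifted": bool(p_value < threshold),
            }
        )
    return results


# Synthetic stand-in for pre-loan and post-loan feature matrices (hypothetical).
rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=(500, 3))
after = rng.normal(0.0, 1.0, size=(500, 3))
after[:, 0] += 1.0  # simulate a behavioral shift in feature 0 only

results = per_feature_ks(before, after)
for row in results:
    print(row)
```

A small p-value for a feature suggests that its pre-loan and post-loan distributions differ, but it does not by itself say whether the difference matters for the default model.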

These are my questions:

Is there really data leakage in the scenario I described?
If yes, can I test for it in any way?
Can a two-sample test provide an answer? If so, which one? Note that the samples consist of multivariate data.
Can I test this with a machine learning approach? For instance, I was thinking of using a mixture model.
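One machine learning approach I am aware of for multivariate data is a classifier two-sample test (sometimes called adversarial validation): label each row as pre-loan or post-loan, train a classifier to tell them apart, and look at the cross-validated AUC. An AUC near 0.5 suggests the two samples are hard to distinguish, while a clearly higher AUC suggests a shift. The sketch below uses scikit-learn on synthetic data; the function name `domain_auc`, the random forest, and the arrays are assumptions for illustration, and this is a different technique from the mixture model I mentioned.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def domain_auc(before, after, seed=0):
    """Cross-validated AUC of a classifier separating 'before' from 'after' rows."""
    X = np.vstack([before, after])
    # Label 0 = pre-loan rows, label 1 = post-loan rows.
    y = np.concatenate([np.zeros(len(before)), np.ones(len(after))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    return float(scores.mean())


# Synthetic stand-ins for the real feature matrices (hypothetical).
rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=(500, 3))
after_same = rng.normal(0.0, 1.0, size=(500, 3))  # no shift
after_shifted = rng.normal(0.0, 1.0, size=(500, 3))
after_shifted[:, 0] += 1.0  # shift in one feature

same_auc = domain_auc(before, after_same)
shifted_auc = domain_auc(before, after_shifted)
print(f"AUC without shift: {same_auc:.3f}")
print(f"AUC with shift:    {shifted_auc:.3f}")
```

To turn the AUC into a formal test, the labels could be permuted many times to build a null distribution for the score; that step is omitted here to keep the sketch short.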

Any suggestions on how best to deal with this problem, other than what I proposed, would be appreciated.

Thanks

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange