Train a model to determine that the probability of an event given a set of features is higher than when given a different set of features [closed]

datascience.stackexchange https://datascience.stackexchange.com/questions/80418

Question

I have a data set of attempted phone calls. I have a set of features, say, hour of day, and zip code. I have a label indicating whether the callee picked up the phone or not.

I want a model to predict the probability of a phone pick-up given an instance's feature set.

My difficulty is that I'm not interested in predicting whether the phone will be picked up or not, which would fit a standard binary classification model, because I do not expect a very strong correlation between the features and the event. I'm merely hoping to discover that an instance's feature set gives some boost to the probability of a pick-up. Then I could use that to prioritize which phone numbers to attempt calling.

I don't think this fits neatly into a binary classifier model. What techniques/models can I look into for this problem?

Specifically, I'm looking for a model type to train with the data, that I can evaluate on a test set, and that will hopefully get better with more data.

I'm pretty new to this, as I'm sure you can tell, so any help would be greatly appreciated.


Solution

I'm not sure you need to depart from the binary classification paradigm. If you train a binary classification model with whether or not the phone is picked up as the label, the trained model will map points in the feature space to a mostly monotonic transform of the "actual" pick-up probability (the transform depends on your loss function and sample size). As long as you only care about ordinal optimization (i.e., you aren't bound by significant constraints in the feature space), you can simply feed the trained model into an optimization package as the function to be optimized, with your feature space as the support. Consider SciPy's optimize package for Python or JuMP for Julia.
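As a rough sketch of that workflow, here is a minimal example assuming a pandas DataFrame with hypothetical columns `hour`, `zip_code`, and a binary `picked_up` label; the file name, feature names, and choice of gradient boosting are all illustrative assumptions, not part of your setup:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data: one row per attempted call.
calls = pd.read_csv("calls.csv")  # columns: hour, zip_code, picked_up

# One-hot encode zip code; hour is left numeric for simplicity.
X = pd.get_dummies(calls[["hour", "zip_code"]], columns=["zip_code"])
y = calls["picked_up"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Ranking quality matters more than exact calibration here, so ROC AUC
# on the held-out set is a reasonable evaluation metric.
scores = model.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, scores))

# Prioritize calls by descending predicted pick-up probability.
priority = X_test.assign(score=scores).sort_values("score", ascending=False)
print(priority.head())
```

The predicted scores only need to preserve the ordering of the true probabilities for prioritization to work, which is why an ordinal metric like AUC is a natural fit for evaluation here.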

If you want to optimize subject to a relevant constraint (maybe one of your features costs money, etc.), things might get tricky: you would probably need to use a |y_true - y_est| loss function along with a large sample size to push the implicit transform's image toward the actual probability distribution, and that loss function could make convergence difficult.

If this is the case, I'm not sure trying a distribution-based approach (like you seem to be hinting at) would be worth it — you may just want to bake your constraints into the ML loss function.
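For the constrained case, here is a hedged sketch of how the optimization-package route could look with SciPy, using a stand-in probability function in place of a trained model and a made-up cost function; the cost, budget, and feature choice are assumptions for illustration only:

```python
import numpy as np
from scipy.optimize import minimize

def prob_pickup(x):
    # Stand-in for a trained model's predicted pick-up probability; peaks around 6 pm.
    hour = x[0]
    return np.exp(-((hour - 18.0) ** 2) / 8.0)

def cost(x):
    # Hypothetical cost that grows with the calling hour (e.g., agent overtime).
    return 0.5 * x[0]

budget = 8.0  # hypothetical per-call budget, so cost <= budget means hour <= 16

result = minimize(
    lambda x: -prob_pickup(x),  # maximize the predicted probability
    x0=np.array([12.0]),        # start the search at noon
    bounds=[(0.0, 23.0)],       # hour of day
    constraints=[{"type": "ineq", "fun": lambda x: budget - cost(x)}],
    method="SLSQP",
)
print("best hour:", result.x[0], "predicted probability:", -result.fun)
```

With the budget binding, the optimizer settles at the constraint boundary (hour 16) rather than the unconstrained optimum (hour 18), which is exactly the kind of trade-off a constraint in the feature space forces.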

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange