Question

I have to think about a model to identify prospects (companies) that have a high chance of being converted into clients, and I'm looking for advice on what kind of model could be of use.

The databases I will have are, as far as I know (I don't have them yet): the list of current clients (in other words, converted prospects) and their features (size, revenue, age, location, and so on), and a list of prospects (which I have to score) and their features. However, I don't think I'll have a list of the companies that used to be prospects but failed to convert into clients. If I had that list, I think I could have opted for a random forest. Of course I could still run a random forest on the union of my two databases, treating the clients as converted and the prospects as non-converted, but that feels like a bad idea.

So I need to find, in the list of prospects, those that look like the existing clients. What kind of model can I use to do that?

(I'm also thinking about things such as "evaluating the value of the clients and applying this to the similar prospects", and "evaluating the chance each prospect has of going out of business" to further refine my scoring, but that's somewhat outside the scope of my question.)

Thanks

Solution

I faced almost exactly the same scenario a year and a half ago -- basically what you have is a variation of the one-class classification (OCC) problem, specifically PU-learning (learning from Positive and Unlabelled data). You have a known, labelled positive dataset (clients) and an unlabelled dataset of prospects (some of which are client-like and some of which are not). Your task is to identify the most client-like of the prospects and target them. This hinges on the assumption that prospects that look most like your clients are more likely to convert than prospects that look less like them.

The approach we settled upon used a procedure called the spy technique. The basic idea is that you take a sample from your known positive class and inject it into your unlabelled set. You then train a classifier on this combined data and run the unlabelled set back through the trained classifier, assigning each instance a probability of being a positive class member. The intuition is that the injected positives (the so-called spies) should behave similarly to the true positive instances, as reflected by their posterior probabilities. Setting a threshold on these probabilities lets you extract reliable negative instances from the unlabelled set. Now, having both positive and negative labelled data, you can build a classifier using any standard classification algorithm you choose. In essence, the spy technique bootstraps your data to provide the negative instances needed for proper training.

[Figure: the spy technique]
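To make the procedure concrete, here is a minimal sketch in Python. It assumes scikit-learn and NumPy, with `X_pos` holding the client features and `X_unl` the prospect features; the choice of logistic regression as the base classifier, the 15% spy fraction, and the 5th-percentile threshold are illustrative assumptions on my part, not part of the original answer.

```python
# Rough sketch of the spy technique for PU-learning (illustrative, not the
# exact pipeline from the answer). X_pos / X_unl are assumed to be NumPy
# feature matrices for known clients and unscored prospects respectively.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def spy_pu_scores(X_pos, X_unl, spy_frac=0.15, spy_percentile=5):
    # Step 1: pick a random fraction of the positives to act as "spies"
    # and mix them into the unlabelled set.
    n_pos = len(X_pos)
    spy_mask = np.zeros(n_pos, dtype=bool)
    spy_mask[rng.choice(n_pos, size=max(1, int(spy_frac * n_pos)), replace=False)] = True
    X_spies, X_pos_rest = X_pos[spy_mask], X_pos[~spy_mask]

    # Step 2: train a positives-vs-(unlabelled + spies) classifier.
    X_step1 = np.vstack([X_pos_rest, X_unl, X_spies])
    y_step1 = np.concatenate([np.ones(len(X_pos_rest)),
                              np.zeros(len(X_unl) + len(X_spies))])
    clf1 = LogisticRegression(max_iter=1000).fit(X_step1, y_step1)

    # Step 3: the threshold is a low percentile of the spies' posterior
    # probabilities; unlabelled points scoring below it are taken as
    # "reliable negatives". If none fall below it, lower spy_percentile.
    threshold = np.percentile(clf1.predict_proba(X_spies)[:, 1], spy_percentile)
    X_neg = X_unl[clf1.predict_proba(X_unl)[:, 1] < threshold]

    # Step 4: train a standard classifier on positives vs reliable
    # negatives, then score every prospect with it.
    X_step2 = np.vstack([X_pos, X_neg])
    y_step2 = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    clf2 = LogisticRegression(max_iter=1000).fit(X_step2, y_step2)
    return clf2.predict_proba(X_unl)[:, 1]  # higher = more client-like
```

Using a low percentile of the spies' scores, rather than their minimum, keeps one atypical spy from dragging the threshold down and flooding the "reliable negative" set.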

For starters, you should look into the work of Li and Liu, who have a number of papers exploring OCC and PU-learning.

OCC and PU-learning Papers

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange