Question

I'm currently working on a part-time project that involves predicting the likelihood of customers buying a product using data analytics. The company I'm interning with has given me a CSV file with all current customers and their attributes, and needs me to build a prediction model to classify whether prospects are worth pursuing or not.

However, since they have given me a list of all their successful customers (or leads, in marketing terms), is it possible to train a model like K-means with PCA (and k-fold cross-validation?) and get results? I have to train my model to fit a value, say 10, which I will add to the CSV, and then test it further.

I am using pandas. Another issue is that there is a lot of demographic data, but I managed to handle it using get_dummies(). The number of columns grew from about 10 to 47, though.

I'm just entering the world of data analysis, so I'm a bit clueless as to what path to take or whether what I'm doing is right.

The exact analysis is called Predictive Lead Scoring/Analysis, in marketing terminology.

EDIT 1

I followed what @HonzaB suggested and got a decision tree. However, since I had 40 columns, it looks like this: [screenshot of the decision tree, not reproduced here]

I had to take a screenshot of it, as it was over 2 MB.

Obviously it's really big, and I have to prune the tree somehow, but I'm not sure how to do so in pandas. Also, is there any way that I can just output the best characteristics as a text file or something that can be understood without the help of a data scientist?

EDIT 2

I've read up on a question that is quite similar to what I need to do: "Predictive modeling based on RFM scoring indicators". In it there is a link to a paper ([Data Mining using RFM Analysis][3]) that talks about rule-based classification. Ideally this is what I need to do, and it is what best suits the company's needs.

I want to know if it's possible to do this in Python/pandas. Or is it possible to traverse the decision tree and generate the rules?

EDIT 3

I found another website, "Decision trees in python again, cross-validation", that uses cross-validation and hyperparameter optimisation to get a better solution. They have also included Python code to get readable code. It's a feasible solution; however, it's quite complicated and I can't understand how it works. Will it work?

PS: I solved the "really big decision tree" problem from Edit 1 by reducing max-depth. I didn't know about that option at all.

Solution

First, I would ask the company whether there is more information about the customers. You mentioned you have 10 original columns, which might not be enough to make a good prediction. The same goes for the number of rows. Usually, the more data, the better the model, up to a certain limit.

Second, encoding categorical features (demographic data in your case) is a good thing to do. The increased number of columns doesn't need to bother you in your case.
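As a rough illustration of the encoding step, here is a minimal sketch with pandas. The toy table and its column names (age, region, segment) are made up for the example and are not from your CSV.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
customers = pd.DataFrame({
    "age": [25, 41, 33, 58],
    "region": ["north", "south", "north", "west"],
    "segment": ["retail", "corporate", "retail", "retail"],
})

# One-hot encode only the categorical columns; numeric columns pass through.
encoded = pd.get_dummies(customers, columns=["region", "segment"])

# Each category value becomes its own column, e.g. region_north, segment_retail.
print(encoded.columns.tolist())
```

Three input columns become six here, which is the same effect as your jump from about 10 to 47 columns.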

For the task itself, yes, it is doable. Start simple: check the importance of each feature (I would leave PCA for later), pick a few models and test them.
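One possible way to do that, sketched with scikit-learn (my assumption; any library with similar tools would do). The data below is synthetic and only stands in for your encoded customer table, and the three models are arbitrary picks rather than a recommendation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded customer table (assumption).
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Impurity-based feature importances from a random forest, highest first.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)

# Compare a few candidate models with 5-fold cross-validated accuracy.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
print(scores)
```

With your own data, X would be the encoded feature columns and y the column saying whether the lead converted.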

Also consider training a simple decision tree. Its results can be easily visualized in a way that business people understand, as opposed to black-box methods such as K-Means.
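A minimal sketch of such a tree, again assuming scikit-learn and synthetic stand-in data. Limiting max_depth keeps the tree small, and export_text prints the learned splits as plain-text rules that could be saved to a file.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the encoded customer table (assumption).
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# A shallow tree: max_depth caps how many splits lead to any leaf.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Render the fitted tree as indented if/else style rules.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Each path from the top of the printed output to a "class:" line reads as one rule, which is usually easier to hand to a non-technical reader than a large image of the tree.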

Licensed under: CC-BY-SA with attribution