Question

I have a set of 300,000 records of historic customer purchases data. I have started SSAS data mining project to identify best customers.

The split of data: -90% non-buyers -10% buyers

I have used various various algorithms of SSAS (decision trees and neural networks showed best lift) to explore my data.

The goal of the project is to identify/score customers according who is most likely to buy a product.

Currently, I have used all my records for this purpose. It feels that something is missing in the project. I am reading two books now about data mining. Both of them talk about splitting data mining into different sets; however, none of them explain HOW to actually split them.

I believe I need to split may records into 3 sets and re-run the ssas algorithms.

Main questions:

  1. How do I split data into training, validation and test sets 1.1 What ratio of buyers and non-buyers should be in a training set?
  2. How do I score my customers according to most likely to buy a product and least likely to buy a product.
Was it helpful?

Solution

The division of your set could be done randomly as your data set is big and the number of buyers is not too low (10%). However, if you want to be sure that your sets are representative you could take 80% of your buyers samples and 80% of non buyers samples and mix them to build a training set that contains 80% of your total data set and it has the same ratio of buyers-non buyers as the original data set which makes the subsets representative. You may want to divide your dataset not in two subsets but in three: training, crossvalidation and test. If you use a neural networkas you said you should use the crossvalidation subset to tune your model (weight decay, learning rate, momentum...).

Regarding your your second question you could use a neural network as you said and take the output, that will be in the range [0, 1] if you use a sigmoid as the activation function in the output layer, as the probability. I would also recommend you to take a look to collaborative filtering for this task because it would help you to know which products may be a customer interested in using your knowledge of other buyers with similar preferences.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top