Best way for data preparation to have accurate prediction

https://datascience.stackexchange.com/questions/18952

22-10-2019

Question

I'm trying to predict whether an opportunity will be won or lost using Azure Machine Learning Studio. However, I'm still at the data preparation stage.

In my database I have an opportunity table and a products table.

For example, one opportunity can have multiple products. Should I combine the multiple products into a single record?

Will it affect the prediction if we have duplicate records for an opportunity, as in (a), or is it better to have one record per opportunity to feed to ML Studio? And if so, which is the better approach, (b) or (c)?

Approach a

Approach b

Approach c

oppid | first product | first technology | 2nd product | 2nd technology
1     | out-services  | active directory | TRN-items   | Adobe Acrobat

Solution

Simplifying the data may make the model more stable, but it also removes the model's ability to use more specific input criteria. For example, moving from Approach A to Approach B aggregates specific products into product categories. If your model trains successfully on Approach A, it will be able to predict based on specific products. If it trains successfully on Approach B, it will only be able to predict based on product category, and you will have to convert products into their categories before supplying them to the model.
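As a rough illustration of that aggregation step, here is a minimal Python sketch that collapses product-level rows into one record per opportunity with a flag per product category. The tables for Approaches A and B are not reproduced here, so the row layout, the `won` label column, the `PRODUCT_CATEGORY` mapping and its category names are all assumptions made for the example, not the asker's actual schema.

```python
# Hypothetical product-level rows: one row per (opportunity, product).
rows = [
    {"oppid": 1, "product": "out-services", "won": 1},
    {"oppid": 1, "product": "TRN-items", "won": 1},
    {"oppid": 2, "product": "TRN-items", "won": 0},
]

# Hypothetical mapping from specific products to broader categories.
PRODUCT_CATEGORY = {
    "out-services": "services",
    "TRN-items": "training",
}


def to_category_records(rows, mapping):
    """Collapse product-level rows into one record per opportunity,
    with a 0/1 flag for each product category."""
    categories = sorted(set(mapping.values()))
    records = {}
    for row in rows:
        # Create the opportunity's record the first time it is seen.
        rec = records.setdefault(
            row["oppid"],
            {
                "oppid": row["oppid"],
                "won": row["won"],
                **{f"has_{c}": 0 for c in categories},
            },
        )
        # Mark the category of this row's product as present.
        rec[f"has_{mapping[row['product']]}"] = 1
    return list(records.values())


records = to_category_records(rows, PRODUCT_CATEGORY)
for rec in records:
    print(rec)
```

The same mapping has to be applied to new opportunities before scoring them, since the model only ever sees categories.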

So, to answer your question: the number of data samples you have determines how much you need to aggregate and simplify your data. The data in its most detailed form may fail to train the model properly, in which case Approach B is the best next step.
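One way to get a feel for whether the detailed data is too sparse is to count how many opportunities each product appears in. The sketch below does that in plain Python. The row layout and the `MIN_OPPORTUNITIES` threshold are assumptions for illustration only, and a sensible cutoff depends on your data and model.

```python
from collections import Counter

# Hypothetical threshold; tiny here only because the sample data is tiny.
MIN_OPPORTUNITIES = 2


def rare_products(rows, min_count):
    """Return products that appear in fewer than min_count distinct opportunities."""
    # De-duplicate so a product listed twice on one opportunity counts once.
    seen = {(row["oppid"], row["product"]) for row in rows}
    counts = Counter(product for _, product in seen)
    return sorted(p for p, n in counts.items() if n < min_count)


# Hypothetical product-level rows: one row per (opportunity, product).
rows = [
    {"oppid": 1, "product": "out-services"},
    {"oppid": 1, "product": "TRN-items"},
    {"oppid": 2, "product": "TRN-items"},
    {"oppid": 3, "product": "TRN-items"},
]

sparse = rare_products(rows, MIN_OPPORTUNITIES)
print(sparse)
```

Products that show up in this list are candidates for being merged into a broader category rather than kept as separate inputs.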

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange