Question

I am studying classification algorithms using decision tree approach in Python. I would have some questions on this topic, specifically regarding the target (y) in my dataset.

I have a dateset made by 20000 observations and a few fields:

  • Customer
  • recorded date
  • amount
  • status (if married or not)
  • children (if any children in the family)
  • nationality (if American or not)

And so on.

Most of these fields are binary (yes/no). Based on this I would like to determine if this customer is trustworthy or not. As you can see, I have no label about trusting, but I have some initial information: for example the amount. If the amount is 0 or < 0, the customer has no money so he/she can be considered not trusted. Then, I could consider status: if he/she is married, then it could be considered trustworthy, as there could be another salary to take into account. And so on. My doubt are in splitting my dataset, as it asks about y variable. What would it be in this case? I have no explicit target..

Was it helpful?

Solution

When you do not have any target, and you want to label them as trustworthy or not, so here you are using your psychology that when customer is not earning money, or not married, then he/she is a bad customer. But manually labeling the datasets with this psychology may or may not be correct. Because you do not have any target variable to validate your labeling.

Therefore, as @Kappil C has suggested, first you need to categorize your data using some clustering algorithm to understand how your population is divided. It can be trustworth Vs non-trustworthy (2 classes). Or it can be super-trust-worthy, trust-worthy, non-trust-worthy (3 or more classes).

Once these classes are tagged, you are ready to proceed with any Supervised learning algorithm.

enter image description here

As against to this approach, you can proceed with simple rule-based technique also using basic statistics where you will understand each variable individually, and will create multiple rules independently. But again, you need to have target to find confusion matrix rule wise

Example:

People with age > 50 -> Super-trust-worthy

People with age < 18 -> Non-trust-worthy

and these rules will be helpful in streamlining your business.

OTHER TIPS

Use Clustering under unsupervised learning. That will categorize the customer based on similar parameters. You can define the number of cluster you need to form, in you case it is two(trustworthy and not). If there are more features it will be more helpful for the algorithm.

This might help.

https://towardsdatascience.com/an-introduction-to-clustering-algorithms-in-python-123438574097

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top