How to classify a small and peculiar subset out of a large database?

https://stackoverflow.com/questions/16886976

30-05-2022
|

Question

I have to perform a data mining task on a database containing informations about insurance policies. Each tuple indicates data about a single policy, along with information regarding the agency that issued it, the customer it is referring to and other fields. It is like a product between hypotetical tables Policies, Customers and Agencies. The fields are the following:

Policy Type,ID Number,Policy Status ,Product Description,Product Combinations,Issue Date,Effective Date,Maturity Date,Policy Duration,Loan Duration ,Cancellation Date ,Reason for cancellation,Total Premium,Splitter Premium,ID Partners,ID Agency,Country Agency,ID Zone,Agency potential,Sex Contractor ,Birth Year Contractor,Job Contractor,Sex Insured,Job Insured,Birth Year Insured,Product Area,Legal Form,ID Claim,Year Claim,Status Claim,Provision Claim,Payments Claim

This is an academic task and our professor wants us to identify churn rates, cross-selling and up-selling. I am not quite into the field and therefore I sought those terms on wikipedia. I started with churn rate and it appears to me that in this case I have to characterize the properties of customers whose Policy Status is set to "canceled" and the Reason for cancellation is "customer cancellation".

With Rapid Miner, I tried to apply decision trees and rule mining, but the subset of interest is so small that the output model, despite having a good accuracy overall, has a very very very poor accuracy in predicting canceled policies. This happens because the subset of canceled policies is really small. I also tried to apply the MetaCost operator with a given cost matrix in which the cost of misclassifying canceled policies is outrageously high with respect to the others (like a million times higher), but this did not change the result at all.

My best option now is to use the sequential covering algorithm for rule mining, but rapid miner does not implement it and I would have to code it manually.

Do you have any suggestion on how to build a good model for that small subset of canceled policies, so that we could use it to identify customers that would potentially cancel their policy in the future?

N.B.: since it comes from a real source, albeit anonymized, I cannot disclose the database or any data contained within.

Solution

Did you try Navie Bayes? It works well with small set of data. You can as well try a variant of it like AODE. AODE is not available in Rapid Miner. You should install Weka extension to access AODE in Rapid Miner.

OTHER TIPS

You need to balance your dataset, so that the classes (cancelled / not cancelled) are the same size. This means (temporarily) discarding lots of data.

You can use the Sample operator with the Balance Labels checkbox to do this.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow