Question

Does anyone know a good algorithm for clustering on both discrete and continuous attributes? I am trying to identify groups of similar customers, and each customer has both discrete and continuous attributes (think customer type, amount of revenue generated by the customer, geographic location, etc.).

Traditional algorithms like K-means or EM work for continuous attributes. What if we have a mix of continuous and discrete attributes?


Solution

If I remember correctly, the COBWEB algorithm can work with discrete attributes.

You can also apply various 'tricks' to the discrete attributes in order to create meaningful distance metrics.

You could search for clustering of categorical/discrete attributes; one of the first hits is ROCK: A Robust Clustering Algorithm for Categorical Attributes.

OTHER TIPS

R is a great tool for clustering. The standard approach would be to calculate a dissimilarity matrix on your mixed data using daisy, then cluster with that matrix using agnes (both are in the cluster package).
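To give an idea of what such a dissimilarity matrix contains, here is a minimal Python sketch of a Gower-style dissimilarity, which is, as far as I know, what daisy falls back to for mixed columns. The function name `gower_matrix` and the sample customer records are made up for illustration, and the sketch ignores missing values, ordinal attributes and per-attribute weights.

```python
def gower_matrix(rows, numeric_keys, categorical_keys):
    """Pairwise Gower-style dissimilarities for records with mixed attributes."""
    # Range of each numeric attribute, used to scale differences into [0, 1].
    ranges = {}
    for key in numeric_keys:
        values = [row[key] for row in rows]
        ranges[key] = max(values) - min(values)

    n_attrs = len(numeric_keys) + len(categorical_keys)
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for key in numeric_keys:
                # A constant attribute (range 0) contributes nothing.
                if ranges[key] > 0:
                    total += abs(rows[i][key] - rows[j][key]) / ranges[key]
            for key in categorical_keys:
                # Simple matching: 0 if the categories agree, 1 otherwise.
                total += 0.0 if rows[i][key] == rows[j][key] else 1.0
            # Average the per-attribute contributions.
            matrix[i][j] = matrix[j][i] = total / n_attrs
    return matrix


# Hypothetical sample data, not taken from the question.
customers = [
    {"type": "retail", "region": "north", "revenue": 100.0},
    {"type": "retail", "region": "north", "revenue": 200.0},
    {"type": "wholesale", "region": "south", "revenue": 1100.0},
]
dissimilarity = gower_matrix(customers, ["revenue"], ["type", "region"])
```

The resulting symmetric matrix can be handed to any clustering method that accepts precomputed dissimilarities, such as hierarchical clustering.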

The cba package on CRAN includes a function to cluster on binary predictors based on ROCK.

You could also look at affinity propagation as a possible solution. But to overcome the continuous/discrete dilemma, you need to define a function that assigns values to the discrete states.

I would actually present pairs of the discrete attribute values to users and ask them to define their proximity. You would present them with a scale ranging from "synonym" to "very foreign" or similar. With many people doing this, you will end up with a widely accepted proximity function for the non-linear attribute values.
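A rough sketch of how such ratings could be turned into a proximity function, assuming each rating is a number between 0 ("synonym") and 1 ("very foreign"). The names `build_proximity` and `category_distance`, the 0 to 1 scale and the sample ratings are all hypothetical.

```python
from collections import defaultdict


def build_proximity(ratings):
    """Average the scores given to each unordered pair of category values.

    ratings: iterable of (value_a, value_b, score), with score in [0, 1],
    where 0 means "synonym" and 1 means "very foreign" (assumed scale).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, score in ratings:
        pair = frozenset((a, b))  # order of the two values does not matter
        sums[pair] += score
        counts[pair] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def category_distance(table, a, b, default=1.0):
    """Look up the averaged distance; unrated pairs get the default."""
    if a == b:
        return 0.0
    return table.get(frozenset((a, b)), default)


# Hypothetical ratings collected from several users.
ratings = [
    ("retail", "wholesale", 0.75),
    ("wholesale", "retail", 0.25),
    ("retail", "government", 1.0),
]
table = build_proximity(ratings)
```

Here `category_distance(table, "retail", "wholesale")` is the mean of the two ratings for that pair, and the same lookup can serve as the categorical part of whatever distance or similarity function the clustering algorithm uses.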

How about transforming each of your categorical attributes into a series of N-1 binary indicator attributes (where N is the number of categories)? You shouldn't be afraid of high dimensionality, as a sparse representation (such as Mahout's SequentialAccessSparseVector) can be employed. Once you do that, you can use classical K-means or any other standard numeric-only clustering algorithm.
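The indicator transformation might look like this in plain Python. The function name `indicator_encode` and the sample records are made up, and choosing the alphabetically first category as the all-zeros baseline is just one possible convention.

```python
def indicator_encode(rows, categorical_keys):
    """Replace each categorical attribute with N-1 binary indicator columns."""
    # Sorted category list per attribute; the first category is the
    # baseline and is represented by all indicators being zero.
    categories = {
        key: sorted({row[key] for row in rows}) for key in categorical_keys
    }
    encoded = []
    for row in rows:
        # Copy the non-categorical attributes unchanged.
        out = {k: v for k, v in row.items() if k not in categories}
        for key, values in categories.items():
            for value in values[1:]:
                out[f"{key}={value}"] = 1.0 if row[key] == value else 0.0
        encoded.append(out)
    return encoded


# Hypothetical sample data, not taken from the question.
customers = [
    {"type": "retail", "revenue": 100.0},
    {"type": "wholesale", "revenue": 250.0},
    {"type": "government", "revenue": 80.0},
]
encoded = indicator_encode(customers, ["type"])
```

With three customer types this yields two indicator columns, `type=retail` and `type=wholesale`, with `government` as the baseline. Before running K-means on the result, it is probably worth rescaling the continuous columns, since otherwise an attribute like revenue will likely dominate the Euclidean distance.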

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow