Binning long-tailed / pareto data before clustering
-
16-10-2019 - |
Question
I want to cluster a set of long-tailed / pareto-like data into several bins (actually the bin number is not determined yet).
Which algorithm or model would anyone recommend?
Solution
There are several approaches. You can start from the second one.
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B-A)/N
.The most straightforward - Outliers may dominate presentation - Skewed data is not handled well.
Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately same number of samples
- Good data scaling
- Managing categorical attributes can be tricky.
Other Methods
Rank
: The rank of a number is its size relative to other values of a numerical variable. First, we sort the list of values, then we assign the position of a value as its rank. Same values receive the same rank but the presence of duplicate values affects the ranks of subsequent values (e.g., 1,2,3,3,5). Rank is a solid binning method with one major drawback, values can have different ranks in different lists.Quantiles (median, quartiles, percentiles, ...)
: Quantiles are also very useful binning methods but like Rank, one value can have different quantile if the list of values changes.Math functions
: For example, logarithmic binning is an effective method for the numerical variables with highly skewed distribution (e.g., income).
Entropy-based Binning
Entropy based method uses a split approach. The entropy (or the information content) is calculated based on the class label. Intuitively, it finds the best split so that the bins are as pure as possible that is the majority of the values in a bin correspond to have the same class label. Formally, it is characterized by finding the split with the maximal information gain.