Question

I want to cluster a set of long-tailed / pareto-like data into several bins (actually the bin number is not determined yet).

Which algorithm or model would anyone recommend?

Was it helpful?

Solution

There are several approaches. You can start from the second one.

Equal-width (distance) partitioning:

  • It divides the range into N intervals of equal size: uniform grid

  • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N.

  • The most straightforward - Outliers may dominate presentation - Skewed data is not handled well.

Equal-depth (frequency) partitioning:

  • It divides the range into N intervals, each containing approximately same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky.

Other Methods

  • Rank: The rank of a number is its size relative to other values of a numerical variable. First, we sort the list of values, then we assign the position of a value as its rank. Same values receive the same rank but the presence of duplicate values affects the ranks of subsequent values (e.g., 1,2,3,3,5). Rank is a solid binning method with one major drawback, values can have different ranks in different lists.
  • Quantiles (median, quartiles, percentiles, ...): Quantiles are also very useful binning methods but like Rank, one value can have different quantile if the list of values changes.
  • Math functions: For example, logarithmic binning is an effective method for the numerical variables with highly skewed distribution (e.g., income).

Entropy-based Binning

Entropy based method uses a split approach. The entropy (or the information content) is calculated based on the class label. Intuitively, it finds the best split so that the bins are as pure as possible that is the majority of the values in a bin correspond to have the same class label. Formally, it is characterized by finding the split with the maximal information gain.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top