Binning Technique to discretize continuous data

https://stackoverflow.com/questions/13458603

30-11-2021

Question

I have the following data set:

Columns 1 through 6

1.0000         0    0.9954   -0.0589    0.8524    0.0231
1.0000         0    1.0000   -0.1883    0.9304   -0.3616
1.0000         0    1.0000   -0.0336    1.0000    0.0049
1.0000         0    1.0000   -0.4516    1.0000    1.0000
1.0000         0    1.0000   -0.0240    0.9414    0.0653
1.0000         0    0.0234   -0.0059   -0.0992   -0.1195
1.0000         0    0.9759   -0.1060    0.9460   -0.2080
     0         0         0         0         0         0
1.0000         0    0.9636   -0.0720    1.0000   -0.1433

I am trying to build a decision tree using binary splits. One of the problems is that the data is continuous, and my current implementation becomes computationally intensive when I leave the data as it is and search for splits. I must say this would not be that bad if you were building just one classifier.

In my case I am doing ten-fold cross-validation and increasing the number of classifiers from 5 to 50 (bagging). I was thinking of binning the data into buckets of width 0.2, but I realized there are negative numbers. I am using MATLAB for my implementation. I am a MATLAB newbie and not sure whether there are predefined methods to handle scenarios like this.

Solution

Not sure whether this solves your question completely, but if your problem is defining the 'buckets' dynamically you can do this:

% Find the minimum and maximum of the matrix
Mmin = min(M(:));
Mmax = max(M(:));

% Assume you have a matrix M with positive and negative values, and want it in bins of 0.2
buckets = Mmin:0.2:Mmax;

% OR assume you want to spread them equally over a fixed number of bucket edges, say 100
% (linspace takes the number of points as its third argument)
buckets = linspace(Mmin,Mmax,100);
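If the bucket edges should sit on multiples of 0.2 regardless of where the minimum falls (so that negative values get buckets such as -0.4 to -0.2), one option is to build the edges from integer multiples of the width. This is only a sketch: the variable names `w` and `edges` are my own, and the sample vector is column 4 of the data in the question.

```matlab
% Column 4 of the question's data, which contains negative values
v = [-0.0589; -0.1883; -0.0336; -0.4516; -0.0240; -0.0059; -0.1060; 0; -0.0720];

w = 0.2;  % bucket width (assumed, taken from the question)

% Work in integer multiples of w so the edges land on ..., -0.2, 0, 0.2, ...
% floor/ceil push the outermost edges outward so they enclose all the data.
edges = (floor(min(v) / w):ceil(max(v) / w)) * w;
```

For this column the edges come out as -0.6, -0.4, -0.2 and 0, so the negative values need no special handling.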

EDIT:

Suppose you want to divide the matrix based on the values of one column, say column 3; then you can do it like this:

% Define the relevant column as a vector for easy handling
v = M(:,3);

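% Assume you want to spread them equally over a fixed number of bucket edges, say 100
% (linspace takes the number of points as its third argument)
buckets = linspace(min(v),max(v),100);
% Now see which bucket each row belongs to
bucket_idx = ones(size(v));
for i = 2:length(buckets)
    % Rows with buckets(i-1) < v <= buckets(i) go to bucket i-1;
    % rows equal to min(v) keep the initial value 1
    bucket_idx(v > buckets(i-1) & v <= buckets(i)) = i-1;
end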

This tells you which bucket each row belongs to. It would be nicer to vectorize this, but at the moment this is the quickest solution I can think of. You should be able to solve the rest of the problem once you know which bucket everything belongs to.
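As for vectorizing the loop, one possibility is the built-in `discretize` function, assuming a MATLAB release that has it (R2015a or later). The sketch below uses column 3 of the data in the question and an arbitrary choice of 5 buckets; `nBins` and `edges` are names I made up for the example.

```matlab
% Column 3 of the question's data; in practice use v = M(:,3)
v = [0.9954; 1.0000; 1.0000; 1.0000; 1.0000; 0.0234; 0.9759; 0; 0.9636];

nBins = 5;                                    % assumed number of buckets
edges = linspace(min(v), max(v), nBins + 1);  % nBins buckets need nBins+1 edges

% discretize returns, for each element, the index of the bucket it falls in.
% Each bucket includes its left edge; the last bucket also includes its right edge.
bucket_idx = discretize(v, edges);
% bucket_idx is [5; 5; 5; 5; 5; 1; 5; 1; 5]
```

Values outside the edges would come back as NaN, which cannot happen here because the edges span min(v) to max(v). Note that the edge convention differs slightly from the loop: here a value sitting exactly on an inner edge goes to the bucket on its right.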

Licensed under: CC-BY-SA with attribution