How to dynamically generate bands/groups of data with similar numbers in each?

https://stackoverflow.com/questions/4520230

12-10-2019

Question

I want to dynamically generate bands, that will then be grouped in reports.

My first thought was to generate the bands by taking the minimum value and the maximum value and then dividing up the difference.

For instance suppose you had the salaries for a large group of people:

The lowest paid earns £12,000 a year and the highest earns £3,000,000
So I split that into 10 bands of equal width: (£3,000,000 - £12,000) / 10 = £298,800
So my first band goes from £12,000 to £310,800 and gets thousands of people in it
My second band goes from £310,800 to £609,600 and has a few hundred
Every other band has a few people in each one

So this isn't actually very useful. If I were to create the bands manually I'd want roughly similar numbers of people in each, something like: £12k-£14k, £14k-£18k, £18k-£25k, £25k-£35k, ..., £1.5 million-£3 million
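The equal-width arithmetic in the example can be sketched in a few lines of Python. This is only an illustration: the function names and the tiny `sample` list are made up for the demonstration and are not real salary data.

```python
def equal_width_edges(lowest, highest, num_bands):
    """Return num_bands + 1 evenly spaced band edges from lowest to highest."""
    width = (highest - lowest) / num_bands
    return [lowest + i * width for i in range(num_bands + 1)]


def count_per_band(values, edges):
    """Count how many values fall in each band; the top edge is inclusive."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        # Clamp so the maximum value lands in the last band, not past it.
        idx = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[idx] += 1
    return counts


edges = equal_width_edges(12_000, 3_000_000, 10)
# Hypothetical skewed sample: most salaries are low, a few are very high.
sample = [12_000, 15_000, 18_000, 22_000, 30_000, 45_000, 400_000, 3_000_000]
counts = count_per_band(sample, edges)
print(edges[:3])   # first band runs from 12,000 to 310,800
print(counts)      # almost everything piles into the first band
```

Even with eight made-up values the problem is visible: six of them share the first band and most of the other bands are empty.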

This is just one example - there could be lots of different distributions.

I'm looking for an algorithm to generate the bands, so users would enter how many bands they want and the data would be grouped into that many bands with a similar number in each.

The banding needs to be quick - I can't just loop through the entire dataset.

The application is C# on top of SQL, but solutions from other languages welcome.

Solution

I think you are asking how to split an existing dataset into bands with a query.

If so, Oracle supports the NTILE analytic function for this purpose. There should be equivalents in other SQL implementations.

OTHER TIPS

Have you looked at NTILE? SQL Server and most other DBMSs support it.

For instance:

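select b.band, count(*) as people, min(b.valuefield) as band_min, max(b.valuefield) as band_max
from (
    -- ntile(10) numbers the rows 1..10 in order of valuefield, giving near-equal counts per band
    -- your_table is a placeholder for the real table name
    select ntile(10) over (order by valuefield) as band, valuefield
    from your_table ) b
group by b.band
order by b.band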
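To make it clearer what the query above returns, here is a rough in-memory imitation of NTILE in Python. It is a sketch of the behaviour, not how any database implements it: the function names are hypothetical, the input is assumed to be sorted already, and when the rows do not divide evenly the earlier bands each get one extra row.

```python
def ntile(sorted_values, num_bands):
    """Pair each value with a 1-based band number, mimicking NTILE.

    Assumes sorted_values is already in ascending order. When the rows do
    not divide evenly, the earlier bands each receive one extra row.
    """
    base, extra = divmod(len(sorted_values), num_bands)
    bands = []
    for band in range(1, num_bands + 1):
        size = base + (1 if band <= extra else 0)
        bands.extend([band] * size)
    return list(zip(bands, sorted_values))


def summarise(banded):
    """Return {band: (count, min, max)}, like the GROUP BY in the query."""
    summary = {}
    for band, value in banded:
        count, lo, hi = summary.get(band, (0, value, value))
        summary[band] = (count + 1, min(lo, value), max(hi, value))
    return summary


# Made-up salaries for illustration only.
salaries = sorted([12_000, 14_000, 18_000, 25_000, 35_000,
                   50_000, 80_000, 150_000, 600_000, 3_000_000])
result = summarise(ntile(salaries, 3))
for band, (count, lo, hi) in result.items():
    print(band, count, lo, hi)
```

With ten rows and three bands the counts come out as 4, 3 and 3, and each band's min and max give the salary range to show in the report.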

You are looking at the problem from the wrong point of view. Instead of looking at the salary look at the ordered position of the person in the sorted range of salaries. Put the algorithm aside for a second and think about it mathematically.

Take all your people and sort them by salary. Now sequentially number them from 1 on up to n, the last one with the highest salary. If you need m groups, then each group contains n/m people. So the first salary band goes from 0 up to person[n/m].Salary, the second goes from there to person[2*n/m].Salary and so on up.
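The positional idea described above might be sketched like this in Python. It sorts an in-memory list purely for illustration (in practice the database would do the ordering), and `band_upper_bounds` is a hypothetical name. The last band is closed with the maximum value so that every person falls into some band.

```python
def band_upper_bounds(values, num_bands):
    """Return the upper boundary of each band.

    Boundary k is the value at sorted position k * n / m (integer division);
    the final band ends at the maximum value.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or num_bands < 1:
        return []
    bounds = [ordered[min(k * n // num_bands, n - 1)]
              for k in range(1, num_bands)]
    bounds.append(ordered[-1])
    return bounds


# Made-up salaries for illustration only.
salaries = [12_000, 13_000, 14_000, 16_000, 18_000, 21_000,
            25_000, 30_000, 35_000, 60_000, 150_000, 3_000_000]
bounds = band_upper_bounds(salaries, 4)
print(bounds)
```

With 12 people and 4 bands the boundaries fall at positions 3, 6 and 9 of the sorted list, so each band holds three people regardless of how skewed the salaries are.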

In C# you can do this fairly efficiently with LINQ, something like the following. This is untested code that illustrates the concept rather than a final solution, and there are probably edge cases I haven't thought through.

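// Requires: using System.Collections.Generic; using System.Linq;
// Returns the upper salary boundary of each band (assumes Salary is an int).
List<int> GetBands(int numBands)
{
    using (var db = new MyContext())
    {
        var salaryBands = new List<int>();
        var count = db.People.Count();
        var salaries = db.People.OrderBy(item => item.Salary)
                                .Select(item => item.Salary);
        int skipCount = count / numBands;
        // Boundary k is the salary at sorted position k * skipCount.
        // Stop before the last band so Skip never runs past the end of the data.
        for (int segmentNum = 1; segmentNum < numBands; segmentNum++)
        {
            salaryBands.Add(salaries.Skip(segmentNum * skipCount).First());
        }
        // The last band ends at the highest salary.
        salaryBands.Add(salaries.Max());
        return salaryBands;
    }
}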

First observation, you want a log-like graph, as opposed to straight linear.

Second observation: I usually build large sample datasets (akin to your given example) and then look for my common factors and derive a formulaic system from the actual data. Can you posit some more scenarios?
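One possible reading of the "log-like" observation is to space the band edges evenly on a logarithmic scale instead of a linear one. The sketch below assumes that interpretation; `log_band_edges` is a hypothetical name, and the approach needs strictly positive values. Note that it only evens out the counts when the data is roughly log-uniform, so it does not guarantee similar numbers in each band the way a rank-based split does.

```python
import math


def log_band_edges(lowest, highest, num_bands):
    """Return num_bands + 1 edges evenly spaced on a logarithmic scale."""
    if lowest <= 0 or highest <= lowest:
        raise ValueError("need 0 < lowest < highest")
    step = (math.log(highest) - math.log(lowest)) / num_bands
    # Each edge is the previous one multiplied by the same constant ratio.
    return [math.exp(math.log(lowest) + i * step) for i in range(num_bands + 1)]


edges = log_band_edges(12_000, 3_000_000, 10)
print([round(e) for e in edges])
```

For the salary example the edges grow by a constant factor of about 1.74 per band, so the low end of the range gets many narrow bands and the high end gets a few wide ones.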

Licensed under: CC-BY-SA with attribution
