How can I dynamically generate bands/groups of data with a similar count in each?

https://stackoverflow.com/questions/4520230

12-10-2019

Question

I want to generate bands dynamically, which will then be used for grouping in reports.

My first thought was to generate the bands by taking the minimum and maximum values and dividing the difference.

For example, suppose you had the salaries for a large group of people:

The lowest paid earns £12,000 a year and the highest paid earns £3,000,000
So I split that into 10 equally sized bands: (£3m - £12k) / 10 = £298,800
So my first band runs from £12k to £310,800 and contains thousands of people
My second band runs from £310k to £610k and has a few hundred
Every other band has only a few people in it
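The effect of the equal-width split described above can be reproduced with a small sketch. Python is used here only for illustration (the question welcomes other languages), and the salary list is made-up, deliberately skewed data rather than anything from the real data set; the helper name `equal_width_band` is likewise hypothetical.

```python
from collections import Counter

def equal_width_band(value, lo, hi, num_bands):
    """Return the 0-based index of the equal-width band containing value."""
    width = (hi - lo) / num_bands
    # Clamp so the maximum value lands in the last band, not one past it.
    return min(int((value - lo) / width), num_bands - 1)

# Hypothetical, deliberately skewed salaries: many low earners, very few high ones.
salaries = [12_000] * 5 + [20_000] * 900 + [45_000] * 90 + [400_000] * 4 + [3_000_000]

lo, hi = min(salaries), max(salaries)
width = (hi - lo) / 10  # (3,000,000 - 12,000) / 10
counts = Counter(equal_width_band(s, lo, hi, 10) for s in salaries)

print(width)                   # 298800.0
print(sorted(counts.items()))  # [(0, 995), (1, 4), (9, 1)]
```

Almost everyone ends up in band 0 and seven of the ten bands are empty, which is the problem the question is trying to avoid.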

So this isn't actually very useful. If I were creating the bands manually, I would want roughly similar numbers of people in each, something like: £12k-£14k, £14k-£18k, £18k-£25k, £25k-£35k, ..., £1.5m-£3m

This is just one example: there could be many different distributions.

I'm looking for an algorithm to generate the bands, so that users enter how many bands they want and the data is grouped into that many bands with a similar number of items in each.

The banding needs to be fast: I can't loop through the entire data set.

The application is C# on top of SQL, but solutions in other languages are welcome.

Solution

I think you're asking how to query an existing data set into "bands" ...

If that's the case, Oracle supports aggregate functions for this purpose. There should be equivalents in other SQL implementations.

Other suggestions

Have you looked at NTILE? SQL Server and most other DBMSs support it.

For instance:

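select b.band, count(*) as people, min(b.valuefield) as band_min, max(b.valuefield) as band_max
from (
    -- ntile(10) numbers the rows 1..10 in valuefield order, with near-equal row counts per band
    select ntile(10) over (order by valuefield) as band, valuefield
    from your_table ) b   -- your_table is a placeholder; "table" itself is a reserved word
group by b.band
order by b.band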
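To make it clearer what NTILE does with the rows, here is a rough pure-Python imitation of its grouping rule, offered as a sketch rather than a replacement for the query. It assumes the usual NTILE behaviour (group sizes differ by at most one, with the larger groups first); the function name `ntile` and the sample values are invented for illustration.

```python
def ntile(sorted_values, num_bands):
    """Split already-sorted values into num_bands groups the way NTILE does:
    group sizes differ by at most one, and the larger groups come first."""
    base, extra = divmod(len(sorted_values), num_bands)
    groups, start = [], 0
    for band in range(num_bands):
        size = base + (1 if band < extra else 0)
        groups.append(sorted_values[start:start + size])
        start += size
    return groups

# Hypothetical salaries in thousands of pounds.
values = sorted([12, 13, 14, 15, 18, 21, 25, 30, 35, 3000])

# Mirror the query's output columns: band, count, min, max.
for band, group in enumerate(ntile(values, 3), start=1):
    print(band, len(group), group[0], group[-1])
```

With ten values and three bands the groups hold 4, 3 and 3 values, so the single outlier of 3000 only stretches the top band instead of emptying the others.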

You are looking at the problem from the wrong point of view. Instead of looking at the salary, look at the ordered position of the person in the sorted range of salaries. Put the algorithm aside for a second and think about it mathematically.

Take all your people and sort them by salary. Now number them sequentially from 1 up to n, the last one being the person with the highest salary. If you need m groups, then each group contains n/m people. So the first salary band goes from 0 up to person[n/m].Salary, the second goes from there to person[2*n/m].Salary, and so on up.

In C# you can do this fairly efficiently in Linq, with something like the code below. This is untested code and a concept rather than a final solution; there are probably some edge conditions that I haven't thought through properly.
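The position-based idea above can be sketched in a few lines of Python, purely as an illustration. The function name `band_upper_bounds` and the sample salaries are assumptions; the boundaries are read from positions n/m, 2n/m, ... of the sorted list, as the paragraph describes.

```python
def band_upper_bounds(salaries, num_bands):
    """Return the upper salary boundary of every band except the last,
    read from positions n/m, 2n/m, ... of the sorted list."""
    ordered = sorted(salaries)
    n = len(ordered)
    # k * n // num_bands is always < n for k < num_bands, so no index error.
    return [ordered[(k * n) // num_bands] for k in range(1, num_bands)]

# Hypothetical salaries in thousands of pounds.
salaries = [3000, 12, 35, 13, 30, 14, 25, 15, 21, 18]
print(band_upper_bounds(salaries, 5))  # [14, 18, 25, 35]
```

The last band needs no boundary of its own because it simply runs up to the maximum. Note that many identical salaries sitting on a boundary can still make the bands uneven.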

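// Requires: using System.Collections.Generic; using System.Linq;
// Assumes People.Salary is an int and that the table is not empty.
List<int> GetBands(int numBands)
{
    using (var db = new MyContext())
    {
        var salaryBands = new List<int>();
        var count = db.People.Count();
        var salaries = db.People.OrderBy(item => item.Salary)
                                .Select(item => item.Salary);
        int skipCount = count / numBands;
        // Collect the upper boundary of every band except the last one;
        // the last band simply runs up to the maximum salary.
        // Starting at 1 and stopping before numBands avoids skipping past
        // the end of the sequence, where First() would throw.
        for (int segmentNum = 1; segmentNum < numBands; segmentNum++)
        {
            salaryBands.Add(salaries.Skip(segmentNum * skipCount).First());
        }
        return salaryBands;
    }
}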

First observation: you want a log-like scale, as opposed to a straight linear one.

Second observation: I usually build large sample datasets (akin to your given example) and then look for my common factors and derive a formulaic system from the actual data. Can you posit some more scenarios?
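One way to read the "log-like" suggestion in the first observation is to space the band edges evenly on a logarithmic scale, so that each band covers the same ratio rather than the same width. The sketch below is only an interpretation of that remark, with a hypothetical function name, and it assumes the minimum value is greater than zero.

```python
def log_spaced_edges(lo, hi, num_bands):
    """Return num_bands + 1 band edges evenly spaced on a log scale.
    Requires 0 < lo < hi."""
    ratio = (hi / lo) ** (1 / num_bands)
    edges = [lo * ratio ** k for k in range(num_bands)]
    edges.append(hi)  # use the exact maximum to avoid floating-point drift
    return edges

# The salary range from the question: £12,000 to £3,000,000 in 10 bands.
for lower, upper in zip(*(lambda e: (e[:-1], e[1:]))(log_spaced_edges(12_000, 3_000_000, 10))):
    print(f"£{lower:,.0f} - £{upper:,.0f}")
```

Unlike a rank-based split, this does not guarantee similar counts per band; it only helps when the data happens to be spread roughly evenly on a log scale, which is why checking it against real sample data, as the second observation suggests, still matters.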

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow