Dividing a list of numbers into roughtly equal totals

https://stackoverflow.com/questions/4151457

08-10-2019
|

문제

I'm aware that there probably isn't a "perfect" solution to my question (this sounds like a variation of the knapsack or the bin packing problem), but here's my scenario:

I want to divide a list of SQL database tables into n (let's say 7) roughly equally-sized piles (so I can spread some maintenance tasks roughly equally across an entire week).

Let's say I have 100 tables (this could be higher or lower, but not likely higher than 5000), ranging from size 1 to size 10,000,000 (larger tables are much less common, of course).

My original idea was to sort the tables alphabetically (pseudo-randomly) then walk through from the beginning, moving to the next group when the total exceeds Sum(Size)/7. For some databases, this will probably work fine, but if two giant tables are right next to each other, then this makes for very unequal groups. (This isn't as unlikely as it sounds, consider two huge tables, Account_History and Account_History_Archive).

Are there any generally accepted techniques for this, that give "good" results with a variety of source data? I'd lean toward a simpler technique rather than a more precise grouping (if the maintenance runs slightly longer on some days than others, its not that big of deal).

해결책

How about sorting the tables by size, then for each table, put it into the day that currently has the smallest total number of rows in it? This means the biggest 7 tables will first spread across the days. Then the 8th biggest will go with the smallest of the first 7, etc. You'll continue to fill in the day with the least amount of work scheduled to it.

Where the little reference tables end up finally probably does not make much difference.

You could invent scenarios where this is no good, but I expect it will work in practice without being too complicated.

다른 팁

I don't know how this rates on the good code scale, but a solution i'd pursue is to put the job list into a priority queue, sorted by most costly, and the worker bins into another priority queue, sorted by least work assigned, and then just pop jobs off of one queue and assign them to the top (least busy) worker bin, until no work remains.

Just for reference, this is how I went about this. I wanted to put the "Buckets" into a persisted table and only "recompute" them every 2 weeks. Otherwise, I feared that if I computed these buckets every day, a table could jump from one bucket to the next. But, I wanted to recompute every so often for schema & DDL modifications. Here's that snippet.

-------------------------------------------------------------------------------------
--Get the total table size (by rows)
-------------------------------------------------------------------------------------


if object_id('tempdb..#Space') is not null
drop table #Space

SELECT 
    TableName = t.NAME,
    Schem = s.name,
    Pages = sum(a.total_pages),
    Grp = row_number() over (order by sum(a.total_pages) desc)
INTO #Space
FROM 
    sys.tables t
INNER JOIN      
    sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN 
    sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN 
    sys.allocation_units a ON p.partition_id = a.container_id
LEFT OUTER JOIN 
    sys.schemas s ON t.schema_id = s.schema_id
WHERE 
    t.NAME NOT LIKE 'dt%' 
    AND t.is_ms_shipped = 0
    AND i.OBJECT_ID > 255 
GROUP BY 
    t.Name, s.name


-------------------------------------------------------------------------------------
--split the tables into 7 buckets by:
    --updating the Grp to the Grp with the lowest cumulative sum of all members by
    --ordering by the current cumulative sum of all members
-------------------------------------------------------------------------------------

declare @ct int = 8


while @ct <= (select max(Grp) from #Space)
begin

    update S
    set Grp = (select top 1 Grp from #Space where Grp < 8 order by sum(Pages) over (partition by Grp) asc)
    from #Space S
    where S.Grp = @ct

    set @ct = @ct + 1

end


insert into AdminTools..TableSpace (TableName
                                    ,Schem
                                    ,Pages
                                    ,Grp
                                    ,GrpPages
                                    ,LoadDate)
select 
    TableName
    ,Schem
    ,Pages
    ,Grp
    ,GrpPages = sum(Pages) over (partition by Grp)
    ,LoadDate = getdate()
from #Space
end

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow