Algorithm to Divide a list of numbers into 2 equal sum lists

https://stackoverflow.com/questions/890171

23-08-2019
|

Question

There is a list of numbers.

The list is to be divided into 2 equal sized lists, with a minimal difference in sum. The sums have to be printed.

#Example:
>>>que = [2,3,10,5,8,9,7,3,5,2]
>>>make_teams(que)
27 27

Is there an error in the following code algorithm for some case?

How do I optimize and/or pythonize this?

def make_teams(que):
    que.sort()
    if len(que)%2: que.insert(0,0)
    t1,t2 = [],[]
    while que:
    val = (que.pop(), que.pop())
    if sum(t1)>sum(t2):
        t2.append(val[0])
        t1.append(val[1])
    else:
        t1.append(val[0])
        t2.append(val[1])
    print min(sum(t1),sum(t2)), max(sum(t1),sum(t2)), "\n"

Question is from http://www.codechef.com/problems/TEAMSEL/

Solution

New Solution

This is a breadth-first search with heuristics culling. The tree is limited to a depth of players/2. The player sum limit is totalscores/2. With a player pool of 100, it took approximately 10 seconds to solve.

def team(t):
    iterations = range(2, len(t)/2+1)

    totalscore = sum(t)
    halftotalscore = totalscore/2.0

    oldmoves = {}

    for p in t:
        people_left = t[:]
        people_left.remove(p)
        oldmoves[p] = people_left

    if iterations == []:
        solution = min(map(lambda i: (abs(float(i)-halftotalscore), i), oldmoves.keys()))
        return (solution[1], sum(oldmoves[solution[1]]), oldmoves[solution[1]])

    for n in iterations:
        newmoves = {}
        for total, roster in oldmoves.iteritems():
            for p in roster:
                people_left = roster[:]
                people_left.remove(p)
                newtotal = total+p
                if newtotal > halftotalscore: continue
                newmoves[newtotal] = people_left
        oldmoves = newmoves

    solution = min(map(lambda i: (abs(float(i)-halftotalscore), i), oldmoves.keys()))
    return (solution[1], sum(oldmoves[solution[1]]), oldmoves[solution[1]])

print team([90,200,100])
print team([2,3,10,5,8,9,7,3,5,2])
print team([1,1,1,1,1,1,1,1,1,9])
print team([87,100,28,67,68,41,67,1])
print team([1, 1, 50, 50, 50, 1000])

#output
#(200, 190, [90, 100])
#(27, 27, [3, 9, 7, 3, 5])
#(5, 13, [1, 1, 1, 1, 9])
#(229, 230, [28, 67, 68, 67])
#(150, 1002, [1, 1, 1000])

Also note that I attempted to solve this using GS's description, but it is impossible to get enough information simply by storing the running totals. And if you stored both the number of items and totals, then it would be the same as this solution except you kept needless data. Because you only need to keep the n-1 and n iterations up to numplayers/2.

I had an old exhaustive one based on binomial coefficients (look in history). It solved the example problems of length 10 just fine, but then I saw that the competition had people of up to length 100.

OTHER TIPS

Dynamic programming is the solution you're looking for.

Example with [4, 3, 10, 3, 2, 5]:

X-Axis: Reachable sum of group.        max = sum(all numbers) / 2    (rounded up)
Y-Axis: Count elements in group.       max = count numbers / 2       (rounded up)

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  |  |  | 4|  |  |  |  |  |  |  |  |  |  |       //  4
 2  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
 3  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  |  | 3| 4|  |  |  |  |  |  |  |  |  |  |       //  3
 2  |  |  |  |  |  |  | 3|  |  |  |  |  |  |  |
 3  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  |  | 3| 4|  |  |  |  |  |10|  |  |  |  |       // 10
 2  |  |  |  |  |  |  | 3|  |  |  |  |  |10|10|
 3  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  |  | 3| 4|  |  |  |  |  |10|  |  |  |  |       //  3
 2  |  |  |  |  |  | 3| 3|  |  |  |  |  |10|10|
 3  |  |  |  |  |  |  |  |  |  | 3|  |  |  |  |

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  | 2| 3| 4|  |  |  |  |  |10|  |  |  |  |       //  2
 2  |  |  |  |  | 2| 3| 3|  |  |  |  | 2|10|10|
 3  |  |  |  |  |  |  |  | 2| 2| 3|  |  |  |  |

      1  2  3  4  5  6  7  8  9 10 11 12 13 14
 1  |  | 2| 3| 4| 5|  |  |  |  |10|  |  |  |  |       //  5
 2  |  |  |  |  | 2| 3| 3| 5| 5|  |  | 2|10|10|
 3  |  |  |  |  |  |  |  | 2| 2| 3| 5| 5|  |  |
                                       ^

12 is our lucky number! Backtracing to get the group:

12 - 5 = 7        {5}
 7 - 3 = 4        {5, 3}
 4 - 4 = 0        {5, 3, 4}

The other set can then be calculated: {4,3,10,3,2,5} - {5,3,4} = {10,3,2}

All fields with a number are possible solutions for one bag. Choose the one that is furthest in the bottom right corner.

BTW: It's called the knapsack-problem.

If all weights (w1, ..., wn and W) are nonnegative integers, the knapsack problem can be solved in pseudo-polynomial time using dynamic programming.

Well, you can find a solution to a percentage precision in polynomial time, but to actually find the optimal (absolute minimal difference) solution, the problem is NP-complete. This means that there is no polynomial time solution to the problem. As a result, even with a relatively small list of numbers, it is too compute intensive to solve. If you really need a solution, take a look at some of the approximation algorithms for this.

http://en.wikipedia.org/wiki/Subset_sum_problem

Q. Given a multiset S of integers, is there a way to partition S into two subsets S1 and S2 such that the sum of the numbers in S1 equals the sum of the numbers in S2?

A.Set Partition Problem.

Best of luck approximating. : )

Note that it is also an heuristic and I moved the sort out of the function.

 def g(data):
   sums = [0, 0]
   for pair in zip(data[::2], data[1::2]):
     item1, item2 = sorted(pair)
     sums = sorted([sums[0] + item2, sums[1] + item1])
   print sums

data = sorted([2,3,10,5,8,9,7,3,5,2])
g(data)

It's actually PARTITION, a special case of KNAPSACK.

It is NP Complete, with pseudo-polynomial dp algorithms. The pseudo in pseudo-polynomial refers to the fact that the run time depends on the range of the weights.

In general you will have to first decide if there is an exact solution before you can admit a heuristic solution.

A test case where your method doesn't work is

que = [1, 1, 50, 50, 50, 1000]

The problem is that you're analyzing things in pairs, and in this example, you want all the 50's to be in the same group. This should be solved though if you remove the pair analysis aspect and just do one entry at a time.

Here's the code that does this

def make_teams(que):
    que.sort()
    que.reverse()
    if len(que)%2: que.insert(0,0)
    t1,t2 = [],[]
    while que:
        if abs(len(t1)-len(t2))>=len(que):
            [t1, t2][len(t1)>len(t2)].append(que.pop(0))
        else:
            [t1, t2][sum(t1)>sum(t2)].append(que.pop(0))
    print min(sum(t1),sum(t2)), max(sum(t1),sum(t2)), "\n"

if __name__=="__main__":
    que = [2,3,10,5,8,9,7,3,5,2]
    make_teams(que)
    que = [1, 1, 50, 50, 50, 1000]
    make_teams(que)

This give 27, 27 and 150, 1002 which are the answers that make sense to me.

Edit: In review, I find this to not actually work, though in the end, I'm not quite sure why. I'll post my test code here though, as it might be useful. The test just generates random sequence that have equal sums, puts these together and compares (with sad results).

Edit #2: Based in the example pointed out by Unknown, [87,100,28,67,68,41,67,1], it's clear why my method doesn't work. Specifically, to solve this example, the two largest numbers need to both be added to the same sequence to get a valid solution.

def make_sequence():
    """return the sums and the sequence that's devided to make this sum"""
    while 1:
        seq_len = randint(5, 200)
        seq_max = [5, 10, 100, 1000, 1000000][randint(0,4)]
        seqs = [[], []]
        for i in range(seq_len):
            for j in (0, 1):
                seqs[j].append(randint(1, seq_max))
        diff = sum(seqs[0])-sum(seqs[1])
        if abs(diff)>=seq_max: 
            continue
        if diff<0:
            seqs[0][-1] += -diff
        else:
            seqs[1][-1] += diff
        return sum(seqs[0]), sum(seqs[1]), seqs[0], seqs[1]

if __name__=="__main__":

    for i in range(10):
        s0, s1, seq0, seq1 = make_sequence()
        t0, t1 = make_teams(seq0+seq1)
        print s0, s1, t0, t1
        if s0 != t0 or s1 != t1:
            print "FAILURE", s0, s1, t0, t1

They are obviously looking for a dynamic programming knapsack solution. So after my first effort (a pretty good original heuristic I thought), and my second effort (a really sneaky exact combinatorial solution that worked for shortish data sets, and even for sets up to 100 elements as long as the number of unique values was low), I finally succumbed to peer pressure and wrote the one they wanted (not too hard - handling duplicated entries was the trickiest part - the underlying algorithm I based it on only works if all the inputs are unique - I'm sure glad that long long is big enough to hold 50 bits!).

So for all the test data and awkward edge cases I put together while testing my first two efforts, it gives the same answer. At least for the ones I checked with the combinatorial solver, I know they're correct. But I'm still failing the submission with some wrong answer!

I'm not asking for anyone to fix my code here, but I would be very greatful if anyone can find a case for which the code below generates the wrong answer.

Thanks,

Graham

PS This code does always execute within the time limit but it is far from optimised. i'm keeping it simple until it passes the test, then I have some ideas to speed it up, maybe by a factor of 10 or more.

#include <stdio.h>

#define TRUE (0==0)
#define FALSE (0!=0)

static int debug = TRUE;

//int simple(const void *a, const void *b) {
//  return *(int *)a - *(int *)b;
//}

int main(int argc, char **argv) {
  int p[101];
  char *s, line[128];
  long long mask, c0[45001], c1[45001];
  int skill, players, target, i, j, tests, total = 0;

  debug = (argc == 2 && argv[1][0] == '-' && argv[1][1] == 'd' && argv[1][2] == '\0');

  s = fgets(line, 127, stdin);
  tests = atoi(s);
  while (tests --> 0) {

    for (i = 0; i < 45001; i++) {c0[i] = 0LL;}

    s = fgets(line, 127, stdin); /* blank line */
    s = fgets(line, 127, stdin); /* no of players */
    players = atoi(s);
    for (i = 0; i < players; i++) {s = fgets(line, 127, stdin); p[i] = atoi(s);}

    if (players == 1) {
      printf("0 %d\n", p[0]);
    } else {

    if (players&1) p[players++] = 0; // odd player fixed by adding a single player of 0 strength
    //qsort(p, players, sizeof(int), simple);

    total = 0; for ( i = 0; i < players; i++) total += p[i];
    target = total/2; // ok if total was odd and result rounded down - teams of n, n+1
    mask = 1LL << (((long long)players/2LL)-1LL);

    for (i = 0; i < players; i++) {
      for (j = 0; j <= target; j++) {c1[j] = 0LL;} // memset would be faster
      skill = p[i];
      //add this player to every other player and every partial subset
      for (j = 0; j <= target-skill; j++) {
        if (c0[j]) c1[j+skill] = c0[j]<<1;  // highest = highest j+skill for later optimising
      }
      c0[skill] |= 1; // so we don't add a skill number to itself unless it occurs more than once
      for (j = 0; j <= target; j++) {c0[j] |= c1[j];}
      if (c0[target]&mask) break; // early return for perfect fit!
    }

    for (i = target; i > 0; i--) {
      if (debug || (c0[i] & mask)) {
        fprintf(stdout, "%d %d\n", i, total-i);
        if (debug) {
          if (c0[i] & mask) printf("******** ["); else
          printf("         [");
          for (j = 0; j <= players; j++) if (c0[i] & (1LL<<(long long)j)) printf(" %d", j+1);
          printf(" ]\n");
        } else break;
      }
    }
    }
    if (tests) printf("\n");
  }
  return 0;
}

class Team(object):
    def __init__(self):
        self.members = []
        self.total = 0

    def add(self, m):
        self.members.append(m)
        self.total += m

    def __cmp__(self, other):
        return cmp(self.total, other.total)


def make_teams(ns):
    ns.sort(reverse = True)
    t1, t2 = Team(), Team()

    for n in ns:
        t = t1 if t1 < t2 else t2
        t.add(n)

    return t1, t2

if __name__ == "__main__":
    import sys
    t1, t2 = make_teams([int(s) for s in sys.argv[1:]])
    print t1.members, sum(t1.members)
    print t2.members, sum(t2.members)

>python two_piles.py 1 50 50 100
[50, 50] 100
[100, 1] 101

For performance you save computations by replacing append() and sum() with running totals.

You could tighten your loop up a little by using the following:

def make_teams(que):
    que.sort()
    t1, t2 = []
    while que:
        t1.append(que.pop())
        if sum(t1) > sum(t2):
            t2, t1 = t1, t2
    print min(sum(t1),sum(t2)), max(sum(t1),sum(t2))

Since the lists must me equal the problem isn't NP at all.

I split the sorted list with the pattern t1<-que(1st, last), t2<-que(2nd, last-1) ...

def make_teams2(que):
    que.sort()
    if len(que)%2: que.insert(0,0)
    t1 = []
    t2 = []
    while que:
        if len(que) > 2:
            t1.append(que.pop(0))
            t1.append(que.pop())
            t2.append(que.pop(0))
            t2.append(que.pop())
        else:
            t1.append(que.pop(0))
            t2.append(que.pop())
    print sum(t1), sum(t2), "\n"

Edit: I suppose that this is also a wrong method. Wrong results!

After some thinking, for not too big problem, I think that the best kind of heuristics will be something like:

import random
def f(data, nb_iter=20):
  diff = None
  sums = (None, None)
  for _ in xrange(nb_iter):
    random.shuffle(data)
    mid = len(data)/2
    sum1 = sum(data[:mid])
    sum2 = sum(data[mid:])
    if diff is None or abs(sum1 - sum2) < diff:
      sums = (sum1, sum2)
  print sums

You can adjust nb_iter if the problem is bigger.

It solves all the problem mentioned above mostly all the times.

In an earlier comment I hypothesized that the problem as set was tractable because they had carefully chosen the test data to be compatible with various algorithms within the time alloted. This turned out not to be the case - instead it is the problem constraints - numbers no higher than 450 and a final set no larger than 50 numbers is the key. These are compatible with solving the problem using the dynamic programming solution I put up in a later post. None of the other algorithms (heuristics, or exhaustive enumeration by a combinatorial pattern generator) can possibly work because there will be test cases large enough or hard enough to break those algorithms. It's rather annoying to be honest because those other solutions are more challenging and certainly more fun. Note that without a lot of extra work, the dynamic programming solution just says whether a solution is possible with N/2 for any given sum, but it doesn't tell you the contents of either partition.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow