Python - Overlapping Ranges - Determine unique positions

Question 1

I agree with jonrsharpe about the general approach, but I think there's a more elegant way to do it.

First, we'll get the ranges for each chromosome (pretty much the same as jonrsharpe, although I like tuples better than lists for the ranges).

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, start, end = s.split(":")
    processed[chr_].append((int(start), int(end)))

Now we can make the merging much simpler by sorting each chromosome's list by the start of the range. Sorting gives us a useful property: once the current range does not overlap the range before it, no later range can overlap any of the earlier ones either, so everything merged so far is final and we never have to go back to it.

for vals in processed.values():
    vals.sort()
    current = 1
    while current < len(vals):
        prev_start, prev_end = vals[current-1]
        start, end = vals[current]
        if prev_end >= start:
            # Current and previous ranges overlap (treating ends as inclusive,
            # so a shared endpoint counts); merge them into one range.
            # max() matters when the previous range contains the current one.
            vals[current-1:current+1] = [(prev_start, max(prev_end, end))]
            # Because we reduced the number of values in the list by 1,
            # current now points at the next interesting value.
        else:
            current += 1  # We didn't merge, so we must increment current

Now we can put it back together as jonrsharpe does:

final = []
for key, vals in processed.items():
    for start, end in vals:
        final.append("%s:%d:%d" % (key, start, end))

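This also gives final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180'], up to the order of the chromosomes, which follows the dict's iteration order (on Python 3.7+, the chr1 entries come first).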
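
For reference, the three steps above can be bundled into a single function. This is a minimal sketch rather than the code from the answer: the name merge_ranges is made up for illustration, it assumes every string has the form "chr:start:end" with inclusive integer ends, and it builds a new list instead of editing the sorted one in place. The sample input is the data shown in the other answer.

```python
from collections import defaultdict


def merge_ranges(data):
    """Merge overlapping "chr:start:end" ranges (hypothetical helper).

    Assumes inclusive integer coordinates, so ranges that share an
    endpoint are treated as overlapping.
    """
    by_chr = defaultdict(list)
    for s in data:
        chr_, start, end = s.split(":")
        by_chr[chr_].append((int(start), int(end)))

    final = []
    for chr_, vals in by_chr.items():
        merged = []
        for start, end in sorted(vals):
            if merged and merged[-1][1] >= start:
                # Overlaps the last merged range: extend it if needed.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                # No overlap: earlier ranges are final, start a new one.
                merged.append((start, end))
        final.extend("%s:%d:%d" % (chr_, s, e) for s, e in merged)
    return final


data = ['chr1:10:60', 'chr1:5:70', 'chr3:50:80',
        'chr1:54:90', 'chr1:120:180', 'chr3:50:90']
result = merge_ranges(data)
print(sorted(result))
```

Appending to a fresh list avoids the slice assignment, at the cost of holding a second list per chromosome; for inputs of this size either choice should be fine.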

Question 2

I would do this in three steps:

1. Split out the ranges for each chromosome;
2. Extract the contiguous ranges; and
3. Assemble the outputs as required ("chr:start:end").

Step one:

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, pos = s.split(":", 1)
    processed[chr_].append(list(map(int, pos.split(":"))))

For

data == ['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 
         'chr1:54:90', 'chr1:120:180', 'chr3:50:90']

this gives

processed == defaultdict(<class 'list'>, 
                         {'chr3': [[50, 80], [50, 90]], 
                          'chr1': [[10, 60], [5, 70], [54, 90], [120, 180]]})

We can now group these together based on overlaps. After each merge we restart the scan, because the merged range may now overlap ranges that were already checked:

for vals in processed.values():
    finished = False
    while not finished:
        finished = True
        for i in range(len(vals)):
            for j in range(i + 1, len(vals)):
                s1, e1 = vals[i]
                s2, e2 = vals[j]
                # Inclusive ranges overlap when each one starts
                # no later than the other one ends.
                if s1 <= e2 and s2 <= e1:
                    vals[i] = [min(s1, s2), max(e1, e2)]
                    del vals[j]
                    finished = False
                    break
            if not finished:
                break

Which gets us to:

processed == defaultdict(<class 'list'>, 
                         {'chr3': [[50, 90]], 
                          'chr1': [[5, 90], [120, 180]]})

Now you can put it back together:

final = []
for key, vals in processed.items():
    for start, end in vals:
        final.append(":".join(map(str, (key, start, end))))

Which leaves (up to the order of the chromosomes, which follows the dict's iteration order):

final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']
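
One way to sanity-check a merge like this is to compare it with a brute-force expansion of every range into its individual positions. The sketch below is only an illustration: covered_positions is a hypothetical helper, it assumes inclusive integer ends, and it is practical only for small ranges because it materializes every position. Here merged is the expected merge of the sample data.

```python
def covered_positions(ranges):
    """Return {chromosome: set of positions} for "chr:start:end" strings."""
    covered = {}
    for s in ranges:
        chr_, start, end = s.split(":")
        # range() excludes its stop value, so add 1 for an inclusive end.
        covered.setdefault(chr_, set()).update(range(int(start), int(end) + 1))
    return covered


data = ['chr1:10:60', 'chr1:5:70', 'chr3:50:80',
        'chr1:54:90', 'chr1:120:180', 'chr3:50:90']
merged = ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']

# A correct merge covers exactly the same positions as the input.
same_coverage = covered_positions(data) == covered_positions(merged)
# Number of unique positions per chromosome.
unique_counts = {c: len(p) for c, p in covered_positions(data).items()}
print(same_coverage, unique_counts)
```

If the comparison comes out False, the merged list has dropped or added positions somewhere, which is usually a sign of a bug in the overlap test.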