Pergunta

I have a big dataset where each value is divided into 3 parts [chromosome, start, end]. What is the fastest way to calculate all the unique positions per chromosome, because I have a lot of overlap ranges.

So for example:
[['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 'chr1:54:90', 'chr1:120:180', 'chr3:50:90']]

should result in:
['chr1:5:90', 'chr1:120:180', 'chr3:50:90']

I don't know if there even is an easy way to calculate this? But I figured out its worth the try to ask it here. Down here is a subset of my data.

Thanks in advance,

[['chr9:95149330:95149362', 'chr9:95149330:95149362', 'chr17:70386266:70386304', 'chr17:70386256:70386304', 'chr2:44672786:44672833', 'chr2:44672785:44672833', 'chr2:141966446:141966479', 'chr2:141966446:141966488', 'chr19:18126909:18126938', 'chr19:18126909:18127027', 'chr3:145082003:145082051', 'chr3:145082014:145082121', 'chr6:38835529:38835560', 'chr6:38835529:38835560', 'chr4:120372932:120372986', 'chr4:120372932:120372994', 'chr2:141014019:141014057', 'chr2:141014014:141014057', 'chr18:3445722:3445761', 'chr18:3445722:3445793', 'chr17:72329982:72330015', 'chr17:72329982:72330015', 'chr5:169911920:169911962', 'chr5:169911917:169911962', 'chr4:146482176:146482219', 'chr4:146482176:146482219', 'chr9:104285900:104285935', 'chr9:104285879:104285935', 'chr12:32941976:32942016', 'chr12:32941976:32942028', 'chrX:127923156:127923189', 'chrX:127923156:127923189', 'chr2:9535703:9535755', 'chr2:9535701:9535755', 'chr8:86476618:86476684', 'chr8:86476554:86476642', 'chr9:135756650:135756696', 'chr9:135756650:135756706', 'chr6:103004873:103004932', 'chr6:103004861:103004918', 'chr8:86476618:86476684', 'chr8:86476556:86476648', 'chr1:52280846:52280876', 'chr1:52280845:52280876', 'chr8:86476635:86476685', 'chr8:86476553:86476645', 'chr5:116046573:116046620', 'chr5:116046564:116046615', 'chrX:68039214:68039252', 'chrX:68039214:68039252', 'chr4:181491919:181491953', 'chr4:181491919:181491960', 'chr18:68050122:68050166', 'chr18:68050122:68050166', 'chr2:233985816:233985860', 'chr2:233985808:233985860', 'chr6:17020712:17020750', 'chr6:17020712:17020759', 'chr7:21950625:21950666', 'chr7:21950625:21950666', 'chr12:93292486:93292536', 'chr12:93292481:93292537', 'chr1:246515439:246515472', 'chr1:246515440:246515486', 'chr12:57084093:57084130', 'chr12:57084093:57084134', 'chr1:174801431:174801474', 'chr1:174801431:174801485', 'chr7:92499684:92499734', 'chr7:92499924:92499960', 'chr17:40328527:40328560', 'chr17:40328518:40328560', 'chr8:42944072:42944110', 'chr8:42944073:42944120', 'chr17:29890450:29890499']
Foi útil?

Solução

I agree with jonrsharpe about the general approach, but I think there's a more elegant way to do it.

First, we'll get the ranges for each chromosome (pretty much the same as jonrsharpe, although I like tuples better than lists for the ranges).

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, start, end = s.split(":")
    processed[chr_].append((int(start), int(end)))

Now, we can make the merging much simpler by sorting the list for each chromosome by the start of the range. This provides us with the nice property that if none of the previous ranges overlap with the current range, then we know that any merging we've done on the previous values is final and we won't have to go back to it.

for vals in processed.values():
    vals.sort()
    current = 1
    while current < len(vals):
      if vals[current-1][1] > vals[current][0]:
        # current and previous ranges overlap, so merge previous and current values.
        vals[current-1:current+1] = [(vals[current-1][0], vals[current][1])]
        # Because we reduced the number of values in the list by 1,
        # current now points at the next interesting value.
      else:
        current += 1 # We didn't merge, so we must increment current

Now we can put it back together as jonrsharpe does:

final = []
for key, vals in processed.items():
    for start, end in vals:
        final.append("%s:%s:%s" % (key, str(start), str(end)))

This also gives final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']

Outras dicas

I would do this in three steps:

  1. Split out the ranges for each chromosome;
  2. Extract the contiguous ranges; and
  3. Assemble the outputs as required ("chr:start:end").

Step one:

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, pos = s.split(":", 1)
    processed[chr_].append(list(map(int, pos.split(":"))))

For

data == ['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 
         'chr1:54:90', 'chr1:120:180', 'chr3:50:90']

this gives

processed == defaultdict(<class 'list'>, 
                         {'chr3': [[50, 80], [50, 90]], 
                          'chr1': [[10, 60], [5, 70], [54, 90], [120, 180]]})

We can now group these together based on overlaps

for vals in processed.values():
    finished = False
    while not finished:
        finished = True
        for i, (s1, e1) in enumerate(vals):
            for s2, e2 in vals[i+1:]:
                if ((s2 <= s1 and e2 >= s1) or
                    (s2 <= e1 and e2 >= e1)):
                   vals[i][0] = min(s1, s2)
                   vals[i][1] = max(e1, e2)
                   vals.remove([s2, e2])
                   finished = False

Which gets us to:

processed == defaultdict(<class 'list'>, 
                         {'chr3': [[50, 90]], 
                          'chr1': [[10, 90], [120, 180]]})

Now you can put it back together:

final = []
for key, vals in processed.items():
    for start, end in vals:
        final.append(":".join(map(str, (key, start, end))))

Which leaves:

final == ['chr3:50:90', 'chr1:10:90', 'chr1:120:180']
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top