Question

I have a huge dataframe df which includes information about overlapping intervals (A) and (B) and on which chromosome (chrom) they were located. There is also information about a value (level of gene expression) observed over interval (A).

chrom value    Astart      Aend    Bstart      Bend
 chr1     0         0  54519752     17408     17431
 chr1     0         0  54519752     17368     17391
 chr1     0         0  54519752    567761    567783
chr11     0         2  93466832    568111    568133
chr11     0         2  93466832    568149    568171
chr11     0         2  93466832   1880734   1880756
chr11     4  93466844  93466880  93466856  93466878
chr11     2  93466885 135006516  93466889  93466911
chr11     2  93466885 135006516  94199710  94199732

Note that the same interval may appear several times. For instance, an interval (B) will have been reported twice if it overlapped with two (A) intervals:

Astart(1)=========================Aend(1)  Astart(2)========================Aend(2)
          Bstart(1)=======================================Bend(1)

chrom value Astart   Aend  Bstart  Bend
chr1      0      0     25      15    35    #A(1) and B(1) overlap
chr1      1     28     45      15    35    #A(2) and B(1) overlap

Likewise, an interval (A) will have been reported two or more times if it overlapped with two or more (B) intervals:

Astart(3)===================================================================Aend(3)
          Bstart(2)=========Bend(2)  Bstart(3)===========Bend(3) Bstart(4)===============Bend(4)

chrom value Astart   Aend  Bstart  Bend
chr4      0     10    100      15    25    #A(3) and B(2) overlap
chr4      0     10    100      30    75    #A(3) and B(3) overlap
chr4      3     10    100      80   120    #A(3) and B(4) overlap

My goal is to output all the individual positions from intervals (B) and the corresponding values from (A). I have a piece of code that beautifully outputs all the relevant positions in (B) (ans is the dataframe shown above):

position <- unlist(mapply(seq, ans$Bstart, ans$Bend - 1))
> head(position)
[1] 17408 17409 17410 17411 17412 17413

The problem is that this output alone is not enough to recover which chromosome each position came from. I need chromosome AND position together when I list these positions, because the same position integer may occur on several chromosomes. So I can't afterwards just run something like for position %in% range(Astart, Aend) output $chrom, $value (dummy code).

How can I retrieve (chrom, position, value) at the same time?

The expected result would be something like this:

> head(expected_result)
chrom    position   value
chr1     17408      0
chr1     17409      0
chr1     17410      0
chr1     17411      0
chr1     17412      0
chr1     17413      0
#skipping some lines to show another part of the dataframe
chr11    93466856   4
chr11    93466857   4

Solution

A call to ddply might be more elegant, but the logic would be the same:

dfA = read.table(textConnection("chrom value    Astart      Aend    Bstart      Bend
 chr1     0         0  54519752     17408     17431
 chr1     0         0  54519752     17368     17391
 chr1     0         0  54519752    567761    567783
chr11     0         2  93466832    568111    568133
chr11     0         2  93466832    568149    568171
chr11     0         2  93466832   1880734   1880756
chr11     4  93466844  93466880  93466856  93466878
chr11     2  93466885 135006516  93466889  93466911
chr11     2  93466885 135006516  94199710  94199732"), header = TRUE)


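# Expand each row into one line per position in Bstart..(Bend - 1).
# apply() passes each row as a character vector, hence the as.numeric() calls.
dfB = as.data.frame(do.call(rbind,
        apply(dfA, MARGIN = 1, FUN = function(x) {
          cbind(chrom    = x[['chrom']],
                position = seq(as.numeric(x[['Bstart']]),
                               as.numeric(x[['Bend']]) - 1),
                value    = x[['value']])
        })), stringsAsFactors = FALSE)
# cbind() built character matrices, so every column is still character here:
lapply(dfB, typeof)
# Convert the numeric columns back.
dfB$position = as.numeric(dfB$position)
dfB$value    = as.numeric(dfB$value)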
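
For a huge dataframe, a vectorised variant may be worth trying. The sketch below is not part of the original answer: it repeats each row index once per position with rep() and rebuilds the positions with sequence(), so chrom and value stay attached to every position and keep their types. It assumes Bend > Bstart in every row and uses a small three-row sample of the data; the names n, row_id and result are illustrative. It also sidesteps a quirk of the apply() approach, which returns a matrix instead of a list when every interval happens to have the same length.

```r
# Small sample taken from the question's data (Astart/Aend are not needed here).
dfA <- data.frame(
  chrom  = c("chr1", "chr11", "chr11"),
  value  = c(0, 4, 2),
  Bstart = c(17408, 93466856, 93466889),
  Bend   = c(17431, 93466878, 93466911),
  stringsAsFactors = FALSE
)

# Number of positions per row; Bend itself is excluded, as in seq(Bstart, Bend - 1).
# Assumption: Bend > Bstart for every row.
n <- dfA$Bend - dfA$Bstart

# Repeat each row index once per position it contributes.
row_id <- rep(seq_len(nrow(dfA)), times = n)

result <- data.frame(
  chrom    = dfA$chrom[row_id],
  # sequence(n) yields 1..n[1], 1..n[2], ...; shift it to start at Bstart.
  position = dfA$Bstart[row_id] + sequence(n) - 1,
  value    = dfA$value[row_id],
  stringsAsFactors = FALSE
)

head(result)
```

Because no per-row function is called, this should scale better than the apply() version, though that has not been benchmarked here.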
Licensed under: CC-BY-SA with attribution