Merge dataframes based on overlapping genomic ranges

https://stackoverflow.com/questions/19146152

30-06-2022

Question

I have two files:

anno

  chromosome position functionGVS
1      chr22 16050036  intergenic
2      chr22 16050039  intergenic
3      chr22 16050094  intergenic
4      chr22 16050097  intergenic
5      chr22 16050109  intergenic
6      chr22 16050115  intergenic

huvec

    chr    start      end function
1 chr22 16050000 16051244  R
2 chr22 16051244 16051521  T
3 chr22 16051521 16060433  R
4 chr22 16060433 16060582  T
5 chr22 16060582 16080564  R
6 chr22 16080564 16082420  T

I am trying to find overlapping regions such that the anno$position should fall within the range of huvec$start & huvec$end. Here is my code:

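library(GenomicRanges)

# huvec columns are named chr, start and end (see the table above)
gr.huvec <- with(huvec, GRanges(chr, IRanges(start = start, end = end)))

gr.anno <- GRanges(seqnames = anno$chromosome, ranges = IRanges(start = anno$position, width = 1))

hits <- findOverlaps(gr.huvec, gr.anno)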

Now that I have the query hits and subject hits, how can I assign huvec$function to anno based on the overlapping regions? In my case, every position in anno$position falls within the first start/end range of huvec, so I want to assign the associated huvec$function, i.e. 'R', to a new column in anno. Any suggestions?

Solution

I figured another way out, thought it could be of help to others as well:

anno[subjectHits(hits), 4] <- huvec[queryHits(hits), 4]

I checked the result and it comes out correct. Honestly, though, I am not sure why this works, i.e. how it finds the corresponding hits.
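As far as I can tell, the assignment works because queryHits(hits) and subjectHits(hits) are parallel index vectors: pair i says that row queryHits(hits)[i] of the query (huvec here) overlaps row subjectHits(hits)[i] of the subject (anno here). Below is a minimal sketch on a trimmed-down copy of the data. The third anno position and the column name `state` are assumptions made for illustration; `state` stands in for `function`, which is a reserved word in R.

```r
library(GenomicRanges)  # Bioconductor package; also attaches IRanges

# Trimmed-down copies of the two tables (illustrative values only)
huvec <- data.frame(
  chr   = "chr22",
  start = c(16050000L, 16051244L, 16051521L),
  end   = c(16051244L, 16051521L, 16060433L),
  state = c("R", "T", "R"),   # hypothetical name for the 'function' column
  stringsAsFactors = FALSE
)
anno <- data.frame(
  chromosome = "chr22",
  position   = c(16050036L, 16050094L, 16051300L),
  stringsAsFactors = FALSE
)

gr.huvec <- GRanges(huvec$chr, IRanges(start = huvec$start, end = huvec$end))
gr.anno  <- GRanges(anno$chromosome, IRanges(start = anno$position, width = 1))

# query = huvec, subject = anno
hits <- findOverlaps(gr.huvec, gr.anno)

# Each row of this table is one overlap: which huvec row covers which anno row
pairs <- data.frame(huvec_row = queryHits(hits), anno_row = subjectHits(hits))

# Copy the label of the overlapping huvec row into the matching anno row;
# anno rows with no overlap keep NA
anno$state <- NA_character_
anno$state[subjectHits(hits)] <- huvec$state[queryHits(hits)]
```

Printing `pairs` shows the row-to-row mapping directly. One caveat: because each interval's end equals the next interval's start, a position sitting exactly on a boundary overlaps two huvec rows, and the later assignment wins.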

Other tips

Your sample data for anno all falls in the first interval, but I think this should do the trick:

anno$`function` <- huvec$`function`[cut(anno$position, huvec$start, labels=FALSE)]

One issue is that this returns NA for positions in the final interval, because huvec$start alone supplies no upper break for it. To fix that, replace huvec$start with unique(c(huvec$start, huvec$end)).
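To see the NA issue and the fix side by side, here is a small base R sketch. It assumes the intervals are contiguous and sorted, and that the intended fix is to combine starts and ends with c() before calling unique(). The positions and the column name `state` (standing in for `function`) are made up for illustration.

```r
# Three contiguous intervals, as in the first rows of huvec
huvec <- data.frame(
  start = c(16050000L, 16051244L, 16051521L),
  end   = c(16051244L, 16051521L, 16060433L),
  state = c("R", "T", "R"),   # hypothetical name for the 'function' column
  stringsAsFactors = FALSE
)

# One illustrative position per interval
position <- c(16050036L, 16051300L, 16055000L)

# Starts only: 3 breaks define just 2 bins, so the last interval has no bin
idx_start_only <- cut(position, huvec$start, labels = FALSE)

# Starts plus the final end: 4 breaks define 3 bins, one per huvec row
breaks   <- unique(c(huvec$start, huvec$end))
idx_full <- cut(position, breaks, labels = FALSE)

state_start_only <- huvec$state[idx_start_only]
state_full       <- huvec$state[idx_full]
```

Here `state_start_only` ends in NA while `state_full` has a label for every position. Note that cut() uses right-closed bins by default, so a position exactly equal to a start value is assigned to the previous interval (or NA for the very first start).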

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow