Merge by Range in R - Applying Loops

https://stackoverflow.com/questions/11892241

25-06-2021
|

Question

I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuccessful in piecing together code to accomplish this. The issue I am having is that the code I'm using compares the files line by line. This is a problem because 1.) One file is much longer than the other file, and 2.) I need the lines in the shorter file to be scanned through every range pair in the longer file - not just the range in the same row.

I have been working with the functions posted in the original question, and I feel like there should be a way to apply it to a more general loop that compares every line in the first file to each line in the second file, but I haven't figured it out yet. If anyone has any suggestions, I would appreciate it.

**** EDITED.

The nature of the data is this: Each range is not necessarily unique, although most are. They are also not of equal size, and some fall completely within others. findInterval therefore produces an error, because the ranges cannot be sorted in order to fall in "non-decending" order.

Here are first 6 lines of each data frame:

file1test <- data.frame(SNP=c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"), BP=c(860269, 369640, 861822, 367934, 706940, 717244))


file2 <- data.frame(Gene=c("E613", "E92", "E49", "E3543", "E11", "E233"), BP_start=c(367640, 621059, 721320, 860260, 861322, 879584), BP_end = c(368634, 622053, 722513, 879955, 879533, 894689))

So, as you can see, the range on the 5th line lies within the range on the 4th line, and two SNPs from the first file fall within the range on the 4th line, but only one falls within the range on the second line.

The first file, which contains the SNPs, has only ~400 lines. However the second file, containing the ranges, has about 20K. What I would like to produce as an output is a data frame containing the lines from the first file (the SNPs) with BPs that fall into the BP range in the second file. If a SNP falls into two ranges, then it would appear twice, etc.

Solution

The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that

con <- textConnection("SNP     BP
rs064   12292
rs319   345367
rs285   700042")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("Gene    BP_start    BP_end
E613    345344      363401
E92     694501      705408
E49     362370      368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")

then create 'IRanges' out of each

library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))

(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')

olaps <- findOverlaps(isnps, igenes)

two of the SNPs overlap

> queryHits(olaps)
[1] 2 3

and they overlap the first and second genes

> subjectHits(olaps)
[1] 1 2

If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as

> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408

Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you

library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))

OTHER TIPS

I believe what you're asking for is a conditional join. They're easy in SQL, and the sqldf package makes it easy to query data frames in R using SQL.

Just pick a version depending on how you want unmatched SNPs handled.

Inner join version:

> sqldf("select * from file1test f1 inner join file2 f2 
+       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")

Output:

     SNP     BP  Gene BP_start BP_end
1 rs2343 860269 E3543   860260 879955
2  rs754 861822 E3543   860260 879955
3  rs754 861822   E11   861322 879533
4  rs854 367934  E613   367640 368634
>

Left Join version:

> sqldf("select * from file1test f1 left join file2 f2 
+       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")

Output:

     SNP     BP  Gene BP_start BP_end
1 rs2343 860269 E3543   860260 879955
2  rs211 369640  <NA>       NA     NA
3  rs754 861822 E3543   860260 879955
4  rs754 861822   E11   861322 879533
5  rs854 367934  E613   367640 368634
6  rs343 706940  <NA>       NA     NA
7  rs626 717244  <NA>       NA     NA
>

Note that you may want to be careful where you place the = if it matters which group a BP will fall in for the case where a BP exactly matches a BP_start or BP_end.

You can merge by range in base by using apply and which. To return only those which match (inner join):

do.call(rbind, apply(file1test, 1, function(x) {
  if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
    cbind(rbind(x), file2[idx,])
  }
}))
#      SNP     BP  Gene BP_start BP_end
#x  rs2343 860269 E3543   860260 879955
#1   rs754 861822 E3543   860260 879955
#2   rs754 861822   E11   861322 879533
#x1  rs854 367934  E613   367640 368634

or all from file1test (left join):

do.call(rbind, apply(file1test, 1, function(x) {
  if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
    cbind(rbind(x), file2[idx,])
  } else {
    cbind(rbind(x), file2[1,][NA,])
  }
}))
#      SNP     BP  Gene BP_start BP_end
#x  rs2343 860269 E3543   860260 879955
#x1  rs211 369640  <NA>       NA     NA
#1   rs754 861822 E3543   860260 879955
#2   rs754 861822   E11   861322 879533
#x2  rs854 367934  E613   367640 368634
#x3  rs343 706940  <NA>       NA     NA
#x4  rs626 717244  <NA>       NA     NA

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow