Question
I have two data frames, a and b. For each row in b, I want to find all start,end in a that are within the start,end of b, then sum the differences of start,end over this particular subset of a, and store the result as a new column in b. I'm using a for loop, but is there a more efficient way to do this with apply in R?
# data.frame a
a <- data.frame(chrom=1L, start=as.integer(c(2,4,7,11)), end=as.integer(c(3,6,9,15)))
# chrom start end
# 1 2 3
# 1 4 6
# 1 7 9
# 1 11 15
# data.frame b
b <- data.frame(chr=1L, start=as.integer(c(2,11)), end=as.integer(c(10,20)))
# chr start end
# 1 2 10
# 1 11 20
# code
result=c()
for (i in seq_len(nrow(b))) {
  # find start,end in a that are within row i of b
  # (the chromosome column is named chr in b, chrom in a)
  a_subset = a[which(a$chrom == b$chr[i] &
                     a$start >= b$start[i] &
                     a$end <= b$end[i]), ]
  result = append(result, sum(a_subset$end - a_subset$start))
}
c = cbind(b, result)
# data.frame c
# chr start end result
# 1 2 10 5
# 1 11 20 4
Solution
Easy with sqldf, annoying with base R:
R> require(sqldf)
R> b$id <- 1:nrow(b)
R> sqldf("select id, b.chr, sum(a.end - a.start) as diff
          from a, b
          where a.chrom = b.chr and a.start >= b.start and b.end >= a.end
          group by id")
  id chr diff
1  1   1    5
2  2   1    4
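For comparison, the same containment sum can be written in base R without an explicit loop. This is only a minimal sketch, assuming the sample data from the question; sum_contained is a hypothetical helper name introduced here for illustration, and mapply() is used as one member of the apply family rather than as the only option.

```r
# Sample data from the question (b keeps its `chr` column name).
a <- data.frame(chrom = 1L,
                start = c(2L, 4L, 7L, 11L),
                end   = c(3L, 6L, 9L, 15L))
b <- data.frame(chr   = 1L,
                start = c(2L, 11L),
                end   = c(10L, 20L))

# Hypothetical helper: total length of the intervals in `a` that lie
# entirely inside one interval (chr, start, end) taken from `b`.
sum_contained <- function(chr, start, end) {
  inside <- a$chrom == chr & a$start >= start & a$end <= end
  sum(a$end[inside] - a$start[inside])
}

# mapply() walks the three columns of b in parallel, one row at a time.
b$result <- unname(mapply(sum_contained, b$chr, b$start, b$end))
b
#   chr start end result
# 1   1     2  10      5
# 2   1    11  20      4
```

This still compares every row of b against every row of a, so it is tidier than the for loop rather than fundamentally faster. A row of b that contains no interval from a gets 0 here, whereas the sqldf query drops that row from its output.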