Question

This question is an extension of this one.

There is now an additional column in fileA that needs to be taken into account when extracting position information from the interval. For instance, in the example below, positions 123-78000 at location X are labelled romeo whereas the same positions 123-78000 at location Y are labelled mario:

location  start     end      value    label
X         123       78000    0        romeo    #value 0 at positions X(123 to 77999 included).
X         78000     78004    56       romeo    
X         78004     78005    12       romeo    #value 12 at position X(78004).
X         78006     78008    21       juliet   
X         78008     78056    8        juliet  
Y         123       78000    1        mario    #value 1 at positions Y(123 to 77999 included).
Y         78000     78004    24       mario    
Y         78004     78005    4        mario    #value 4 at position Y(78004).
Y         78006     78008    12       luigi   
Y         78008     78056    14       luigi  

On the other hand, fileB defines the intervals that actually interest me:

location  start     end      label
X         77998     78005    romeo
X         78007     78012    juliet
Y         77998     78005    mario
Y         78007     78012    luigi

The labels in fileA were originally pulled in from fileB, so it is safe to assume that the labels are always equivalent for overlapping intervals.

I am trying to extract the information for all the individual positions in fileA that correspond to the intervals in fileB (a process which I will call deconvolution, for lack of a better word). This time, I would like to do that while also taking location into account: recovering location from position alone is unsafe, because the same position numbers may appear in several locations. The output fileC should look like this:

location  position  value   label
X         77998     0       romeo
X         77999     0       romeo
X         78000     56      romeo
X         78001     56      romeo
X         78002     56      romeo
X         78003     56      romeo
X         78004     12      romeo   
X         78007     21      juliet
X         78008     8       juliet
X         78009     8       juliet
X         78010     8       juliet
X         78011     8       juliet
Y         77998     1       mario
Y         77999     1       mario
Y         78000     24      mario
Y         78001     24      mario
Y         78002     24      mario
Y         78003     24      mario
Y         78004     4       mario   
Y         78007     12      luigi
Y         78008     14      luigi
Y         78009     14      luigi
Y         78010     14      luigi
Y         78011     14      luigi

I thought I would be able to implement this myself from the solution to my previous question, but I am stuck. In particular, I don't know how to incorporate the location information alongside the position information in this part:

# create sequence of positions
s <- unlist(apply(B, MARGIN=1, FUN=function(x) seq(x[2], as.numeric(x[3])-1)))

Thank you for your time.


Solution

This seems to produce your sample output.

# It is essential that there be NO FACTORS
A <- read.table("fileA.txt", header=TRUE, stringsAsFactors=FALSE)
B <- read.table("fileB.txt", header=TRUE, stringsAsFactors=FALSE)

# build template with one row per position in each interval of B (end is exclusive);
# data.frame() rather than cbind() keeps position numeric instead of character
template <- do.call(rbind, lapply(seq_len(nrow(B)),
                    function(i) data.frame(location=B$location[i],
                                           position=seq(B$start[i], B$end[i]-1),
                                           label=B$label[i],
                                           stringsAsFactors=FALSE)
))
# pair each row of A with every position sharing its location and label, return as C
# (inner join: rows without a match would only add NA positions)
C <- merge(A, template, by=c("location","label"))

# keep only the positions that fall inside the row's own [start, end) interval
is.between <- function(x,low,hi) x>=low & x<=hi
C <- C[is.between(C$position,C$start,C$end-1),]
C <- C[,c("location","position","value","label")]
C
#    location position value  label
# 1         X    78007    21 juliet
# 7         X    78008     8 juliet
# 8         X    78009     8 juliet
# 9         X    78010     8 juliet
# 10        X    78011     8 juliet
# 11        X    77998     0  romeo
# 12        X    77999     0  romeo
# 20        X    78000    56  romeo
# 21        X    78001    56  romeo
# 22        X    78002    56  romeo
# 23        X    78003    56  romeo
# 31        X    78004    12  romeo
# ...
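
Note that the merge pairs every row of A with every position in the same location/label group before filtering, so the intermediate table can get large on real data. As a rough alternative sketch (not part of the original answer), each A interval can instead be clipped to the B interval it overlaps and only the surviving positions expanded. The helper name `deconvolve` is hypothetical, and the data below is typed in from the question so that the example runs on its own:

```r
# Sample data copied from the question (trailing comments dropped)
A <- read.table(text = "location start end value label
X 123 78000 0 romeo
X 78000 78004 56 romeo
X 78004 78005 12 romeo
X 78006 78008 21 juliet
X 78008 78056 8 juliet
Y 123 78000 1 mario
Y 78000 78004 24 mario
Y 78004 78005 4 mario
Y 78006 78008 12 luigi
Y 78008 78056 14 luigi", header = TRUE, stringsAsFactors = FALSE)

B <- read.table(text = "location start end label
X 77998 78005 romeo
X 78007 78012 juliet
Y 77998 78005 mario
Y 78007 78012 luigi", header = TRUE, stringsAsFactors = FALSE)

# deconvolve() is a hypothetical helper name used only for this illustration
deconvolve <- function(A, B) {
  pieces <- lapply(seq_len(nrow(B)), function(i) {
    # rows of A with the same location and label that overlap interval i of B
    a <- A[A$location == B$location[i] & A$label == B$label[i] &
           A$start < B$end[i] & A$end > B$start[i], ]
    do.call(rbind, lapply(seq_len(nrow(a)), function(j) {
      # clip the A interval to the B interval; both are half-open [start, end)
      lo <- max(a$start[j], B$start[i])
      hi <- min(a$end[j], B$end[i]) - 1
      data.frame(location = a$location[j], position = seq(lo, hi),
                 value = a$value[j], label = a$label[j],
                 stringsAsFactors = FALSE)
    }))
  })
  do.call(rbind, pieces)
}

C <- deconvolve(A, B)
```

Because the pieces are built in the order of the rows in B, the result comes out in the same order as the fileC shown in the question, with no sorting step needed.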
Licensed under: CC-BY-SA with attribution