Question

I would like to group lines based on the two-way interconnection between the "type1" column and the "type2" column. The logic is: if a string in "type1" is on the same line as a string in "type2", they are in the same group. Moreover, if that "type2" string appears on more than one line, all of those lines are in the same group.

Please take a look at the first 3 lines: "gain_765" and "loss_1136" are related. However, "loss_1136" is also related to "gain_766", and subsequently "gain_766" is related to "loss_765". So this is my group: 1- "gain_765", 2- "loss_1136", 3- "gain_766", 4- "loss_765".

Within each group I want to make a new line with the string in "chrx" from the first line of the group; the lowest value across "startx" and "starty"; and the largest value across "endx" and "endy". Here is an example of my data:

type1     chrx   startx   endx     chry   starty   endy     type2
gain_765  chr15  9681969  9685418  chr15  9660912  9712719  loss_1136
gain_766  chr15  9706682  9852347  chr15  9660912  9712719  loss_1136
gain_766  chr15  9706682  9852347  chr15  9765125  9863990  loss_765
gain_780  chr20  9706682  9852347  ch20   9765125  9863990  loss_769
gain_760  chr15  9706682  9852347  chr15  9660912  9712719  loss_1137
gain_760  chr15  9706682  9852347  chr15  9765125  9863990  loss_763

For the first group (lines 1 to 3), this is the expected output:

 chr       start     end
 chr15    9660912   9863990

Now, please take a look at line 4: "gain_780" is related only to "loss_769". For this group (just line 4), the expected output is as follows:

 chr       start     end
chr20     9706682   9863990

Now, for lines 5 and 6, the group is formed by "gain_760", "loss_1137" and "loss_763". In this last case the expected output is:

  chr       start     end
 chr15     9660912   9863990

However, I have many cases like these across thousands of lines. Therefore, I need all the results in a single output, like this:

  chr       start     end
 chr15    9660912   9863990
 chr20    9706682   9863990
 chr15    9660912   9863990

Cheers.


Solution

You can do it as follows:

library(igraph)

DF <- read.csv(text=
"type1,chrx,startx,endx,chry,starty,endy,type2
gain_765,chr15,9681969,9685418,chr15,9660912,9712719,loss_1136
gain_766,chr15,9706682,9852347,chr15,9660912,9712719,loss_1136
gain_766,chr15,9706682,9852347,chr15,9765125,9863990,loss_765
gain_780,chr20,9706682,9852347,ch20,9765125,9863990,loss_769
gain_760,chr15,9706682,9852347,chr15,9660912,9712719,loss_1137
gain_760,chr15,9706682,9852347,chr15,9765125,9863990,loss_763",
stringsAsFactors=FALSE)

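# create a graph with the relations type1 --> type2
# you can visualize it using: plot(g)
# (graph_from_data_frame() is the current name of graph.data.frame())
g <- graph_from_data_frame(DF[,c('type1','type2')])

# decompose into the weakly connected components
# (decompose() is the current name of decompose.graph())
subgraphs <- decompose(g,mode="weak")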

# create the sub data.frames using the subgraphs vertices
subDFs <- lapply(subgraphs,
                FUN=function(sg){ 
                      v <- V(sg)$name; 
                      DF[DF$type1 %in% v | DF$type2 %in% v,];
                    }
                )

# create the single-line data.frames for each group
subRes <- lapply(subDFs,
                 FUN=function(sd){
                       data.frame(chrx=sd$chrx[1], 
                                  start=min(c(sd$startx,sd$starty)), 
                                  end=max(c(sd$endx,sd$endy)))
                     }
                )

# merge the results into one single data.frame
res <- do.call(rbind.data.frame,subRes)

res
#    chrx   start     end
# 1 chr15 9660912 9863990
# 2 chr20 9706682 9863990
# 3 chr15 9660912 9863990

Steps 2 and 3 (the creation of subgraphs and subDFs) can be combined into a single step by putting the code of the function from the 3rd step into the function of the 2nd step.
I left them separate for clarity.
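
As a rough sketch of a more compact variant (not the code above, but a swapped-in approach): instead of decomposing the graph into subgraphs, one could label every vertex with its connected-component id via components() and then split the rows by that label. The column name "group" and the variable "memb" are illustrative choices, and the function names assume a reasonably recent igraph version.

```r
library(igraph)

DF <- read.csv(text=
"type1,chrx,startx,endx,chry,starty,endy,type2
gain_765,chr15,9681969,9685418,chr15,9660912,9712719,loss_1136
gain_766,chr15,9706682,9852347,chr15,9660912,9712719,loss_1136
gain_766,chr15,9706682,9852347,chr15,9765125,9863990,loss_765
gain_780,chr20,9706682,9852347,ch20,9765125,9863990,loss_769
gain_760,chr15,9706682,9852347,chr15,9660912,9712719,loss_1137
gain_760,chr15,9706682,9852347,chr15,9765125,9863990,loss_763",
stringsAsFactors=FALSE)

# undirected graph of the type1 -- type2 relations
g <- graph_from_data_frame(DF[,c('type1','type2')], directed=FALSE)

# component id per vertex, named explicitly by vertex name
memb <- setNames(components(g)$membership, V(g)$name)

# both ends of a row are in the same component, so type1 is enough
# to look up the group of the whole row ("group" is an illustrative name)
DF$group <- memb[DF$type1]

# one summary line per group
res <- do.call(rbind,
               lapply(split(DF, DF$group),
                      FUN=function(sd){
                            data.frame(chr=sd$chrx[1],
                                       start=min(sd$startx, sd$starty),
                                       end=max(sd$endx, sd$endy))
                          }
                      )
               )

res
```

With thousands of lines this avoids building one subgraph object per group, although whether that matters in practice depends on the data.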

Licensed under: CC-BY-SA with attribution