Question

I have a very large vector of CIGAR strings:

my.vector = c("44M2D1I","32M465N3M", "3S4I3D45N65M")

that I'd like to transform into a vector of split CIGARs. The logic is as follows: whenever I find a number followed by an "N", I split there. That is why "32M465N3M" splits into "32M", "465N", "3M"; "3S4I3D45N65M" splits into "3S4I3D", "45N", "65M"; and "44M2D1I" does not get split because it contains no "N".

my.vector.split = c("44M2D1I", "32M", "465N", "3M", "3S4I3D", "45N", "65M")

My vector is very large so ideally I'd like to use the parallel capabilities of the cluster. I'd like to use mclapply with ncores.

Ideally, I'd like to define something like this:

 my.vector.split = unlist(mclapply(my.vector, my.splitting.function, mc.cores = ncores))

where the length of my.vector.split is length(my.vector) + (number of Ns)*2.

Note: the HPC cluster I am using does not have the latest Bioconductor installed, so I cannot use cigarToRleList and the other nice CIGAR-manipulation tools.


Solution

This should be applicable. Details will vary depending on how you set up your cluster, but basically this returns a list with one named character vector per input CIGAR. If you want a single flat vector, wrap unlist() around the whole result:

 # bracket each <digits>N run with commas, then let read.table split on them
 lapply(gsub("([[:digit:]]+N)", ",\\1,", my.vector),
        function(x) unlist(read.table(text = x, sep = ",", colClasses = "character")))
#------------
[[1]]
       V1 
"44M2D1I" 

[[2]]
    V1     V2     V3 
 "32M" "465N"   "3M" 

[[3]]
      V1       V2       V3 
"3S4I3D"    "45N"    "65M" 
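Since the question specifically asks for mclapply, here is one way to wrap the same regex into a splitting function and run it in parallel. This is a sketch, not a benchmarked implementation: the ncores value is a placeholder you should set yourself, and strsplit is swapped in for read.table to avoid the per-element data-frame overhead (mc.cores > 1 requires a Unix-alike, which an HPC node normally is):

```r
library(parallel)

my.vector <- c("44M2D1I", "32M465N3M", "3S4I3D45N65M")

# Insert a comma delimiter around each <digits>N run, then split on it.
# Caveat: if a CIGAR ever *begins* with an N run, the leading empty
# string produced by strsplit would need to be filtered out.
my.splitting.function <- function(x) {
  strsplit(gsub("([[:digit:]]+N)", ",\\1,", x), ",", fixed = TRUE)[[1]]
}

ncores <- 2  # placeholder; set to the cores available on your node
my.vector.split <- unlist(mclapply(my.vector, my.splitting.function,
                                   mc.cores = ncores))
# yields c("44M2D1I", "32M", "465N", "3M", "3S4I3D", "45N", "65M")
```

The resulting length matches the expectation above: length(my.vector) plus two extra elements per "N" run.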
Licensed under: CC-BY-SA with attribution