Using strsplit to split a variable three ways

https://stackoverflow.com/questions/21467219

r
strsplit

05-10-2022

Question

I have a variable that I would like to split. Each row is different: it contains either one string expression, two or three string expressions separated by ",", or nothing at all.

For example:

     indel
row1 +1C
row2 +1C,+2CC
row3 0
row4 +1C,+2CC,-1C

Essentially, what I want to do is create three different variables, one for each of the three possible string expressions. Of course, some rows will have only two, one, or none.

I have been able to split the variable and create two new variables for the first two string expressions using:

mito$indel1 <- sapply(strsplit(as.character(mito$indel),","),function(x) x[1])
mito$indel2 <- sapply(strsplit(as.character(mito$indel),","),function(x) x[2])

But of course, there is a third string expression. I was thinking of creating a temporary indel2 variable and then splitting it again to make the third, but the problem with the R script above is that it creates the variables as:

     indel         Indel1    Indel2
row1 +1C           +1C       NA
row2 +1C,+2CC      +1C       +2CC
row3 0             0         NA
row4 +1C,+2T,-1C   +1C       +2T

I'm sure this has to do with the second "," in the string, and R is getting confused. Is there a way to overcome this without having to edit the entire variable for each row?

I've also tried the following with no luck:

mito$indel2 <- sapply(strsplit(sapply(strsplit(as.character(mito$indel),","),function(x) x[2]),","),function(x) x[1])
mito$indel3 <- sapply(strsplit(sapply(strsplit(as.character(mito$indel),","),function(x) x[2]),","),function(x) x[2])

Any help will be greatly appreciated.

Solution

You could also use read.table for this. Here dat is a data frame whose column V1 holds the comma-separated strings:

read.table(text=as.character(dat$V1), sep=',', fill=TRUE, as.is=TRUE)
#    V1   V2  V3
# 1 +1C         
# 2 +1C +2CC    
# 3   0         
# 4 +1C +2CC -1C
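
To attach the pieces to the original data frame with NA in place of the empty strings, something like the following sketch may help. It assumes a data frame named mito with an indel column, rebuilt here from the example rows in the question; the column names indel1 to indel3 and the fixed count of three pieces are illustrative choices, not part of the original answer.

```r
# Rebuild the example data from the question (assumed layout)
mito <- data.frame(indel = c("+1C", "+1C,+2CC", "0", "+1C,+2CC,-1C"),
                   stringsAsFactors = FALSE)

# Split on commas; fill = TRUE pads short rows with empty strings and
# colClasses = "character" keeps "0" as text instead of a number
parts <- read.table(text = as.character(mito$indel), sep = ",", fill = TRUE,
                    colClasses = "character",
                    col.names = paste0("indel", 1:3))

# Turn the empty padding into explicit missing values
parts[parts == ""] <- NA

# Attach the three new columns next to the original one
mito <- cbind(mito, parts)
print(mito)
```

Supplying col.names also fixes the number of columns up front, so read.table does not have to infer it from the first few lines of input.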

OTHER TIPS

Maybe the splitstackshape package (read.concat is an internal, non-exported function, hence the ":::"):

library(splitstackshape)
dat <- read.table(text="+1C
+1C,+2CC
0
+1C,+2CC,-1C", header=FALSE)

splitstackshape:::read.concat(dat[, 1], "var", ",")

##  var_1 var_2 var_3
## 1   +1C            
## 2   +1C  +2CC      
## 3     0            
## 4   +1C  +2CC   -1C

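A second base R approach, although @Matthew's is a much better one:

dat2 <- strsplit(as.character(dat[, 1]), ",")
# Number of pieces in the longest row (3 for this data)
n <- max(lengths(dat2))
# Extend every vector of pieces to length n (the extra slots become NA),
# then stack the vectors as the rows of a character matrix
do.call(rbind, lapply(dat2, function(x) {
    length(x) <- n
    x
}))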

##      [,1]  [,2]   [,3] 
## [1,] "+1C" NA     NA   
## [2,] "+1C" "+2CC" NA   
## [3,] "0"   NA     NA   
## [4,] "+1C" "+2CC" "-1C"
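
For completeness, the sapply/strsplit approach from the question also seems to extend directly to a third piece: strsplit breaks each string at every comma, so the third expression is simply element 3 of each result, and indexing past the end of a shorter vector gives NA. The sketch below assumes the mito data frame from the question (rebuilt here from the example rows) and a hard-coded maximum of three pieces.

```r
# Rebuild the example data from the question (assumed layout)
mito <- data.frame(indel = c("+1C", "+1C,+2CC", "0", "+1C,+2CC,-1C"),
                   stringsAsFactors = FALSE)

# Split once; each list element holds all comma-separated pieces of one row
pieces <- strsplit(as.character(mito$indel), ",")

# x[i] returns NA when a row has fewer than i pieces,
# so rows with one or two expressions need no special handling
for (i in 1:3) {
  mito[[paste0("indel", i)]] <- vapply(pieces, function(x) x[i], character(1))
}
print(mito)
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector; sapply with the same function should behave the same way on this data.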

Licensed under: CC-BY-SA with attribution