A direct way would be to use read.csv (or read.table) on that variable, either before or after you add it to your existing dataset. Here, I've used read.csv, which defaults to fill = TRUE, an argument that lets you split the data the way you are expecting to.
Here's an example:
read.csv(text = elections[, "var3"], sep = "/", header = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 PCB PTdoB PCO PRB
# 2 PAN
# 3 DEM PR PT PSDB PMDB PV
# 4 DEM PR PT PSDB PMDB PV PSTU PSOL
# 5 DEM PJ
Or, possibly (if your dataset is a data.frame):
read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
This approach is essentially what concat.split from my "splitstackshape" package takes, though it does a bit more checking and will conveniently combine the output back into the original dataset. Assuming now that "elections" is a data.frame, usage would be:
library(splitstackshape)
concat.split(elections, "var3", "/", drop = TRUE)
# var1 var2 var3_1 var3_2 var3_3 var3_4 var3_5 var3_6 var3_7 var3_8
# 1 125677 PRB PCB PTdoB PCO PRB
# 2 255422 PAN PAN
# 3 475544 PR DEM PR PT PSDB PMDB PV
# 4 333344 PV DEM PR PT PSDB PMDB PV PSTU PSOL
# 5 233452 PJ DEM PJ
Update
Ultimately, however, read.csv is somewhat slow (so, by extension, the concat.split approach is also slow). The approach I'm working on for a revision of the function is along the following lines, until I come up with something better:
myMat <- function(inVec, sep) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  # One more column than the most separators found in any single element
  nCols <- max(vapply(gregexpr(sep, inVec, fixed = TRUE), length, 1L)) + 1
  M <- matrix("", ncol = nCols, nrow = length(inVec))
  Spl <- strsplit(inVec, sep, fixed = TRUE)
  Len <- vapply(Spl, length, 1L)
  # A two-column (row, column) index matrix fills M in a single assignment
  Ind <- cbind(rep(seq_along(Len), Len), sequence(Len))
  M[Ind] <- unlist(Spl)
  M
}
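As a quick sanity check, the function can be run directly on a small vector in the same format as "var3" (the vector here is made up for illustration; myMat is as defined above):

```r
# A few made-up values in the same "/"-separated format as "var3"
x <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PJ")
out <- myMat(x, "/")
out
# The matrix has as many columns as the longest split (4 here);
# shorter rows, like "PAN", are padded with empty strings
```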
Some benchmarks
Sample data:
var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- data.frame(var1, var2, var3)
Functions to evaluate:
fun1 <- function() myMat(elections$var3, "/")
fun2 <- function() read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
The results:
microbenchmark(fun1(), fun2())
# Unit: microseconds
# expr min lq median uq max neval
# fun1() 159.936 175.5445 193.291 244.6075 566.188 100
# fun2() 974.151 1017.1280 1070.796 1690.0100 2146.724 100
BIGGER data (but still not very big):
elections <- do.call(rbind, replicate(5000, elections, simplify = FALSE))
dim(elections)
# [1] 25000 3
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 195.1358 211.8841 232.1093 287.560 324.6918 10
# fun2() 2764.8115 3524.7989 3626.1480 3639.303 3728.2099 10
I ran out of patience waiting for fun2() on one million rows, but fun1() takes about 19 seconds, which is OK, but not something I'm totally happy with.