Question

I have the following dataset elections:

var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- cbind(var1, var2, var3)

Which looks like this:

var1   var2 var3
------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB
255422 PAN  PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL
233452 PJ   DEM/PJ

I want to disaggregate var3 into eight additional variables, var4 to var11, filled with the pieces of var3 that are separated by /. The result I want is this:

var1   var2 var3                             var4 var5  var6 var7 var8 var9 var10 var11
---------------------------------------------------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB                PCB  PTdoB PCO  PRB
255422 PAN  PAN                              PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV           DEM  PR    PT   PSDB PMDB PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL DEM  PR    PT   PSDB PMDB PV   PSTU  PSOL
233452 PJ   DEM/PJ                           DEM  PJ

I was able to get close with strsplit(elections$var3, '/'), but the problem is that this returns a list, so the pieces don't line up as separate columns: it behaves as I want only when var3 contains a single element, not when it contains several.

Any ideas?


Solution

A direct way would be to use read.csv (or read.table) on that variable (either before or after you add it to your existing dataset). Here, I've used read.csv, whose default fill = TRUE argument pads the shorter rows so the data split the way you are expecting.

Here's an example:

read.csv(text = elections[, "var3"], sep = "/", header = FALSE)
#    V1    V2  V3   V4   V5 V6   V7   V8
# 1 PCB PTdoB PCO  PRB                  
# 2 PAN                                 
# 3 DEM    PR  PT PSDB PMDB PV          
# 4 DEM    PR  PT PSDB PMDB PV PSTU PSOL
# 5 DEM    PJ   

Or, possibly (if your dataset is a data.frame):

read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
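If the goal is to end up with columns named var4 to var11 as in the question, the split columns can be renamed and bound back onto the data. A sketch (not from the original answer; split_cols and elections_wide are names I've chosen for illustration):

```r
var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV",
          "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- data.frame(var1, var2, var3, stringsAsFactors = FALSE)

# Split var3 on "/"; fill = TRUE (the read.csv default) pads short rows
split_cols <- read.csv(text = elections$var3, sep = "/", header = FALSE,
                       stringsAsFactors = FALSE)

# Rename V1..V8 to var4..var11 and bind back onto the original data
names(split_cols) <- paste0("var", seq_len(ncol(split_cols)) + 3)
elections_wide <- cbind(elections, split_cols)
```

The widest row (eight parties) determines the number of new columns, so var4 to var11 appear automatically here.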

This approach is essentially what concat.split from my "splitstackshape" package takes, though that function does a bit more checking and conveniently binds the output back onto the original dataset.

Assuming "elections" is now a data.frame, usage would be:

library(splitstackshape)
concat.split(elections, "var3", "/", drop = TRUE)
#     var1 var2 var3_1 var3_2 var3_3 var3_4 var3_5 var3_6 var3_7 var3_8
# 1 125677  PRB    PCB  PTdoB    PCO    PRB                            
# 2 255422  PAN    PAN                                                 
# 3 475544   PR    DEM     PR     PT   PSDB   PMDB     PV              
# 4 333344   PV    DEM     PR     PT   PSDB   PMDB     PV   PSTU   PSOL
# 5 233452   PJ    DEM     PJ                                          

Update

Ultimately, however, read.csv is somewhat slow (so, by extension, the concat.split approach is slow too). For a revision of the function, the approach I'm working on is along the following lines, until I come up with something better:

myMat <- function(inVec, sep) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  # One column per piece in the longest string: count separators, add 1
  nCols <- max(vapply(gregexpr(sep, inVec, fixed = TRUE), length, 1L)) + 1
  # Pre-fill with "" so shorter rows are padded automatically
  M <- matrix("", ncol = nCols, nrow = length(inVec))
  Spl <- strsplit(inVec, sep, fixed = TRUE)
  Len <- vapply(Spl, length, 1L)
  # (row, column) index pairs for every split piece
  Ind <- cbind(rep(seq_along(Len), Len), sequence(Len))
  # Matrix indexing drops all the pieces into place in one assignment
  M[Ind] <- unlist(Spl)
  M
}
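As a quick sanity check, here is how myMat behaves on a small sample (the function is repeated so the snippet runs on its own; the empty-string padding matches the blanks in the read.csv output above):

```r
# myMat as defined above, repeated so this snippet is standalone
myMat <- function(inVec, sep) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  nCols <- max(vapply(gregexpr(sep, inVec, fixed = TRUE), length, 1L)) + 1
  M <- matrix("", ncol = nCols, nrow = length(inVec))
  Spl <- strsplit(inVec, sep, fixed = TRUE)
  Len <- vapply(Spl, length, 1L)
  Ind <- cbind(rep(seq_along(Len), Len), sequence(Len))
  M[Ind] <- unlist(Spl)
  M
}

M <- myMat(c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PJ"), "/")
dim(M)   # 3 rows, 4 columns (set by the longest string)
M[1, 1]  # "PCB"
M[2, 2]  # "" -- shorter rows are padded with empty strings
```

The result is a character matrix; it can be converted with as.data.frame and bound onto the original data with cbind if data.frame columns are wanted.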

Some benchmarks

Sample data:

var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- data.frame(var1, var2, var3)

Functions to evaluate:

fun1 <- function() myMat(elections$var3, "/")
fun2 <- function() read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)

The results:

microbenchmark(fun1(), fun2())
# Unit: microseconds
#    expr     min        lq   median        uq      max neval
#  fun1() 159.936  175.5445  193.291  244.6075  566.188   100
#  fun2() 974.151 1017.1280 1070.796 1690.0100 2146.724   100

BIGGER data (but still not very big):

elections <- do.call(rbind, replicate(5000, elections, simplify = FALSE))
dim(elections)
# [1] 25000     3

microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median       uq       max neval
#  fun1()  195.1358  211.8841  232.1093  287.560  324.6918    10
#  fun2() 2764.8115 3524.7989 3626.1480 3639.303 3728.2099    10

I ran out of patience waiting for fun2() on one million rows; fun1() takes about 19 seconds on that size, which is OK, but not something I'm totally happy with.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow