I have the following dataset, elections:

var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- cbind(var1, var2, var3)

Which looks like this:

var1   var2 var3
------------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB
255422 PAN  PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL
233452 PJ   DEM/PJ

I want to disaggregate var3 into eight additional variables, var4 to var11, filled with the values separated by / in var3. The result I want is this:

var1   var2 var3                             var4 var5  var6 var7 var8 var9 var10 var11
----------------------------------------------------------------------------------------
125677 PRB  PCB/PTdoB/PCO/PRB                PCB  PTdoB PCO  PRB
255422 PAN  PAN                              PAN
475544 PR   DEM/PR/PT/PSDB/PMDB/PV           DEM  PR    PT   PSDB PMDB PV
333344 PV   DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL DEM  PR    PT   PSDB PMDB PV   PSTU  PSOL
233452 PJ   DEM/PJ                           DEM  PJ

I was able to get close to the result I want with strsplit(elections$var3, '/'), but the problem is that this produces a list. It works when var3 contains a single value, but not when it contains more than one.
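For illustration, here is the list it produces (using matrix indexing, since elections was built with cbind):

# strsplit returns a list of character vectors of varying length,
# which cannot be bound directly into fixed-width columns
strsplit(elections[, "var3"], "/")
# [[1]]
# [1] "PCB"   "PTdoB" "PCO"   "PRB"
#
# [[2]]
# [1] "PAN"
# ...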

Any ideas?


Solution

A direct way would be to use read.csv (or read.table) on that variable (either before or after you add it to your existing dataset). Here, I've used read.csv, which defaults to fill = TRUE, an argument that lets you split the data into padded columns the way you are expecting.

Here's an example:

read.csv(text = elections[, "var3"], sep = "/", header = FALSE)
#    V1    V2  V3   V4   V5 V6   V7   V8
# 1 PCB PTdoB PCO  PRB                  
# 2 PAN                                 
# 3 DEM    PR  PT PSDB PMDB PV          
# 4 DEM    PR  PT PSDB PMDB PV PSTU PSOL
# 5 DEM    PJ   

Or, possibly (if your dataset is a data.frame):

read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)
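If you then want those pieces appended as var4 to var11, a plain cbind works; this is just a sketch, with the new names assumed from the desired output above:

# Sketch: bind the split columns onto the original data and rename
# V1..V8 to var4..var11 (names assumed from the desired output above)
out <- cbind(as.data.frame(elections, stringsAsFactors = FALSE),
             read.csv(text = elections[, "var3"], sep = "/", header = FALSE))
names(out)[-(1:3)] <- paste0("var", seq_len(ncol(out) - 3) + 3)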

This is essentially the approach taken by concat.split from my "splitstackshape" package, though that function does a bit more checking and conveniently combines the output back into the original dataset.

Assuming now "elections" is a data.frame, usage would be:

library(splitstackshape)
concat.split(elections, "var3", "/", drop = TRUE)
#     var1 var2 var3_1 var3_2 var3_3 var3_4 var3_5 var3_6 var3_7 var3_8
# 1 125677  PRB    PCB  PTdoB    PCO    PRB                            
# 2 255422  PAN    PAN                                                 
# 3 475544   PR    DEM     PR     PT   PSDB   PMDB     PV              
# 4 333344   PV    DEM     PR     PT   PSDB   PMDB     PV   PSTU   PSOL
# 5 233452   PJ    DEM     PJ                                          

Update

Ultimately, however, read.csv is somewhat slow (so, by extension, the concat.split approach is slow too). The approach I'm working on for a revised version of the function is along the following lines, until I come up with something better:

myMat <- function(inVec, sep) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  # Number of columns needed: most separators in any one string, plus one
  nCols <- max(vapply(gregexpr(sep, inVec, fixed = TRUE), length, 1L)) + 1
  # Pre-allocate an empty character matrix to hold the pieces
  M <- matrix("", ncol = nCols, nrow = length(inVec))
  Spl <- strsplit(inVec, sep, fixed = TRUE)
  Len <- vapply(Spl, length, 1L)
  # (row, column) index pairs for every piece; matrix indexing fills in one step
  Ind <- cbind(rep(seq_along(Len), Len), sequence(Len))
  M[Ind] <- unlist(Spl)
  M
}
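Applied to the sample data (using the data.frame version of elections set up in the benchmarks below), it returns a character matrix padded with empty strings:

myMat(elections$var3, "/")
#      [,1]  [,2]    [,3]  [,4]   [,5]   [,6] [,7]   [,8]
# [1,] "PCB" "PTdoB" "PCO" "PRB"  ""     ""   ""     ""
# [2,] "PAN" ""      ""    ""     ""     ""   ""     ""
# [3,] "DEM" "PR"    "PT"  "PSDB" "PMDB" "PV" ""     ""
# [4,] "DEM" "PR"    "PT"  "PSDB" "PMDB" "PV" "PSTU" "PSOL"
# [5,] "DEM" "PJ"    ""    ""     ""     ""   ""     ""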

Some benchmarks

Sample data:

var1 <- c("125677", "255422", "475544", "333344", "233452")
var2 <- c("PRB", "PAN", "PR", "PV", "PJ")
var3 <- c("PCB/PTdoB/PCO/PRB", "PAN", "DEM/PR/PT/PSDB/PMDB/PV", "DEM/PR/PT/PSDB/PMDB/PV/PSTU/PSOL", "DEM/PJ")
elections <- data.frame(var1, var2, var3)

Functions to evaluate:

fun1 <- function() myMat(elections$var3, "/")
fun2 <- function() read.csv(text = as.character(elections$var3), sep = "/", header = FALSE)

The results:

library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: microseconds
#    expr     min        lq   median        uq      max neval
#  fun1() 159.936  175.5445  193.291  244.6075  566.188   100
#  fun2() 974.151 1017.1280 1070.796 1690.0100 2146.724   100

BIGGER data (but still not very big):

elections <- do.call(rbind, replicate(5000, elections, simplify = FALSE))
dim(elections)
# [1] 25000     3

microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median       uq       max neval
#  fun1()  195.1358  211.8841  232.1093  287.560  324.6918    10
#  fun2() 2764.8115 3524.7989 3626.1480 3639.303 3728.2099    10

I ran out of patience waiting for fun2() on one million rows; fun1() takes about 19 seconds, which is OK, but not something I'm totally happy with.
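For reference, the million-row test can be reconstructed along these lines (the exact data behind that timing isn't shown here, so this setup is an assumption):

# Sketch only: scale the 25,000-row data up 40x to 1,000,000 rows
elections1m <- do.call(rbind, replicate(40, elections, simplify = FALSE))
nrow(elections1m)
# [1] 1000000
system.time(myMat(elections1m$var3, "/"))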
