Question

I do the following

library(data.table)
library(stringr)        
dt <- data.table(string_column = paste(sample(c(letters, " "), 500000, replace = TRUE)
                                     , sample(c(letters, " "), 500000, replace = TRUE)
                                     , sample(1:500000)
                                 , sep = " "), key = "string_column") 

split_res <- dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]

For real data, it takes approx. 1 hour to process dt (10M rows) and create split_res (18M rows) Out of curiosity - is there way to speed up process? Maybe unlist + str_split is not the right way of doing this?

Was it helpful?

Solution

You'll get a big speedup if you just ditch using str_split() from "stringr" and just use strsplit().

fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

system.time(fun1())
#    user  system elapsed 
#  172.41    0.05  172.82 

system.time(fun2())
#    user  system elapsed 
#   11.22    0.01   11.23 

Whether this will make your processing time down from one hour to 4 minutes or not, I'm not sure. But at least you won't have to remember to put in those pesky underscores in your function names :-)


If you can split on a fixed search pattern, you can use the fixed = TRUE argument, which will give you another substantial speed boost.


Another thing to consider is to do the process manually:

x <- strsplit(dt$string_column, "\\s+")
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
DT[, name := unlist(x, use.names = FALSE)]
DT

With your sample data:

fun4 <- function() {
  x <- strsplit(dt$string_column, "\\s+")
  DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
  DT[, name := unlist(x, use.names = FALSE)]
  DT
}
#    user  system elapsed 
#    1.79    0.01    1.82

However, the answer is not the same as what I get with fun2(), but that's because you have duplicated values in "string_column". If you add an "id" column and do the same, you will get the same results.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top