Fast way to split string and convert to long format in data.table

Question

You'll get a big speedup if you drop str_split() from "stringr" and use base R's strsplit() instead.

library(data.table)
library(stringr)
fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

system.time(fun1())
#    user  system elapsed 
#  172.41    0.05  172.82 

system.time(fun2())
#    user  system elapsed 
#   11.22    0.01   11.23

Whether this will bring your processing time down from one hour to 4 minutes, I'm not sure. But at least you won't have to remember to put in those pesky underscores in your function names :-)

If you can split on a fixed search pattern, you can use the fixed = TRUE argument, which will give you another substantial speed boost.
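
As a minimal sketch of the fixed = TRUE option, assuming the words are separated by exactly one space (the vector "s" below is made-up illustration data, not the sample data from the question):

```r
# Hypothetical input: words separated by exactly one space
s <- c("a b c", "d e", "f")

# Regex split: the pattern is interpreted as a regular expression
regex_split <- strsplit(s, "\\s+")

# Literal split: fixed = TRUE treats " " as a plain string, not a regex
fixed_split <- strsplit(s, " ", fixed = TRUE)

# Both give the same pieces for this input
identical(regex_split, fixed_split)
```

Note that a literal " " is not a drop-in replacement for "\\s+": if the data contains tabs or runs of several spaces, the fixed split leaves the tabs in place and returns empty strings between consecutive spaces.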

Another thing to consider is to do the process manually:

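# Split every string once, outside of any grouping
x <- strsplit(dt$string_column, "\\s+")
# Repeat each row of dt once per piece it was split into
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
# Attach the pieces as a new column, by reference
DT[, name := unlist(x, use.names = FALSE)]
DT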

With your sample data:

fun4 <- function() {
  x <- strsplit(dt$string_column, "\\s+")
  DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
  DT[, name := unlist(x, use.names = FALSE)]
  DT
}

system.time(fun4())
#    user  system elapsed 
#    1.79    0.01    1.82

However, the result is not identical to what I get with fun2(). That's because you have duplicated values in "string_column" and fun2() groups by that column, so rows with the same string end up collected together. If you add an "id" column and group by it as well, you will get the same results.
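
A small sketch of that check, using a made-up table with one duplicated string and a hypothetical "id" column (neither comes from the original sample data):

```r
library(data.table)

# Hypothetical data with a duplicated value in string_column
dt <- data.table(string_column = c("a b", "c d e", "a b"))
dt[, id := seq_len(.N)]  # one id per original row

# Grouped approach, now grouped by the unique id as well
res_by <- dt[, list(name = unlist(strsplit(string_column, "\\s+"))),
             by = list(id, string_column)]

# Manual approach: repeat each row once per piece, then attach the pieces
x <- strsplit(dt$string_column, "\\s+")
res_manual <- dt[rep(seq_len(nrow(dt)), lengths(x))]
res_manual[, name := unlist(x, use.names = FALSE)]

# Align column order before comparing the two results
setcolorder(res_manual, names(res_by))
all.equal(res_by, res_manual)
```

With the id in the grouping, every group is a single original row, so the grouped result keeps the original row order and lines up with the manual one.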