You'll get a big speedup if you ditch str_split() from "stringr" and use base R's strsplit() instead.
fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]
system.time(fun1())
# user system elapsed
# 172.41 0.05 172.82
system.time(fun2())
# user system elapsed
# 11.22 0.01 11.23
Whether this will bring your processing time down from one hour to 4 minutes, I'm not sure. But at least you won't have to remember to type those pesky underscores in your function names :-)
If you can split on a fixed search pattern, you can use the fixed = TRUE argument, which will give you another substantial speed boost.
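To illustrate (toy input, not the question's data): with fixed = TRUE the pattern is treated as a literal string rather than a regular expression, so the regex engine is skipped entirely.

```r
x <- c("a b c", "d e")

# Literal split on a single space -- no regex machinery involved
strsplit(x, " ", fixed = TRUE)

# Regex split on one-or-more whitespace -- needed if runs of
# spaces or tabs can occur, but slower
strsplit(x, "\\s+")
```

Both calls give the same result here because the words are separated by exactly one space; fixed = TRUE only applies when your delimiter really is a fixed string.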
Another thing to consider is doing the process manually:
x <- strsplit(dt$string_column, "\\s+")
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
DT[, name := unlist(x, use.names = FALSE)]
DT
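The key step is the rep(sequence(nrow(dt)), ...) indexing, which repeats each row index once per token that row splits into. A small illustration with made-up data (not the question's sample):

```r
library(data.table)
dt <- data.table(string_column = c("one two", "three four five"))

x <- strsplit(dt$string_column, "\\s+")
# Row 1 has 2 tokens, row 2 has 3, so the index vector is 1 1 2 2 2
rep(sequence(nrow(dt)), vapply(x, length, 1L))
# [1] 1 1 2 2 2
```

Subsetting dt with that vector duplicates each row the right number of times, and the unlisted tokens then slot in column-wise with :=.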
With your sample data:
fun4 <- function() {
  x <- strsplit(dt$string_column, "\\s+")
  DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
  DT[, name := unlist(x, use.names = FALSE)]
  DT
}
system.time(fun4())
# user system elapsed
# 1.79 0.01 1.82
However, the answer is not the same as what I get with fun2(), but that's because you have duplicated values in "string_column". If you add an "id" column and do the same, you will get the same results.
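A sketch of the "id" column idea, again with made-up duplicated data: grouping by a unique per-row id keeps duplicated strings in separate groups, so the by-group split matches the manual approach row for row.

```r
library(data.table)
dt <- data.table(string_column = c("a b", "a b"))  # duplicated values

# Grouping by the string collapses the two identical rows into one group:
dt[, list(name = unlist(strsplit(string_column, "\\s+"))), by = string_column]
# 2 result rows -- one group for the duplicated string

# Adding a unique id keeps each original row as its own group:
dt[, id := .I]
dt[, list(name = unlist(strsplit(string_column, "\\s+"))), by = id]
# 4 result rows -- matches what the manual rep()/unlist() approach produces
```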