Question

Consider a "large-ish" data set (~2-5M rows) that goes through multiple stages of cleaning/processing:

library(dplyr)
largedat %>%
  mutate(
    # overwrite v1 based on others
    v1    = somefunc(v1,v2,v3,v4),
    errv2 = anotherfunc(v2,v5)
  ) %>%
  group_by(v5) %>%
  mutate(
    v6    = otherfunc(v7,v8,v9),
    errv7 = fourthfunc(v7,v9)
  ) %>%
  ungroup() %>%
  mutate(
    v2 = if_else(errv2, NA, v2),
    v7 = if_else(errv7, NA, v7)
  )

Assume, with some hand-waving, that there is sufficient need to keep things broken out like this (and that some portions might be faster if done manually in base R). The helper functions here are clearly "functional": they have no side-effects, are given explicit vectors as arguments, and return a vector of the same length (or length 1). In that sense, clean. The trade-off is the potential for a lot of copying of the data (depending on how many intermediate copies are made along the way).
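
For concreteness, a hypothetical shape for one of those functions (the arithmetic is a placeholder, not the real cleaning rule):

somefunc <- function(v1, v2, v3, v4) {
  # purely vectorised: no side-effects, returns a vector the same length as v1
  ifelse(is.na(v1), v2 + v3, v1 * v4)
}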

With data.table, in-place operations are the norm: the side-effect is by design, an intentional decision that buys considerable improvements in memory use and speed.
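
A quick illustration of that on a toy table (not largedat): := modifies the existing object rather than allocating a new one.

library(data.table)
DT <- data.table(x = 1:3)
data.table::address(DT)  # address of the underlying object
DT[, y := x * 2L]        # add a column by reference
data.table::address(DT)  # same address: the table was not copied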

A more "functional" approach is still quite possible:

library(data.table)
setDT(largedat)
largedat[, v1 := somefunc(v1, v2, v3, v4)]
errv2 <- largedat[, anotherfunc(v2, v5)]
largedat[, v6 := otherfunc(v7, v8, v9), by = v5]
errv7 <- largedat[, fourthfunc(v7, v9)]  # grouping by v5 elided; see the ordering caveat below
# ...
# eventually using the changes
largedat[, c("v2", "v7") := list(ifelse(errv2, NA, v2), ifelse(errv7, NA, v7))]

This still preserves the functional, side-effect-free use of the functions, but it can be slightly cumbersome. If at least one of these functions outputs a full data.table instead of just a vector, it gets a little more complicated, especially when grouping with by="..." (where the rows of the functional output come back grouped rather than in the original row order; ref: https://stackoverflow.com/q/11680579/3358272).
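
A tiny illustration of that ordering point on toy data: grouped output comes back grouped, so it cannot simply be bound back onto the original rows without carrying a row id.

library(data.table)
dt <- data.table(g = c("b", "a", "b", "a"), x = 1:4)
dt[, .(x10 = x * 10), by = g]
#    g x10
# 1: b  10
# 2: b  30
# 3: a  20
# 4: a  40   (rows are grouped, not in the original 1,2,3,4 order)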

Another attempt might be to adapt the functions to be in-place operators, something like:

somefunc(largedat)    # replaces v1
anotherfunc(largedat) # optionally nullifies v2
# ...
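
A sketch of what one of those in-place functions might look like (same placeholder logic as above; it modifies its argument by reference and returns it invisibly so calls can still be chained):

somefunc <- function(dat) {
  # replaces v1 in dat itself; no copy is made
  dat[, v1 := ifelse(is.na(v1), v2 + v3, v1 * v4)]
  invisible(dat[])
}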

or perhaps

out <- largedat[, somefunc(.SD)
                ][, anotherfunc(.SD)
                ][, otherfunc(.SD), by = "v5"
                ][, fourthfunc(.SD), by = "v5"]
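
For this chained form, each function has to accept the whole table (passed as .SD) and return a full data.table for the next [ in the chain; a hypothetical shape, again with placeholder logic:

somefunc <- function(sd) {
  # .SD is read-only inside j, so work on a copy to stay side-effect-free
  out <- data.table::copy(sd)
  out[, v1 := ifelse(is.na(v1), v2 + v3, v1 * v4)]
  out[]
}

Note that with by = "v5" the function sees only one group's rows at a time (and .SD excludes v5 itself), which is where the ordering caveat above comes back into play.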

For simple projects, whatever works (reliably) is often best, but for longer-living packages where flexibility and reliability are required, are there distinct (dis)advantages to the in-place side-effect-based functions as used in the last two code samples?


Solution

Doing operations in-place gives you efficient use of space (and of time, if copying the data takes long relative to the operation itself), at the cost of changing the actual data, which may be shared with other entities and thus creates dependencies.
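
A small illustration of that dependency, on a toy table:

library(data.table)
a <- data.table(x = 1:3)
b <- a              # not a copy: b is just another reference to the same table
a[, x := x * 2L]    # update a in place
b$x                 # 2 4 6 -- b has silently changed as well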

Having operations without side-effects makes it easier to parallelise them (in most cases) and to build fault-tolerance into them, at the cost of extra space.

Regarding your question, what do you mean by "longer-living packages"? Are you exposing an API, with your package acting as a library for other engineers?
Some libraries and programming languages mark operations as in-place vs. copied (the bang (!) method-naming convention in Ruby, for instance), yet you can achieve the same effect by giving the user the option to clone an object and making all of the library's operations in-place (data.table lets you use copy(), and dplyr lets you use tbl_df).
So exposing an API that performs the operations as efficiently as possible for the usual case, and letting the user decide whether to copy the data beforehand (if they need zero side-effects), is a good choice in my opinion; a minimal sketch follows below.
("Usual case" here meaning the typical R environment, which is not distributed and mostly runs single-threaded, in memory.)
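
A hypothetical shape for that kind of API (the function name, the in_place argument, and the reuse of somefunc are all illustrative, not from any existing package):

clean_v1 <- function(dat, in_place = TRUE) {
  if (!in_place) dat <- data.table::copy(dat)   # caller opts out of the side-effect
  dat[, v1 := somefunc(v1, v2, v3, v4)]         # otherwise, modify by reference
  invisible(dat[])
}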

Licensed under: CC-BY-SA with attribution