Question

I am updating a set of functions that previously only accepted data.frame objects to work with data.table arguments.

I decided to implement the function using R's method dispatch so that the old code using data.frames will still work with the updated functions. In one of my functions, I take in a data.frame as input, modify it, and then return the modified data.frame. I created a data.table implementation as well. For example:

# The functions
foo <- function(d) {
  UseMethod("foo")
}

foo.data.frame <- function(d) {
  # <Do Something>
  return(d)
}

foo.data.table <- function(d) {
  # <Do Something>
  return(d)
}

I know that data.table works by making changes without copying, and I implemented foo.data.table while keeping that in mind. However, I return the data.table object at the end of the function because I want my old scripts to work with the new data.table objects. Will this make a copy of the data.table? How can I check? According to the documentation, one has to be very explicit to create a copy of a data.table, but I am not sure in this case.

The reason I want to return something when I do not have to with data.tables:

My old scripts look like this

someData <- read.table(...)
...
someData <- foo(someData)

I want the scripts to be able to run with data.tables by just changing the data ingest lines. In other words, I want the script to work by just changing someData <- read.table(...) to someData <- fread(...).

Solution

Thanks to Arun for his answer in the comments. I will use the example from his comments to answer the question.

One can check if copies are being made by using the tracemem function to track an object in R. From the help file of the function, ?tracemem, the description says:

This function marks an object so that a message is printed whenever the internal code copies the object. It is a major cause of hard-to-predict memory use in R.

For example:

# Using a data.frame
df <- data.frame(x=1:5, y=6:10)
tracemem(df)
## [1] "<0x32618220>"
df$y[2L] <- 11L
## tracemem[0x32618220 -> 0x32661a98]: 
## tracemem[0x32661a98 -> 0x32661b08]: $<-.data.frame $<- 
## tracemem[0x32661b08 -> 0x32661268]: $<-.data.frame $<- 
df
##   x  y
## 1 1  6
## 2 2 11
## 3 3  8
## 4 4  9
## 5 5 10

# Using a data.table
library(data.table)
dt <- data.table(x=1:5, y=6:10)
tracemem(dt)
## [1] "<0x5fdab40>"
set(dt, i=2L, j=2L, value=11L) # No memory output!
address(dt) # Verify the address in memory is the same
## [1] "0x5fdab40"
dt
##    x  y
## 1: 1  6
## 2: 2 11
## 3: 3  8
## 4: 4  9
## 5: 5 10

The tracemem output shows that the data.frame is copied when a single element is changed (three copy messages above; the exact count can vary with the R version), while set() modifies the data.table in place without making any copies!

For my question, this means I can call tracemem on the data.table or data.frame object, d, before passing it to the function, foo, and watch for copy messages to check whether any copies were made.
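As a minimal sketch of that check applied to the pattern in the question: the method bodies below are hypothetical stand-ins for <Do Something> (they add a column z), and address() from data.table is used instead of tracemem so the result can be compared programmatically. It assumes the data.table method only uses := or set().

```r
library(data.table)

# Hypothetical stand-ins for the question's foo(); the real bodies are unknown.
foo <- function(d) UseMethod("foo")

foo.data.frame <- function(d) {
  d$z <- d$x * 2L      # ordinary assignment: R works on a copy
  return(d)
}

foo.data.table <- function(d) {
  d[, z := x * 2L]     # := adds the column by reference
  return(d)            # returning d does not copy it
}

someData <- data.table(x = 1:5, y = 6:10)
before <- address(someData)   # memory address before the call

someData <- foo(someData)     # same calling pattern as the old scripts

after <- address(someData)    # memory address after the call
identical(before, after)      # TRUE means no copy was made
```

If the two addresses match, the old calling pattern someData <- foo(someData) simply rebinds the name to the same object.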

OTHER TIPS

Not sure this adds anything, but as a cautionary tale note the following behavior:

library(data.table)
foo.data.table <- function(d) {
  d[,A:=4]
  d$B <- 1
  d[,C:=1]
  return(d)
}
set.seed(1)
dt     <- data.table(A=rnorm(5),B=runif(5),C=rnorm(5))
dt
#             A         B            C
# 1: -0.6264538 0.2059746 -0.005767173
# 2:  0.1836433 0.1765568  2.404653389
# 3: -0.8356286 0.6870228  0.763593461
# 4:  1.5952808 0.3841037 -0.799009249
# 5:  0.3295078 0.7698414 -1.147657009
result <- foo.data.table(dt)
dt
#    A         B            C
# 1: 4 0.2059746 -0.005767173
# 2: 4 0.1765568  2.404653389
# 3: 4 0.6870228  0.763593461
# 4: 4 0.3841037 -0.799009249
# 5: 4 0.7698414 -1.147657009
result
#    A B C
# 1: 4 1 1
# 2: 4 1 1
# 3: 4 1 1
# 4: 4 1 1
# 5: 4 1 1

So, evidently, dt is passed by reference to foo.data.table(...) and the first statement, d[,A:=4], modifies it by reference, changing column A in dt.

The second statement, d$B <- 1, forces the creation of a copy of d (also named d) that is local to the function. The third statement, d[,C:=1], then modifies that local copy by reference (so it does not affect dt), and return(d) returns the copy.

If you change the order of the second and third statements, the effect of the function call on dt is different.
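To make that concrete, here is a sketch of the same function with the second and third statements swapped. The name foo_swapped is hypothetical and used only for this illustration; the comments describe what I would expect from the behavior explained above.

```r
library(data.table)

# Same body as foo.data.table above, with statements two and three swapped.
foo_swapped <- function(d) {
  d[, A := 4]   # by reference: changes the caller's table
  d[, C := 1]   # still by reference: also changes the caller's table
  d$B <- 1      # copies d; only the local copy gets the new B
  return(d)     # returns the local copy
}

set.seed(1)
dt <- data.table(A = rnorm(5), B = runif(5), C = rnorm(5))
b_before <- copy(dt$B)        # keep the original B values for comparison

result <- foo_swapped(dt)

all(dt$A == 4)                # A was changed in dt
all(dt$C == 1)                # C was changed in dt as well this time
identical(dt$B, b_before)     # B in dt is untouched
all(result$B == 1)            # only the returned copy has B set to 1
```

Here both A and C end up modified in dt, because both := statements run before the copy is made. Avoiding $<- inside the function (using := or set() throughout) keeps the behavior consistent.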

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow