Question

When I use open source R without a specific package, it is not possible to handle data sets larger than the available RAM. So I would like to know whether it is possible to handle big data sets by applying PL/R functions inside PostgreSQL.

I haven't found any documentation about this.

Solution

As mentioned by Hong Ooi, PL/R loads an R interpreter into the PostgreSQL backend process. So your R code is running "in database".
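
To make this concrete, here is a minimal sketch of a PL/R function (the function and data are invented for illustration; the LANGUAGE plr syntax and the arg1 naming convention for unnamed parameters follow the PL/R documentation). The body between the $$ markers is plain R, executed inside the backend, and the float8[] argument arrives as an ordinary R numeric vector:

    -- PL/R ships as an extension on PostgreSQL 9.1+
    CREATE EXTENSION IF NOT EXISTS plr;

    -- The body is plain R code; the unnamed float8[] parameter
    -- is available inside R as arg1.
    CREATE OR REPLACE FUNCTION r_median(float8[]) RETURNS float8 AS $$
      median(arg1)
    $$ LANGUAGE plr;

    SELECT r_median(ARRAY[1.0, 4.0, 9.0, 16.0]);  -- returns 6.5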

There is no universal way to deal with memory limitations, but there are at least two possible options:

  1. Define a custom PostgreSQL aggregate and use your PL/R function as the "final" function. That way the data is processed group by group, so you are less likely to run into memory problems. See the online PostgreSQL documentation and the PL/R documentation for more detail, and the sketch after this list.
  2. Use the pg.spi.cursor_open and pg.spi.cursor_fetch functions that PL/R installs into the R interpreter to page data into your R function in chunks (sketched below, after the documentation link).
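
As a sketch of option 1 (the table, column, and function names are made up for illustration; array_append is a standard PostgreSQL function used here as the state-transition step): the aggregate accumulates each group's values into an array, and the PL/R final function then runs R on one group at a time, so only a single group's values have to fit in R's memory rather than the whole table.

    -- PL/R final function: receives one group's values as an R numeric vector.
    CREATE OR REPLACE FUNCTION r_quantile75(float8[]) RETURNS float8 AS $$
      quantile(arg1, probs = 0.75, names = FALSE)
    $$ LANGUAGE plr;

    -- Aggregate: array_append collects each group's rows,
    -- r_quantile75 finishes the group in R.
    CREATE AGGREGATE quantile75 (float8) (
      sfunc     = array_append,
      stype     = float8[],
      initcond  = '{}',
      finalfunc = r_quantile75
    );

    -- Each group is handed to R separately, never the full table at once.
    SELECT customer_id, quantile75(order_total)
    FROM orders
    GROUP BY customer_id;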

See PL/R docs here: http://www.joeconway.com/plr/doc/index.html
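
And a sketch of option 2, again against a hypothetical orders table: pg.spi.prepare, pg.spi.cursor_open, pg.spi.cursor_fetch, and pg.spi.cursor_close are the SPI wrappers PL/R installs into the interpreter (check the docs above for the exact signatures and return values). The point is that only one chunk of rows is held in R at any time.

    -- Sum a column without ever pulling the whole table into R.
    CREATE OR REPLACE FUNCTION chunked_sum(integer) RETURNS float8 AS $$
      chunk_size <- arg1   # first argument: rows to fetch per round trip

      plan   <- pg.spi.prepare("SELECT order_total FROM orders")
      cursor <- pg.spi.cursor_open("orders_cur", plan)

      total <- 0
      repeat {
        # Fetch the next chunk as a data.frame; stop when the cursor is exhausted.
        chunk <- pg.spi.cursor_fetch(cursor, TRUE, as.integer(chunk_size))
        if (is.null(chunk) || nrow(chunk) == 0) break
        total <- total + sum(chunk$order_total)
      }

      pg.spi.cursor_close(cursor)
      total
    $$ LANGUAGE plr;

    -- e.g. SELECT chunked_sum(10000);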

I am guessing that what you would really like is a data.frame whose data is paged to and from an underlying database cursor transparently to your R code. This is on my long-term TODO list, but unfortunately I have not been able to find the time to work it out. I have been told that Oracle's R connector has this feature, so it seems it can be done. Patches welcome ;-)

OTHER TIPS

No. PL/R simply runs your code in a standard R interpreter. It uses exactly the same R engine you would use from the command line, so all the usual memory limitations still apply.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow