sqldf and maintainability of R code base [closed]

https://stackoverflow.com/questions/21641966

08-10-2022

Question

If one is building a substantial, organization-wide code base in R, is it acceptable practice to rely on the sqldf package as the default approach for data munging tasks? Or is it best practice to rely on operations with R-specific syntax where possible? By relying on sqldf, one introduces a substantial amount of a different syntax, SQL, into the R code base.

I'm asking this question with specific regard to maintainability and style. I've searched existing R style guides and did not find anything on this subject.

EDIT: To clarify the workflow I'm concerned with, consider a data munging script making ample use of sqldf as follows:

library(sqldf)

# Total trips per cluster
gclust_group <- sqldf("SELECT clust, SUM(trips) AS trips2
                       FROM gclust
                       GROUP BY clust")

# Attach cluster centers and trip totals to each row of highestd
gclust_group2 <- sqldf("SELECT g.*, h.Longitude, h.Latitude, h.withinss, s.trips2
                        FROM highestd g
                        LEFT JOIN centers h
                        ON g.clust = h.clust
                        LEFT JOIN gclust_group s
                        ON g.clust = s.clust")

Such a script could continue for many lines. (For those familiar with Hadoop and Pig, the style is actually similar to a Pig script.) Most of the work is done using SQL syntax, albeit with the benefit of avoiding complex subqueries.

Answer

Write functions. Functions with clear names that describe their purpose. Document them. Write tests.

Whether the functions contain sqldf parts, use dplyr, use bare R code, or call Rcpp is irrelevant at that level.
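As a minimal sketch of that advice, here is the "total trips per cluster" step from the question wrapped in a documented function with a small test. The function name `sum_trips_by_cluster` and the test data are made up for illustration, and the body uses base R's `aggregate` only so that the sketch runs without extra packages; the sqldf query from the question (or a dplyr pipeline) could sit behind the same interface.

```r
#' Total trips per cluster.
#'
#' @param gclust A data frame with columns `clust` and `trips`.
#' @return A data frame with columns `clust` and `trips2`,
#'   one row per cluster, ordered by `clust`.
sum_trips_by_cluster <- function(gclust) {
  stopifnot(
    is.data.frame(gclust),
    all(c("clust", "trips") %in% names(gclust))
  )
  # Implementation detail, hidden from callers: base R here,
  # chosen for this sketch only so that it has no dependencies.
  out <- aggregate(trips ~ clust, data = gclust, FUN = sum)
  names(out) <- c("clust", "trips2")
  out
}

# A small test that pins down the contract, not the implementation.
test_sum_trips_by_cluster <- function() {
  input <- data.frame(clust = c(1, 1, 2), trips = c(3, 4, 5))
  result <- sum_trips_by_cluster(input)
  stopifnot(
    identical(names(result), c("clust", "trips2")),
    nrow(result) == 2,
    all(result$clust == c(1, 2)),
    all(result$trips2 == c(7, 5))
  )
  invisible(TRUE)
}

test_sum_trips_by_cluster()
```

Callers only see the name, the documented columns, and the test, so the body can change without touching the rest of the script.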

But if you want to try changing something from sqldf to dplyr, the important thing is that you have a stable platform on which to experiment, which means well-defined functions and a good set of tests. Maybe there's a bottleneck in one function that might run 100x faster if you do it with dplyr? Great, you can profile and test the code with both.
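One way to set up such an experiment is sketched below: two implementations of the same contract are first checked for identical results and then timed with `system.time`. The helper `compare_impls`, the two implementation names, and the random data are assumptions made for illustration. Both implementations use base R (`aggregate` and `rowsum`) as stand-ins so the sketch runs as is; a sqldf or dplyr version would be added to the list in the same way.

```r
# Two interchangeable implementations of the same contract:
# data frame (clust, trips) -> data frame (clust, trips2), ordered by clust.
impl_aggregate <- function(gclust) {
  out <- aggregate(trips ~ clust, data = gclust, FUN = sum)
  names(out) <- c("clust", "trips2")
  out
}

impl_rowsum <- function(gclust) {
  # rowsum() returns a one-column matrix with rows sorted by group
  sums <- rowsum(gclust$trips, group = gclust$clust)
  data.frame(clust = as.numeric(rownames(sums)), trips2 = as.vector(sums))
}

# Check that every implementation agrees with the first one,
# then return the elapsed time in seconds of one run of each.
compare_impls <- function(impls, input) {
  results <- lapply(impls, function(f) f(input))
  reference <- results[[1]]
  for (name in names(results)) {
    stopifnot(
      identical(names(results[[name]]), names(reference)),
      isTRUE(all.equal(results[[name]], reference, check.attributes = FALSE))
    )
  }
  vapply(impls, function(f) system.time(f(input))[["elapsed"]], numeric(1))
}

# Made-up data, large enough for timing differences to show up.
set.seed(1)
n <- 10000
gclust <- data.frame(
  clust = sample(1:50, n, replace = TRUE),
  trips = sample(1:20, n, replace = TRUE)
)

timings <- compare_impls(
  list(aggregate = impl_aggregate, rowsum = impl_rowsum),
  gclust
)
print(timings)
```

A single `system.time` call is a rough measure; for a real decision, repeated runs on representative data would be more trustworthy.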

You can even branch your code, keeping a sqldf branch and a dplyr branch in your revision control system (you are using one, right?), and work on both in parallel until you have a winner.

From a maintainability perspective, it honestly doesn't matter whether you introduce other bits of syntax into your R code, as long as your codebase is well documented and tested.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow