Question

This one is almost a philosophical question: is it bad to access and/or set slots of S4 objects directly using @?

I have always been told it was bad practice, and that users should use "accessor" S4 methods, and that developers should provide their users with these. But I'd like to know if anybody knows the real deal behind this?

Here's an example using the sp package (but could be generalised for any S4 class):

> library(sp)
> foo <- data.frame(x = runif(5), y = runif(5), bar = runif(5))
> coordinates(foo) <- ~x+y
> class(foo)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"

> str(foo)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
  ..@ data       :'data.frame': 5 obs. of  1 variable:
  .. ..$ bar: num [1:5] 0.621 0.273 0.446 0.174 0.278
  ..@ coords.nrs : int [1:2] 1 2
  ..@ coords     : num [1:5, 1:2] 0.885 0.763 0.591 0.709 0.925 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:2] "x" "y"
  ..@ bbox       : num [1:2, 1:2] 0.591 0.155 0.925 0.803
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "x" "y"
  .. .. ..$ : chr [1:2] "min" "max"
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
  .. .. ..@ projargs: chr NA

> foo@data
        bar
1 0.6213783
2 0.2725903
3 0.4458229
4 0.1743419
5 0.2779656
> foo@data <- data.frame(bar = letters[1:5], baz = runif(5))
> foo@data
  bar        baz
1   a 0.22877446
2   b 0.93206667
3   c 0.28169866
4   d 0.08616213
5   e 0.36713750

Solution

In general, it is good programming practice to separate the content of an object from its interface; see this Wikipedia article. Keeping the interface separate from the implementation means the implementation can change considerably without affecting any code that talks to it, e.g. your script. Using @ therefore produces less robust code that is less likely to still work in a few years' time. For example, in the sp package mentioned by @mdsummer, the way polygons are stored might change for reasons of speed or as knowledge progresses. If you used @, your code breaks; if you used the interface, your code still works. Except, of course, if the interface also changes, but changes to the implementation are much more likely than changes to the interface.

Other tips

In this question, a Stack Overflow user asks why they can't find the end slot in a Bioconductor IRanges object; after all, there are start(), width(), and end() accessors, and start and width slots. The answer is that the way users interface with the class differs from how it is implemented. In this case, the implementation is driven by the simple observation that it is not space-efficient to store three values (start, end, width) when two (which two? up to the developer!) are sufficient. Similar but deeper examples of divergence between interface and implementation are present in other S4 objects, and in common S3 instances like the one returned by lm, where the data stored in the class is appropriate for subsequent calculation rather than tailored to represent the quantities a particular user might be most interested in. Nothing good will come of reaching into that lm instance and changing a value, e.g. the coefficients element. This separation of interface from implementation gives the developer a lot of freedom to provide a reasonable and constant user experience, perhaps shared with other similar classes, while implementing (and changing the implementation of) classes in ways that make programming sense.
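The start/width/end situation described above can be sketched with a toy class. This is only an illustration: the class SimpleRanges and the generics rangeStart(), rangeWidth() and rangeEnd() are hypothetical names chosen to avoid clashing with base R, and this is not how IRanges itself is written.

```r
library(methods)

# Hypothetical class: only start and width are stored, end is never stored.
setClass("SimpleRanges", slots = c(start = "integer", width = "integer"))

# The interface exposes three quantities, whatever the storage looks like.
setGeneric("rangeStart", function(x) standardGeneric("rangeStart"))
setGeneric("rangeWidth", function(x) standardGeneric("rangeWidth"))
setGeneric("rangeEnd",   function(x) standardGeneric("rangeEnd"))

setMethod("rangeStart", "SimpleRanges", function(x) x@start)
setMethod("rangeWidth", "SimpleRanges", function(x) x@width)
# end is derived from the two stored slots
setMethod("rangeEnd",   "SimpleRanges", function(x) x@start + x@width - 1L)

r <- new("SimpleRanges", start = c(1L, 10L), width = c(5L, 3L))

rangeEnd(r)               # 5 12, computed on demand
"end" %in% slotNames(r)   # FALSE: there is no end slot to reach with @
```

Code written against rangeEnd() would keep working if the class were later changed to store start and end instead; code that reads r@width directly would not.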

This doesn't really answer your question, I guess, but the developer is not expecting the user to directly access slots and the user should not expect direct slot access to be an appropriate way to interact with the class.

In short, the developer should provide methods for every use case, but in practice this is pretty darn hard and it is complicated to cover every possible use. Technically and as far as I am concerned, if you need more than the developer provides and you must use "@" to get at unexposed features, then you are a developer (the distinction is happily blurry here in GNU software).

The sp package is a good example to ask this question about, since the complications of hierarchical data structures required by "polygons" and "lines" throw up some pretty simple issues. Here is one:

The coordinates() method for polygons and lines returns only a centroid for each object, whereas for points it returns every "coordinate" in the object. That is because "Points" are "one-to-one": one object, one coordinate, which is also true for SpatialPoints and SpatialPointsDataFrame. It is not true for Line and Polygon, Lines and Polygons, SpatialLines and SpatialPolygons, or SpatialLinesDataFrame and SpatialPolygonsDataFrame. These are inherently composed of line tracks (>2 coordinates) or polygon "rings" (>3 coordinates). How do you get the coordinate of every vertex in every Polygon of a multi-branched SpatialPolygons object? You cannot, unless you delve into the developer structure with "@".

Is it remiss of the developers not to have provided this? No, the advantages very much outweigh the problems any particular user can see in hindsight. In general, the fact that you can delve in is a massive bonus, but you automatically take on the onus of the developer, and you probably make the situation harder if you choose to share your efforts without wrapping them in methods.
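As a rough sketch of what "delving in" looks like for the vertex question, the helper below walks the nested slots of a SpatialPolygons object. The function all_vertices() and the two example shapes are made up for illustration and are not part of sp; the code assumes the slot layout SpatialPolygons@polygons, Polygons@Polygons and Polygon@coords stays as it is in current sp releases, which is exactly the kind of assumption an accessor would spare you.

```r
library(sp)

# Hypothetical helper, not part of sp: stack the vertices of every ring.
# It depends on sp's internal slot names and breaks if they ever change.
all_vertices <- function(x) {
  per_feature <- lapply(x@polygons, function(feature) {
    do.call(rbind, lapply(feature@Polygons, function(ring) ring@coords))
  })
  do.call(rbind, per_feature)
}

# Two made-up closed rings: a unit square (5 rows) and a triangle (4 rows)
square   <- Polygon(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))
triangle <- Polygon(cbind(c(2, 2, 3, 2), c(0, 1, 0, 0)))
shapes <- SpatialPolygons(list(
  Polygons(list(square), ID = "square"),
  Polygons(list(triangle), ID = "triangle")
))

label_points <- coordinates(shapes)   # one row per Polygons object
vertices     <- all_vertices(shapes)  # one row per stored vertex
```

Keeping the slot access inside a single function at least confines the dependence on sp's internals to one place.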

As the developer of an S4 class, my opinion is:

If you read slots with @, you do that at your own risk (like pretty much everything you do in R - see below for some famous examples). That being said, the slots of an S4 class are actually part of the documented interface.

The main advantage I see in access via @ is speed:

> microbenchmark (accessor = wl (chondro), direct = chondro@wavelength)
Unit: nanoseconds
      expr    min       lq   median       uq    max
1 accessor 333431 341289.5 346784.5 366737.5 654219
2   direct    165    212.5    395.0    520.0   1440

(The accessor function does validity checking in addition to returning the @wavelength slot, which causes the difference. I'd expect every decent public accessor function to ensure validity.)

I even recommend read access to the slots of my class in time-critical situations (e.g. if lots of subsets of the same object are accessed, it may be worthwhile to skip checking the validity of an unchanged object every time). In the code of my package I predominantly read the slots directly, ensuring validity at the beginning of functions and at the end of functions where the object could have become invalid. One may argue that the (R) design decision that @<- does not check validity causes an enormous overhead in practice: methods working on S4 objects cannot rely on the object being valid, so even methods with purely read access have to do validity checking.

If you think about write access to a slot, you should really know what you are doing. @<- does not do any validity checking, the official write accessor should do that. And, the write accessor possibly does much more than just update one slot in order to keep the object's state consistent.

So, if you write into a slot, expect to find yourself in hell and do not complain. ;-)
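A minimal sketch of the difference between @<- and a write accessor, using a toy class. The class Spectrum and its replacement function are hypothetical and only loosely modelled on the situation described above; they are not the API of any real package.

```r
library(methods)

# Hypothetical class: the intensity vector must match the wavelength axis.
setClass("Spectrum",
  slots = c(wavelength = "numeric", intensity = "numeric"),
  validity = function(object) {
    if (length(object@wavelength) != length(object@intensity))
      "wavelength and intensity must have the same length"
    else
      TRUE
  })

# Write accessor: updates the slot, then re-checks the whole object.
setGeneric("wavelength<-", function(x, value) standardGeneric("wavelength<-"))
setReplaceMethod("wavelength", "Spectrum", function(x, value) {
  x@wavelength <- value
  validObject(x)   # signals an error if the object became inconsistent
  x
})

s <- new("Spectrum", wavelength = c(400, 500, 600),
         intensity = c(0.1, 0.4, 0.2))

# Direct write: accepted silently, although the object is now inconsistent.
broken <- s
broken@wavelength <- c(400, 500)
direct_is_valid <- isTRUE(validObject(broken, test = TRUE))

# Same write through the accessor: rejected, and s is left untouched.
accessor_result <- tryCatch({
  wavelength(s) <- c(400, 500)
  "accepted"
}, error = function(e) "rejected")
```

Here direct_is_valid ends up FALSE and accessor_result is "rejected". @<- only checks that the new value has the right class for the slot; consistency across slots is left to validObject() or to the accessor.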

Thinking a bit further along the philosophical line: my package is public under the GPL. I not only allow you to adapt the code to your needs, I want to encourage you to develop and adapt it. It is actually really easy in R: everything is already there in a normal interactive R session, including access to the slots. This is quite in line with design decisions that make R very powerful but also allow things like

> T <- FALSE
> `+` <- `-`
> pi <- 3
> pi + 2
[1] 1
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow