Why does subset not work with a vector name identical to a column name?

https://stackoverflow.com/questions/10815696

r
subset

11-06-2021

Question

I came across a confusing "feature" of the subset function (using a column name as a vector name for subsetting does not work):

data(iris)
Species <- unique(iris$Species)
i <- 2
Species[i]
subset(iris, subset = Species == Species[i])

sp <- unique(iris$Species)
sp[i]
subset(iris, subset = Species == sp[i])

Could someone explain what happens here and why?

Solution

subset() first looks inside the data frame for any object you mention, so in your first example Species[i] returns 'setosa' (the same as iris$Species[i]). Only when an object cannot be found in the data frame does R look in the enclosing environment (here, the environment subset() was called from), and that is where it finds i and, in your second example, sp.
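A minimal sketch of the difference, using the question's own objects and the built-in iris data set. The names a and b are only illustrative.

```r
# Objects from the question; iris ships with base R (package datasets).
Species <- unique(iris$Species)
sp <- unique(iris$Species)
i <- 2

# Inside subset(), `Species` resolves to the column iris$Species,
# so Species[i] is iris$Species[2], which is "setosa".
a <- subset(iris, subset = Species == Species[i])

# `sp` is not a column, so it is looked up outside the data frame,
# and sp[i] is "versicolor".
b <- subset(iris, subset = Species == sp[i])

as.character(unique(a$Species))  # "setosa"
as.character(unique(b$Species))  # "versicolor"
```

Giving the vector a name that is not a column name, as in the second call, is enough to get the intended rows.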

So it all does work; it just works differently from what you expected. The help file (?subset) says as much:

Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).

How does this come about?

The reason is the following two lines of code in subset() (more precisely, in its data frame method, subset.data.frame):

e <- substitute(subset)
r <- eval(e, x, parent.frame())

subset (or e) is, in your example, Species == Species[i]
x is, in your example, iris
parent.frame() is, in your example, the global environment.
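To see these two lines at work in isolation, here is a stripped-down sketch. The name my_subset is hypothetical, and the function leaves out the select and drop handling of the real subset.data.frame; it only reproduces the lookup behaviour.

```r
# Hypothetical, simplified stand-in for subset.data.frame.
my_subset <- function(x, subset) {
  e <- substitute(subset)           # capture the unevaluated expression
  r <- eval(e, x, parent.frame())   # look in x first, then in the caller
  x[r & !is.na(r), , drop = FALSE]  # keep rows where the condition is TRUE
}

Species <- unique(iris$Species)
sp <- unique(iris$Species)
i <- 2

# `Species` is found in the data frame, so this compares against iris$Species[2].
res_col <- my_subset(iris, Species == Species[i])

# `sp` is only found in the calling environment, so this compares against sp[2].
res_vec <- my_subset(iris, Species == sp[i])
```

res_col holds the setosa rows and res_vec the versicolor rows, matching what subset() itself returns for the two calls in the question.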

The second argument in the call to eval(), here x, is the envir argument. It is the environment (or list, or data frame, ...) in which the expression is evaluated. In your case, R evaluates Species == Species[i] inside x, which is your data frame.

The third argument, parent.frame(), is the enclosure (the enclos argument). This is the environment that encloses the data frame you specified as envir, and it is where R looks when a variable isn't found in the data frame.
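The roles of envir and enclos can also be checked by calling eval() directly. This is a small sketch; the names here, in_df and in_env are only illustrative, and environment() stands in for the environment the code is run from (the global environment at top level).

```r
Species <- unique(iris$Species)
i <- 2
here <- environment()   # the environment this code runs in

e <- quote(Species[i])

# envir = iris: `Species` is the column; `i` is not a column,
# so it is taken from the enclosure `here`.
in_df <- eval(e, iris, here)

# envir = here: no data frame is involved, so `Species` is the vector.
in_env <- eval(e, here)

as.character(in_df)   # "setosa"
as.character(in_env)  # "versicolor"
```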