Question

I tried to use the following code to subset the iris data

datanew = subset(iris, Species == c("setosa", "virginica"), select = -Species)

but the result contains only rows 1, 3, 5, 7, ... Why did I end up with only the odd rows?


Solution

Short answer: because of vector recycling, http://cran.r-project.org/doc/manuals/R-intro.html#Vector-arithmetic.

Long answer: in a comparison of the form x == y, if the vectors x and y differ in length, the shorter one is recycled (repeated) until it matches the length of the longer one.

For example,

> x <- c(1, 1, 1, 2, 2, 2)
> y <- c(1, 2)
> x == y
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE

The result is effectively that of c(1, 1, 1, 2, 2, 2) == c(1, 2, 1, 2, 1, 2): the second, shorter vector has been recycled.
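To make the recycling visible, here is a minimal sketch that builds the recycled vector by hand with base R's rep_len and checks that the comparison gives the same answer. The name y_recycled is only illustrative.

```r
x <- c(1, 1, 1, 2, 2, 2)
y <- c(1, 2)

# Repeat y until it is as long as x: 1 2 1 2 1 2
y_recycled <- rep_len(y, length(x))

# The implicit recycling in x == y matches the explicit version
same <- identical(x == y, x == y_recycled)
print(y_recycled)
print(same)
```

Writing the recycled vector out like this is only for illustration; R does the equivalent internally whenever the lengths differ.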

subset evaluates the condition as an ordinary vectorized comparison, so the recycling rule applies there as well.

To get what you want, you have to replace == with %in%, as @Vincent has pointed out in another thread.
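As a quick check of the difference, the sketch below counts how many rows each condition selects. It assumes the unmodified built-in data set, referred to as datasets::iris so that any changes made to a local copy of iris do not matter. The names wanted, n_eq and n_in are only illustrative.

```r
iris_orig <- datasets::iris
wanted <- c("setosa", "virginica")

# == recycles `wanted`, so each row is tested against only one of the two names
n_eq <- sum(iris_orig$Species == wanted)

# %in% tests every row for membership in the whole vector
n_in <- sum(iris_orig$Species %in% wanted)

print(c(equals = n_eq, in_set = n_in))
```

With == only half of the setosa and virginica rows are selected, while %in% selects all of them.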

OTHER TIPS

I think you want:

subset(iris, Species %in% c("setosa", "virginica"), select = -Species)

It is not returning only odd rows! Look at the end of the result:

tail(subset(iris, Species == c("setosa", "virginica"), select = -Species))

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 140          6.9         3.1          5.4         2.1
## 142          6.9         3.1          5.1         2.3
## 144          6.8         3.2          5.9         2.3
## 146          6.7         3.0          5.2         2.3
## 148          6.5         3.0          5.2         2.0
## 150          5.9         3.0          5.1         1.8

This is due to R's recycling. Look at the output of iris$Species == c("setosa", "virginica"): it alternates between testing Species == "setosa" (odd rows) and Species == "virginica" (even rows), so you get the odd-numbered setosa rows and the even-numbered virginica rows. Since the data happens to have an even number of rows, c("setosa", "virginica") recycles with no remainder, and R recycles silently, without a warning.
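The alternation can be checked directly. The following is a sketch only, assuming the original 150-row data set (taken from datasets::iris); the names targets, recycled and kept are illustrative.

```r
iris_orig <- datasets::iris

# What each row is actually compared against: setosa, virginica, setosa, ...
targets <- rep_len(c("setosa", "virginica"), nrow(iris_orig))
recycled <- as.character(iris_orig$Species) == targets

# Row numbers that survive the == condition
kept <- which(recycled)

setosa_odd     <- all(kept[kept <= 50] %% 2 == 1)  # setosa occupies rows 1-50
virginica_even <- all(kept[kept > 100] %% 2 == 0)  # virginica occupies rows 101-150

print(head(kept))
print(tail(kept))
```

The first kept rows are odd (setosa) and the last kept rows are even (virginica), which matches the tail output shown above.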

If we add another row, so that the number of rows is odd, we get warning messages:

iris <- rbind(iris, tail(iris, 1))

foo <- subset(iris, Species == c("setosa", "virginica"))

## Warning messages:
## 1: In is.na(e1) | is.na(e2) :
##   longer object length is not a multiple of shorter object length
## 2: In `==.default`(Species, c("setosa", "virginica")) :
##   longer object length is not a multiple of shorter object length

You want to use %in%, which tests each value of Species for membership in the vector:

datanew <- subset(iris, Species %in% c("setosa", "virginica"), select = -Species)

head(datanew)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

tail(datanew)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 145          6.7         3.3          5.7         2.5
## 146          6.7         3.0          5.2         2.3
## 147          6.3         2.5          5.0         1.9
## 148          6.5         3.0          5.2         2.0
## 149          6.2         3.4          5.4         2.3
## 150          5.9         3.0          5.1         1.8
Licensed under: CC-BY-SA with attribution