Question

I want to split mydata, which has 2673 observations and 23 variables, into training and test sets. However, I cannot create the test set by simply excluding the training rows.

dim(mydata)
## [1] 2673   23
set.seed(1)
train = mydata[sample(1:nrow(mydata), 1000, replace=FALSE), ]
dim(train)
## [1] 1000   23

When I run the following, I get 19 warnings and the result has 20,062 observations:

test = mydata[!train, ]
## There were 19 warnings (use warnings() to see them)
dim(test)
## [1] 20062    23

What am I doing wrong?

Solution

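The expression !train negates the values stored in train (which is likely where the warnings come from) rather than identifying which rows were sampled, so it cannot select the complementary rows. A possible solution is to store the sampled indices in a separate named vector and reuse them: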

train_idx <- sample(1:nrow(mydata),1000,replace=FALSE)
train <- mydata[train_idx,] # select all these rows
test <- mydata[-train_idx,] # select all but these rows
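
To check that the two pieces really partition the data, here is a minimal self-contained sketch. The data frame below is a hypothetical stand-in for mydata (only the row count matches the question; the columns id, x and g are made up for illustration).

```r
# Hypothetical stand-in for mydata: 2673 rows, 3 made-up columns
set.seed(1)
mydata <- data.frame(id = 1:2673,
                     x  = rnorm(2673),
                     g  = sample(letters, 2673, replace = TRUE))

train_idx <- sample(nrow(mydata), 1000)  # 1000 distinct row positions
train <- mydata[train_idx, ]             # keep the sampled rows
test  <- mydata[-train_idx, ]            # negative indices drop the sampled rows

# The two sets are disjoint and together cover every row
stopifnot(nrow(train) == 1000, nrow(test) == 1673)
stopifnot(length(intersect(train$id, test$id)) == 0)
stopifnot(setequal(c(train$id, test$id), mydata$id))
```

The negative index works because train_idx holds row positions; it would not work with a logical vector or with the train data frame itself.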

Alternatively, because a data.frame's row.names must be unique, you can select the rows whose names do not appear in train:

test <- mydata[!(row.names(mydata) %in% row.names(train)), ]

However, the second solution is 2x slower when benchmarked with microbenchmark() on mydata <- data.frame(a=1:100000, b=rep(letters, len=100000)).
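
A rough way to reproduce that comparison is sketched below. The wrapper functions by_index and by_names are hypothetical names introduced here, and the fallback to system.time() is an assumption for machines where the microbenchmark package is not installed. Timings will vary by machine, so no expected numbers are shown.

```r
mydata <- data.frame(a = 1:100000, b = rep(letters, length.out = 100000))

set.seed(1)
train_idx <- sample(nrow(mydata), 1000)
train <- mydata[train_idx, ]

# Hypothetical wrappers around the two approaches from the answer
by_index <- function() mydata[-train_idx, ]
by_names <- function() mydata[!(row.names(mydata) %in% row.names(train)), ]

# Both approaches should select the same rows
stopifnot(identical(row.names(by_index()), row.names(by_names())))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(by_index(), by_names(), times = 20))
} else {
  # Coarser fallback using base R only
  print(system.time(for (i in 1:20) by_index()))
  print(system.time(for (i in 1:20) by_names()))
}
```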

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow