How to split data 70:30 and get a different range of data every time you split it

https://stackoverflow.com/questions/19983160

30-07-2022

Question

I'm currently using R to do feature selection with Random Forest regression. I want to split my data 70:30, which is easy enough to do. However, I want to do this 10 times, each time obtaining a different set of examples from the one before.

> trainIndex<- createDataPartition(lipids$RT..seconds., p=0.7, list=F)
> lipids.train <- lipids[trainIndex, ]
> lipids.test <- lipids[-trainIndex, ]

This is what I'm doing at the moment, and it works great for splitting my data 70:30. But when I do it again, I get the same 70% of the data in my training set and the same 30% of the data in my test set. I know this is how createDataPartition works, but is there a way of making it so that I get a different 70% of the data the next time I run it?

Thanks

Solution

In the future, please include the packages you're using since createDataPartition is not in base R. I'm assuming you're using the caret package. If that is correct, did you find the times argument?

trainIndex<- createDataPartition(lipids$RT..seconds., p=0.7, list=F, times=10)
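With list = FALSE and times = 10, createDataPartition returns a matrix with one column of row indices per resample. A minimal sketch of turning that matrix into ten train/test pairs follows; the lipids data frame here is a made-up stand-in (the real data isn't shown in the question), and the splits name is only for illustration.

```r
library(caret)

# Hypothetical stand-in for the asker's `lipids` data frame
lipids <- data.frame(RT..seconds. = runif(100, 0, 600), x = rnorm(100))

# Matrix of training-row indices: one column per resample
trainIndex <- createDataPartition(lipids$RT..seconds., p = 0.7,
                                  list = FALSE, times = 10)

# Build one train/test pair from each column
splits <- lapply(seq_len(ncol(trainIndex)), function(i) {
  idx <- trainIndex[, i]
  list(train = lipids[idx, ], test = lipids[-idx, ])
})
```

Each element of splits can then be passed to the model-fitting code in a loop, e.g. splits[[1]]$train and splits[[1]]$test.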

As mentioned in the comment, you can just as simply use sample:

sample(seq_along(lipids$RT..seconds.), as.integer(0.7 * nrow(lipids)))

sample draws from R's random number generator, whose state advances on every call. So unless you reset the seed with set.seed beforehand, each run gives you a different set of row indices.
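To repeat the sample-based split 10 times, one option is to wrap it in replicate. This is a rough sketch in base R; the lipids data frame is again a made-up stand-in, and n_train and splits are illustrative names.

```r
# Hypothetical stand-in for the asker's `lipids` data frame
lipids <- data.frame(RT..seconds. = runif(100, 0, 600), x = rnorm(100))

n_train <- as.integer(0.7 * nrow(lipids))

# Ten independent 70:30 splits; no set.seed() inside the loop,
# so each iteration draws a fresh set of row indices
splits <- replicate(10, {
  idx <- sample(seq_len(nrow(lipids)), n_train)
  list(train = lipids[idx, ], test = lipids[-idx, ])
}, simplify = FALSE)
```

Calling set.seed once before the replicate call would make the whole sequence of ten splits reproducible while still keeping the splits different from one another.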

OTHER TIPS

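library(dplyr)
# Number of rows to put in the 70% partition
n <- as.integer(nrow(data) * 0.7)
# Randomly pick n rows for the 70% partition
data_70 <- data[sample(nrow(data), n), ]
# The remaining rows form the 30% partition.
# Note: anti_join() matches on all shared columns, so this assumes rows are unique.
data_30 <- anti_join(data, data_70)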

Licensed under: CC-BY-SA with attribution
