Question

I have a large character vector, file, and I need to draw a random sample from it. That works fine. But I need to draw sample after sample, so after each draw I want to remove the drawn elements from file, so that a new sample never contains an element that was already drawn.

I have a couple of attempts, but I'm interested in anything that might work faster and, more importantly, correctly.

Here are my attempts:

Approach 1

file <- rep(1:10000)
rand_no <- sample(file, 100)

library(car)
a <- data.frame()

for (i in 1:length(rand_no)){
     a <- rbind(a, which.names(rand_no[i], file))
     file <- file[-a[1,1]]
}

Problem:

Warning message:
In which.names(rand_no[i], file) : 297 not matched

Approach 2

file <- rep(1:10000)
rand_no <- sample(file, 100)

library(car)
deleter <- function(i) {
   a <- which.names(rand_no[i], file)
   file <- file[-a]
}

lapply(1:length(rand_no), deleter)

Problem: This doesn't work at all. Maybe I should split the question, because the second problem clearly comes from my not fully understanding lapply.

Thanks for any suggestions.

Edit

I had hoped it would work with numbers, but of course file actually looks like this:

file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv", "Post-19960101T000000Z-3.tsv","Post-19960101T000000Z-4.tsv", "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv", "Post-19960101T000000Z-7.tsv","Post-19960101T000000Z-9.tsv")

Of course rand_no can't hold 100 files with such a small example. Therefore:

 rand_no <- sample(file, 2)

Solution

Use list instead of c to build file. Then you can assign NULL to elements and they will be removed.

file[file %in% rand_no] <- NULL finds all elements of file that are in rand_no and removes them.

file <- list("Post-19960101T000000Z-1.tsv",
 "Post-19960101T000000Z-2.tsv",
 "Post-19960101T000000Z-3.tsv",
 "Post-19960101T000000Z-4.tsv",
 "Post-19960101T000000Z-5.tsv",
 "Post-19960101T000000Z-6.tsv",
 "Post-19960101T000000Z-7.tsv",
 "Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)

# library(car) from the poster's code is not needed here.
# Assigning NULL to list elements deletes them.
file[file %in% rand_no] <- NULL
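
If file stays a plain character vector, as in the question, assigning NULL raises an error, but negating %in% gives the same result without converting to a list. A minimal sketch, assuming the file names are unique; the name remaining is only illustrative and not from the original posts:

```r
# Sketch: the same removal on a character vector (no list needed).
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv",
          "Post-19960101T000000Z-3.tsv", "Post-19960101T000000Z-4.tsv",
          "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv",
          "Post-19960101T000000Z-7.tsv", "Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)

# Keep only the elements that were not drawn.
# `remaining` is a hypothetical name; assign back to `file` to shrink it in place.
remaining <- file[!file %in% rand_no]
```

This keeps the original order of the undrawn elements, which setdiff(file, rand_no) would also do here, though setdiff additionally drops duplicated values.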

If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.

file <- list("Post-19960101T000000Z-1.tsv",
             "Post-19960101T000000Z-2.tsv",
             "Post-19960101T000000Z-3.tsv",
             "Post-19960101T000000Z-4.tsv",
             "Post-19960101T000000Z-5.tsv",
             "Post-19960101T000000Z-6.tsv",
             "Post-19960101T000000Z-7.tsv",
             "Post-19960101T000000Z-9.tsv")
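rand_no <- sample(seq_along(file), 2)  # rand_no now holds positions, not values

# library(car) from the poster's code is not needed here.
file[rand_no] <- NULL  # Delete the elements at the drawn positions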

OTHER TIPS

sample() already returns values in a permuted order without replacement (unless you set replace = TRUE), so it will never pick the same element twice.
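
A quick check of that claim, as a sketch: with the default replace = FALSE, each position is drawn at most once, so a sample from a vector of unique values contains no duplicates. One caveat is that if the input itself contains repeated values, the same value can still appear more than once. The names dup_input and full_draw are illustrative only:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
file <- 1:10000
rand_no <- sample(file, 100)        # replace = FALSE is the default
anyDuplicated(rand_no) == 0         # TRUE: no element drawn twice

# Caveat: repeated values in the input can reappear in the sample.
dup_input <- c("a", "a", "b")       # hypothetical vector with a repeated value
full_draw <- sample(dup_input, 3)   # draws every position exactly once
sum(full_draw == "a") == 2          # TRUE: "a" appears twice
```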

So if you want three samples of 100 elements each that don't share any elements, you can use

file <- rep(1:10000)
rand_no <- sample(seq_along(file), 300)

s1<-file[rand_no[1:100]]
s2<-file[rand_no[101:200]]
s3<-file[rand_no[201:300]]

Or, if you wanted to decrease the total size by 100 each time, you could do

s1<-file[-rand_no[1:100]]
s2<-file[-rand_no[1:200]]
s3<-file[-rand_no[1:300]]
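
The same idea extends to any number of disjoint samples by shuffling indices once and splitting them into blocks. A sketch under the assumption that file has at least n_sets * set_size elements; n_sets, set_size, idx, groups, and samples are illustrative names, not part of the original answer:

```r
file <- 1:10000
n_sets <- 3      # hypothetical: number of disjoint samples
set_size <- 100  # hypothetical: elements per sample

# One draw without replacement, then cut into consecutive blocks.
idx <- sample(seq_along(file), n_sets * set_size)
groups <- rep(seq_len(n_sets), each = set_size)
samples <- lapply(split(idx, groups), function(i) file[i])

# No element is shared between samples.
anyDuplicated(unlist(samples)) == 0
```

Here samples is a list with one vector per draw, so samples[[2]] plays the role of s2 in the first snippet.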

A simple approach would be to select random indices and then remove those indices:

file <- 1:10000  # Build sample data
ind <- sample(seq_along(file), 100)  # Select random indices
rand_no <- file[ind]  # Compute the actual values selected
file <- file[-ind]  # Remove selected indices
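
To draw sample after sample, these steps can be repeated in a loop, with the pool shrinking on each pass. A sketch, assuming the pool holds at least n_draws * size elements; pool, draws, n_draws, and size are illustrative names:

```r
pool <- 1:10000   # stands in for `file`
n_draws <- 5      # hypothetical number of successive samples
size <- 100       # hypothetical sample size
draws <- vector("list", n_draws)

for (k in seq_len(n_draws)) {
  ind <- sample(seq_along(pool), size)  # random positions in the current pool
  draws[[k]] <- pool[ind]               # record this sample
  pool <- pool[-ind]                    # drop drawn elements before the next round
}

length(pool)  # 9500
```

Sampling positions with seq_along rather than sampling values directly sidesteps the special case where sample(x, ...) treats a single number x as 1:x.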

I think using sample and split could be a nice way of doing this without having to alter your file variable. I'm not a big fan of mutation unless you really need it, and this lets you know exactly which files you used for each chunk of the analysis going forward.

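files <- paste("file", 1:100, sep = "_")
randfiles <- sample(files, 50)
# Five chunks of 10 files each; the grouping vector is spelled out per element
# instead of relying on recycling of seq(1, length(randfiles), by = 10).
randfiles_chunks <- split(randfiles, ceiling(seq_along(randfiles) / 10))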
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow