Question

I'm working on a dataset that consists of ~10^6 values which clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering, but keeping bin size constant. As a toy example (in pseudocode), this would look something like this:

data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
    rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}

So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".

I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
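For reference, such a function can be put together in a few lines. This is a hypothetical sketch — the name "partition.sample" and its signature are invented here to match the pseudocode above — that permutes the input vector and splits it along a grouping vector built from the bin sizes:

```r
# Hypothetical sketch of the requested helper; the name
# "partition.sample" and its signature mirror the pseudocode above.
# sample(x) with no further arguments returns a permutation of x,
# i.e. it already samples without replacement.
partition.sample <- function(x, partitions) {
  shuffled <- sample(x)
  # Repeat each bin's index "size" times, then split the permuted
  # values along that grouping vector.
  split(shuffled, rep(seq_along(partitions), times = unlist(partitions)))
}

sizes <- list(4, 4, 1, 3, 3)
rand.data <- partition.sample(1:15, partitions = sizes)
sapply(rand.data, length)  # 4 4 1 3 3
```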

Your help & time are very much appreciated! :-)

Best,

Lymond

UPDATE:

By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.

Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.

Thanks in advance, your help is very much appreciated!


Solution

Revised: This should be fairly efficient. Its complexity lies primarily in the permutation step:

# A single step:
x <- sample(unlist(data))
list(one = x[1:4], two = x[5:8], three = x[9], four = x[10:12], five = x[13:15])

As mentioned above, "no.of.randomizations" may be the number of repeated applications of this process, in which case you may want to wrap replicate around it:

replic <- replicate(n = 4, {
    x <- sample(unlist(data))
    list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
}, simplify = FALSE)  # simplify = FALSE keeps the result as a list of lists
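A compact alternative (a sketch, not part of the original answer) is utils::relist(), which rebuilds the list structure of "data" (the "skeleton") around a permuted flat vector, so the index ranges never need to be hard-coded:

```r
# Sketch using utils::relist(): the original list serves as a skeleton
# whose shape is re-imposed on the permuted flat vector, so the bin
# boundaries come from "data" itself rather than hard-coded indices.
data <- list(c(1, 5, 6, 3), c(2, 4, 7, 8), c(9), c(10, 11, 15), c(12, 13, 14))
rand.data <- relist(sample(unlist(data)), skeleton = data)
lengths(rand.data)  # bin sizes are preserved: 4 4 1 3 3
```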

OTHER TIPS

After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.

In principle, I can generate one long vector containing a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.

It becomes clearer when viewed as code:

data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);

So far, everything as above

names <- c("set1", "set2", "set3", "set4", "set5");

In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)

names <- seq(1, length(data));

This "names" vector can then be expanded by "sizes" using rep:

cut.by <- rep(names, times = unlist(sizes));
 [1] 1 1 1 1 2 2 2 2 3 4 4 4 5 5 5

(The output shown corresponds to the numeric "names" variant. Note that rep() requires a plain vector for its "times" argument, hence the unlist(): "sizes" was built with lapply() and is therefore a list.)

This new vector "cut.by" can then be provided as an argument to split():

rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1]  8  9 14  4
$`2`
[1] 10  2 15 13
$`3`
[1] 12
$`4`
[1] 11  3  5
$`5`
[1] 7 6 1

This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".

However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
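One mitigation, sketched here under the assumption that the bin sizes stay fixed across randomizations: the grouping vector depends only on "sizes", so it can be built once, outside the loop, and reused. The count "no.of.randomizations" below is a placeholder value.

```r
# Sketch: precompute the grouping factor once and reuse it in every
# randomization, since it depends only on the (fixed) bin sizes.
data <- list(c(1, 5, 6, 3), c(2, 4, 7, 8), c(9), c(10, 11, 15), c(12, 13, 14))
pool <- unlist(data)
# Converting to a factor up front spares split() from recomputing it
# on every iteration.
cut.by <- factor(rep(seq_along(data), times = lengths(data)))
no.of.randomizations <- 100  # placeholder count
for (rand in seq_len(no.of.randomizations)) {
  rand.data <- split(sample(pool), cut.by)
  # ... the rest of the per-randomization analysis would go here ...
}
```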

Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow