Algoritmo para muestreo sin reemplazo?

https://stackoverflow.com/questions/311703

10-07-2019
|

Pregunta

Estoy tratando de probar la probabilidad de que un grupo particular de datos haya ocurrido por casualidad. Una forma sólida de hacer esto es la simulación de Monte Carlo, en la que las asociaciones entre datos y grupos se reasignan aleatoriamente una gran cantidad de veces (por ejemplo, 10,000), y se utiliza una métrica de agrupamiento para comparar los datos reales con las simulaciones para determinar una p valor.

Tengo la mayor parte de esto funcionando, con punteros que asignan la agrupación a los elementos de datos, así que planeo reasignar al azar punteros a datos. LA PREGUNTA: ¿cuál es una forma rápida de muestrear sin reemplazo, para que cada puntero se reasigne aleatoriamente en los conjuntos de datos replicados?

Por ejemplo (estos datos son solo un ejemplo simplificado):

Datos (n = 12 valores) - Grupo A: 0.1, 0.2, 0.4 / Grupo B: 0.5, 0.6, 0.8 / Grupo C: 0.4, 0.5 / Grupo D: 0.2, 0.2, 0.3, 0.5

Para cada conjunto de datos replicado, tendría los mismos tamaños de clúster (A = 3, B = 3, C = 2, D = 4) y valores de datos, pero reasignaría los valores a los clústeres.

Para hacer esto, podría generar números aleatorios en el rango 1-12, asignar el primer elemento del grupo A, luego generar números aleatorios en el rango 1-11 y asignar el segundo elemento en el grupo A, y así sucesivamente. La reasignación del puntero es rápida y habré asignado previamente todas las estructuras de datos, pero el muestreo sin reemplazo parece un problema que podría haberse resuelto muchas veces antes.

Lógica o pseudocódigo preferido.

Solución

Vea mi respuesta a esta pregunta Números aleatorios únicos (no repetitivos) en O ( 1)? . La misma lógica debería lograr lo que está buscando hacer.

Otros consejos

Aquí hay un código para el muestreo sin reemplazo basado en el Algoritmo 3.4.2S del libro de Knuth Algoritmos Seminuméricos.

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

Hay un método más eficiente pero más complejo por Jeffrey Scott Vitter en "Algoritmo eficiente para muestreo aleatorio secuencial". Transacciones de ACM en software matemático, 13 (1), marzo de 1987, 58-67.

Un código de trabajo C ++ basado en la respuesta de John D. Cook .

#include <random>
#include <vector>

double GetUniform()
{
    static std::default_random_engine re;
    static std::uniform_real_distribution<double> Dist(0,1);
    return Dist(re);
}

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}

Inspirado por la respuesta de @John D. Cook , escribí una implementación en Nim. Al principio tuve dificultades para entender cómo funciona, así que comenté extensamente y también incluí un ejemplo. Quizás ayude entender la idea. Además, he cambiado ligeramente los nombres de las variables.

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i

Cuando el tamaño de la población es mucho mayor que el tamaño de la muestra, los algoritmos anteriores se vuelven ineficientes, ya que tienen complejidad O ( n ), n ser el tamaño de la población.

Cuando era estudiante escribí algunos algoritmos para el muestreo uniforme sin reemplazo, que tienen una complejidad promedio O ( s log s ), donde s es el tamaño de la muestra. Aquí está el código para el algoritmo de árbol binario, con una complejidad promedio O ( s log s ), en R:

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is poiting to; must be filled with zeros
  rightv=integer(s) #right edge is pointig to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #currend ordinal number
    v=1L #current vertice
    from=1L #how have come here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

La complejidad de este algoritmo se discute en: Rouzankin, P. S .; Voytishek, A. V. Sobre el costo de los algoritmos para la selección aleatoria. Métodos Monte Carlo Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

Si encuentra útil el algoritmo, haga una referencia.

Ver también: P. Gupta, G. P. Bhattacharjee. (1984) Un algoritmo eficiente para muestreo aleatorio sin reemplazo. International Journal of Computer Mathematics 16: 4, páginas 201-209. DOI: 10.1080 / 00207168408803438

Teuhola, J. y Nevalainen, O. 1982. Dos algoritmos eficientes para el muestreo aleatorio sin reemplazo. / IJCM /, 11 (2): 127–140. DOI: 10.1080 / 00207168208803304

En el último artículo, los autores usan tablas hash y afirman que sus algoritmos tienen una complejidad O ( s ). Hay un algoritmo más rápido de tabla hash, que pronto se implementará en pqR (R bastante rápido): https://stat.ethz.ch/pipermail/r- devel / 2017-octubre / 075012.html

Se describe otro algoritmo para el muestreo sin reemplazo aquí .

Es similar al descrito por John D. Cook en su respuesta y también de Knuth, pero tiene una hipótesis diferente: el tamaño de la población es desconocido, pero la muestra puede caber en la memoria. Este se llama " algoritmo de Knuth S " ;.

Citando el artículo de rosettacode:

Seleccione los primeros n elementos como muestra a medida que estén disponibles;

Para el ítem i-ésimo donde i > n, tiene una posibilidad aleatoria de n / i de mantenerlo. Si falla esta oportunidad, la muestra sigue siendo la misma. Si   no, que sea aleatorio (1 / n) reemplace uno de los n seleccionados previamente   elementos de la muestra.

Repita # 2 para cualquier elemento posterior.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow