如何有效地生成 0 到上限 N 之间的 K 个非重复整数的列表 [重复]

https://stackoverflow.com/questions/158716

03-07-2019
|

题

这个问题在这里已经有答案了：

O(1) 中的唯一（非重复）随机数？ 21 个答案

该问题提供了所有必要的数据：什么是生成序列的有效算法 K 给定区间内不重复的整数 [0,N-1]. 。如果以下情况，那么简单的算法（生成随机数，然后在将它们添加到序列中之前查找它们以查看它们是否已经存在）非常昂贵 K 很大并且足够接近氮.

中提供的算法从链表中有效地选择一组随机元素似乎比必要的更复杂，并且需要一些实现。我刚刚发现另一种算法似乎可以很好地完成工作，只要您一次性知道所有相关参数即可。

解决方案

这随机模块来自 Python 库的代码使其变得非常简单和有效：

from random import sample
print sample(xrange(N), K)

sample 函数返回从给定序列中选择的 K 个唯一元素的列表。
xrange 是一个“列表模拟器”，即它的行为就像一个连续数字列表，而不在内存中创建它，这使得它对于像这样的任务来说非常快。

其他提示

在 计算机编程艺术，第 2 卷：半数值算法，第三版, ，Knuth 描述了以下选择采样算法：

算法 S（选择采样技术）。从一组 N 中随机选择 n 条记录，其中 0 < n ≤ N。

S1。[初始化] 设定 t ← 0，m ← 0。（在此算法中，m 表示到目前为止选择的记录数，t 是我们处理过的输入记录总数。）

S2。[生成 U.] 生成一个随机数 U，均匀分布在 0 到 1 之间。

S3。[测试]如果(N-t)U≥n-m，则进入步骤S5。

S4。[选择] 选择样本的下一条记录，并将 m 和 t 加 1。若m<n，则转步骤S2；否则样本完成并且算法终止。

S5。[跳过] 跳过下一条记录（不将其包含在样本中），将 t 加 1，然后返回步骤 S2。

实现可能比描述更容易理解。下面是一个 Common Lisp 实现，它从列表中随机选择 n 个成员：

(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))

这是一个不使用递归的实现，并且适用于所有类型的序列：

(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u) 
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))

实际上可以在与所选元素数量成比例的空间中执行此操作，而不是与您从中选择的集合的大小成比例，无论您选择的总集合的比例是多少。您可以通过生成随机排列，然后从中进行选择来实现此目的，如下所示：

选择一个分组密码，例如茶或XTEA。使用异或折叠将块大小减小到比您选择的集合大的 2 的最小幂。使用随机种子作为密码的密钥。要生成排列中的元素 n，请使用密码对 n 进行加密。如果输出数字不在您的集合中，请对其进行加密。重复直到该数字位于集合内。平均而言，对于每个生成的数字，您需要执行的加密次数少于两次。这样做的另一个好处是，如果您的种子是加密安全的，那么您的整个排列也是安全的。

我对此写得更详细这里.

下面的代码（C语言，来源不明）似乎很好地解决了这个问题：

 /* generate N sorted, non-duplicate integers in [0, max[ */
 int *generate(int n, int max) {
    int i, m, a;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    m = 0;
    for (i=0; i<max; i++) {
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m ++;
        }
    }
    return g;
 }

有谁知道我在哪里可以找到更多像这样的宝石？

生成一个数组 0...N-1 填充 a[i] = i.

然后将第一个打乱 K 项目。

洗牌：

开始 J = N-1
选择一个随机数 0...J （说， R)
交换 a[R] 和 a[J]
- 自从 R 可以等于 J, 元素可以与自身交换
减去 1 从 J 并重复。

最后，采取 K 最后一个元素。

这本质上是从列表中选取一个随机元素，将其移出，然后从剩余列表中选取一个随机元素，依此类推。

工作于 好的） 和在）时间，需要在）贮存。

洗牌部分称为 费舍尔-耶茨洗牌 或者 高德纳洗牌, ，在第二卷中描述 计算机编程的艺术。

通过将 K 个数字存储在哈希存储中来加速简单算法。在开始之前了解 K 可以消除插入哈希映射的所有低效率，并且您仍然可以获得快速查找的好处。

我的解决方案是面向 C++ 的，但我确信它可以翻译成其他语言，因为它非常简单。

首先，生成一个包含 K 个元素的链表，从 0 到 K
然后只要列表不为空，就生成一个介于 0 和向量大小之间的随机数
取出该元素，将其推入另一个向量，然后将其从原始列表中删除

该解决方案仅涉及两次循环迭代，并且没有哈希表查找或任何类似的事情。所以在实际代码中：

// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;

for(int i = 0; i < K; ++i) {
    sorted_list.push_back(i);
}

// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(!sorted_list.size() > 1) {
    int rand_index = rand() % sorted_list.size();
    random_list.push_back(sorted_list.at(rand_index));
    sorted_list.erase(sorted_list.begin() + rand_index);
}                 

// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
    random_list.push_back(sorted_list.at(0));
}

步骤1：生成整数列表。
第2步：履行高德纳洗牌.

请注意，您不需要对整个列表进行洗牌，因为 Knuth Shuffle 算法允许您仅应用 n 次洗牌，其中 n 是要返回的元素数。生成列表仍然需要与列表大小成正比的时间，但您可以重复使用现有列表来满足将来的任何洗牌需求（假设大小保持不变），而无需在重新启动洗牌算法之前对部分洗牌的列表进行预洗牌。

Knuth Shuffle 的基本算法是从整数列表开始。然后，将第一个整数与列表中的任意数字交换并返回当前（新的）第一个整数。然后，将第二个整数与列表中的任何数字（第一个除外）交换，并返回当前（新的）第二个整数。然后……等等……

这是一个极其简单的算法，但要小心在执行交换时将当前项目包含在列表中，否则会破坏算法。

水库采样版本非常简单：

my $N = 20;
my $k;
my @r;

while(<>) {
  if(++$k <= $N) {
    push @r, $_;
  } elsif(rand(1) <= ($N/$k)) {
    $r[rand(@r)] = $_;
  }
}

print @r;

这是从 STDIN 中随机选择的 $N 行。如果您不使用文件中的行，请将 <>/$_ 内容替换为其他内容，但这是一个非常简单的算法。

如果列表是排序的，例如想要从N中提取K个元素，但不关心它们的相对顺序，论文提出了一种高效的算法 一种高效的顺序随机采样算法 （杰弗里·斯科特·维特， ACM 数学软件汇刊, ，卷。13、没有。1，1987 年 3 月，第 56-67 页。）。

已编辑 使用 boost 添加 C++ 代码。我刚刚打字，可能有很多错误。随机数来自 boost 库，带有一个愚蠢的种子，所以不要对此做任何严肃的事情。

/* Sampling according to [Vitter87].
 * 
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter, 
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on MAthematical Software, 13 (1), 58 (1987).
 */

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>

#include <iomanip>

#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

    // Define a random number generator and initialize it with a reproducible
    // seed.
    // (The seed is unsigned, otherwise the wrong overload may be selected
    // when using mt19937 as the base_generator_type.)
    base_generator_type generator(0xBB84u);
    //TODO : change the seed above !
    // Defines the suitable uniform ditribution.
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);



void SequentialSamplesMethodA(int K, int N) 
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method A.
    {
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;

    while (K>=2)
        {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
            {
            S++; top--; Nreal--;
            quot *= top/Nreal;
            }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--;curr++;
        }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
    }

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method D. 
    {
    const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
    //optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
        {
        Kmin1inv=1./(Kreal-1.);
        while(1)
            {//Step D2: generate X and U
            while(1)
                {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
                }
            U=uni();
            negSreal=-S;
            //step D3: Accept ?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
            //step D4 Accept ?
            y2=0; top=Nreal-1.;
            if (K-1 > S)
                {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1;t>=limit;t--)
                {y2*=top/bottom;top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
                {//Accept !
                Vprime=exp(log(uni())*Kmin1inv);
                break;
                }
            Vprime=exp(log(uni())*Kmin1inv);
            }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
        }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        }
    }


int main(void)
    {
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest,Ntest);
    return 0;
    }

$ time ./sampling|tail

在我的笔记本电脑上给出以下输出

99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745

real    0m0.075s
user    0m0.060s
sys 0m0.000s

这段 Ruby 代码展示了油藏采样，算法 R 方法。在每个周期中，我选择 n=5 唯一的随机整数 [0,N=10) 范围：

t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
 t=0
 m=0
 s=0
 while m<n do

  u=rand()
  if (N-t)*u>=n-m then
   t=t+1
  else 
   distrib[s]+=1
   m=m+1
   t=t+1
  end #if
  s=s+1
 end #while
 if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib

输出：

100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034

0-9 之间的所有整数都以几乎相同的概率被选择。

本质上是高德纳算法应用于任意序列（事实上，该答案有一个 LISP 版本）。算法是在）及时并且可以 复杂度(1) 如果序列流式传输到内存中，如下所示 @MichaelCramer 的回答.

这是一种无需额外存储即可在 O(N) 中完成此操作的方法。我很确定这不是纯粹的随机分布，但对于许多用途来说它可能足够接近。

/* generate N sorted, non-duplicate integers in [0, max[  in O(N))*/
 int *generate(int n, int max) {
    float step,a,v=0;
    int i;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    for (i=0; i<n; i++) {
        step = (max-v)/(float)(n-i);
        v+ = floating_pt_random_in_between(0.0, step*2.0);
        if ((int)v == g[i-1]){
          v=(int)v+1;             //avoid collisions
        }
        g[i]=v;
    }
    while (g[i]>max) {
      g[i]=max;                   //fix up overflow
      max=g[i--]-1;
    }
    return g;
 }

这是 Perl 代码。Grep 是一个过滤器，和往常一样我没有测试这段代码。

@list = grep ($_ % I) == 0, (0..N);

I = 间隔
N = 上限

仅通过模运算符获取与您的间隔匹配的数字。

@list = grep ($_ % 3) == 0, (0..30);

将返回 0, 3, 6, ...30

这是伪 Perl 代码。您可能需要调整它才能编译。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow