質問

I have a huge linked list of integers (let's say it has size N, but N is unknown to me) and want to get k random values from it in minimum possible time/space.

I think it must be possible to write a variation of inside-out Fisher-Yates shuffle that will solve this problem in O(N) time and O(k) additional space.

Can anyone help me to get statistically correct solution with specified time/space bounds?

I think my current code is close to the correct solution:

public class Node
{
    public int Data;

    public Node Next;

    // O(N) time, O(k) additional space
    public int[] GetRandomData(int k)
    {
        var a = new int[k];
        var rand = new Random();

        int n = 0;
        Node cur = this;
        while (cur != null)
        {
            int r = rand.Next(0, n);

            if (r < k)
            {
                a[r] = cur.Data;
            }

            cur = cur.Next;          
        }

        if (n < k) throw new ArgumentException("k is bigger than N");        
        return a;
    }
}
役に立ちましたか?

解決

This returns a uniformly distributed random sample of k items from a sequence of unknown length. The algorithm is called reservoir sampling.

def sample(seq, k):
    seq = iter(seq)
    result = [seq.next() for _ in xrange(k)]
    for i, s in enumerate(seq):
        r = random.randrange(i + k + 1)
        if r < k: result[r] = s
    return result

他のヒント

I ended up with this version (C#) and I'm pretty sure that it's correct. Tell me if I'm wrong.

public class Node
{
    public int Data;

    public Node Next;

    // O(N) time, O(k) additional space
    public int[] GetRandomData(int k)
    {
        var a = new int[k];
        var rand = new Random();

        a[0] = this.Data;
        int i = 1;

        for (Node cur = this.Next; cur != null; cur = cur.Next, i = i + 1)
        {
            int r = rand.Next(0, i + 1);

            if (r < k)
            {
                if (i < k)
                {
                    a[i] = a[r];
                }

                a[r] = cur.Data;
            }
        }

        if (i < k) throw new ArgumentException("k is bigger than N");
        return a;
    }
}

UPD: ok, so my code is identical to what's written here in wikipedia, so it must be statistically correct.

This is the implementation of sattolo algorithm

from random import randrange  
def sattoloCycle(items):
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return
ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top