Question

First of all, I do know about the Fisher-Yates shuffle. But let's say, for argument's sake, that I want to allow the user to pick a sort option from a dropdown list. This list would include a "Random" option. Based on their selection, I just want to substitute in an IComparer instance for my sort. What would that IComparer look like?

Google brings up a plethora of flawed results that all take this form:

public class NaiveRandomizer<T> : IComparer<T>
{
    private static Random rand = new Random();

    public int Compare(T x, T y)
    {
        return (x.Equals(y))?0:rand.Next(-1, 2);
    }
}

However, that implementation is biased and will even throw an exception in some circumstances. The bias can be demonstrated with the following code:

void Test()
{
    Console.WriteLine("NaiveRandomizer Test:");
    var data = new List<int>() {1,2,3};
    var sortCounts = new Dictionary<string, int>(6);
    var randomly = new NaiveRandomizer<int>();

    for (int i=0;i<10000;i++)
    {   //always start with same list, in _the same order_.
        var dataCopy = new List<int>(data); 
        dataCopy.Sort(randomly);

        var key = WriteList(dataCopy);
        if (sortCounts.ContainsKey(key))
            sortCounts[key]++;
        else
            sortCounts.Add(key, 1);
    }

    foreach (KeyValuePair<string, int> item in sortCounts)
        Console.WriteLine(item.Key + "\t" + item.Value);
}

string WriteList<T>(List<T> list)
{
   string delim = "";
   string result = "";
   foreach(T item in list)
   {
       result += delim + item.ToString();
       delim = ", ";
   }
   return result;
}

So how could you implement a random IComparer<T> that solved those issues? It is allowed to require each call to .Sort() to use a separate IComparer instance, as I don't see any other way to do this: items must be compared using some other, truly random value, but that value must also be consistent for an item within a given sort operation.
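
For concreteness, something shaped like this is what I have in mind (a rough sketch only, with a hypothetical name; it needs a fresh instance per sort and still suffers from the key-collision bias discussed in the answers below): lazily assign each distinct item a random key the first time it is seen, and compare by key.

public class RandomKeyComparer<T> : IComparer<T>
{
    private readonly Random rand = new Random();
    private readonly Dictionary<T, double> keys = new Dictionary<T, double>();

    //assign a random key on first sight; reuse it for the rest of the sort
    private double KeyFor(T item)
    {
        double key;
        if (!keys.TryGetValue(item, out key))
        {
            key = rand.NextDouble();
            keys.Add(item, key);
        }
        return key;
    }

    public int Compare(T x, T y)
    {
        //equal items share a key, so this correctly returns 0 for them
        return KeyFor(x).CompareTo(KeyFor(y));
    }
}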

I have a start here, but it was posted in haste, is extremely slow, and doesn't even return all possible sorts (testing shows that it does at least eliminate bias, if you don't count the missing options). I don't expect O(n) performance like Fisher-Yates, but I do want something reasonable (n log n for a small-ish n), and I do expect it to show all possible sorts. Unfortunately, that link is the current accepted answer for its question, and so I'm hoping to replace it with something a little better.

If nothing else, I want this question to be a magnet for all those Google queries looking for an IComparable solution, so that they'll end up here instead of somewhere else telling them to use the incorrect version.


Solution 2

One suggestion I got elsewhere was to create a separate IArranger interface that describes a single operation to Arrange a collection. This can work where IComparer/IComparable cannot because it operates on an entire collection, instead of individual items. It might look something like this:

public interface IArranger<T>
{
    IEnumerable<T> Arrange(IEnumerable<T> items);
}

Then I could implement a Shuffle from the IArranger interface using a proper Fisher-Yates algorithm, and also have implementations wrapping whichever of the IEnumerable.Sort()/IComparable/IComparer varieties I care about. That might look something like this:

public class ComparerArranger<T> : IArranger<T>
{
    private IComparer<T> comparer;

    public ComparerArranger(IComparer<T> comparer)
    {
        this.comparer = comparer;
    }

    public IEnumerable<T> Arrange(IEnumerable<T> items)
    {
       return items.OrderBy(i => i, comparer);
    }
}

or

//uses the default Comparer for the type (Comparer<T>.Default)
public class TypeArranger<T> : IArranger<T> 
{
    public IEnumerable<T> Arrange(IEnumerable<T> items)
    {
       return items.OrderBy(i => i);
    }
}

or

public class ShuffleArranger<T> : IArranger<T>
{
    //naive implementation for demonstration
    // if I ever develop this more completely I would try to
    // avoid needing to call .ToArray() in here
    // and use a better prng
    private Random r = new Random();

    public IEnumerable<T> Arrange(IEnumerable<T> items)
    {
        var values = items.ToArray();

        //valid Fisher-Yates shuffle on the values array
        for (int i = values.Length; i > 1; i--)
        {
            int j = r.Next(i);
            T tmp = values[j];
            values[j] = values[i - 1];
            values[i - 1] = tmp;
        }
        foreach (var item in values) yield return item;
    }
}

For a final step, I add support for this to any IEnumerable via an extension method. Then you still get the simple run-time algorithm swapping, you have a better implementation of the shuffle algorithm, and the code to use it feels natural:

//declared in a static class, as extension methods require
public static IEnumerable<T> Arrange<T>(this IEnumerable<T> items, IArranger<T> arranger)
{
    return arranger.Arrange(items);
}
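
With that in place, the dropdown scenario from the question might reduce to something like this (a hypothetical sketch; selectedSortOption and data stand in for the UI selection and the real collection):

IArranger<int> arranger;
switch (selectedSortOption)
{
    case "Random":
        arranger = new ShuffleArranger<int>();
        break;
    default:
        arranger = new TypeArranger<int>(); //normal ascending sort
        break;
}

var result = data.Arrange(arranger).ToList();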

OTHER TIPS

I was somewhat surprised in this thread how many wrong answers were posted. Just for the sake of others who come up with a solution similar to the one posted by the OP, the following code looks correct at first glance:

int[] nums = new int[1000];
for (int i = 0; i < nums.Length; i++)
{
    nums[i] = i;
}

Random r = new Random();
Array.Sort<int>(nums, (x, y) => r.Next(-1, 2));

foreach(var num in nums)
{
    Console.Write("{0} ", num);
}

However, the code will throw an exception occasionally, but not always. That's what makes it fun to debug :) If you run it enough times, or execute the sort procedure in a loop 50 or so times, you'll get an error stating:

IComparer (or the IComparable methods it relies upon) did not return zero when Array.Sort called x.CompareTo(x). x: '0' x's type: 'Int32' The IComparer: ''.

In other words, the quicksort compared some number x to itself and got a non-zero result. The obvious fix would be to write:

Array.Sort<int>(nums, (x, y) =>
    {
        if (x == y) return 0;
        else return r.NextDouble() < 0.5 ? 1 : -1;
    });

But even this doesn't work, because there are occasions where .NET compares three numbers against one another and gets inconsistent results, such as A > B, B > C, and C > A (oops!). No matter whether you use a Guid, GetHashCode, or any other randomly generated input, a solution like the one shown above is still wrong.
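
To see how often such a cycle can appear, here is a quick probe (a sketch only; it models each comparison as an independent coin flip, as in the code above):

var r = new Random();
Func<int> flip = () => r.NextDouble() < 0.5 ? 1 : -1;

int cycles = 0, trials = 100000;
for (int t = 0; t < trials; t++)
{
    //a cycle occurs when all three pairwise results point the same way:
    //A > B, B > C, C > A (or all three reversed)
    int ab = flip(), bc = flip(), ca = flip();
    if (ab == bc && bc == ca) cycles++;
}

Console.WriteLine((double)cycles / trials); //prints roughly 0.25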


With that being said, Fisher-Yates is the standard way of shuffling arrays, so there's no real reason to use IComparer in the first place. Fisher-Yates is O(n), whereas any implementation using IComparer uses a quicksort behind the scenes, which has a time complexity of O(n log n). There's just no good reason not to use the well-known, efficient, standard algorithm to solve this kind of problem.

However, if you really insist on using an IComparer and a Random, then apply your random data before you sort. This requires a projection of the data onto another object so you don't lose your random values:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication1
{
    class Pair<T, U>
    {
        public T Item1 { get; private set; }
        public U Item2 { get; private set; }
        public Pair(T item1, U item2)
        {
            this.Item1 = item1;
            this.Item2 = item2;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            Pair<int, double>[] nums = new Pair<int, double>[1000];
            Random r = new Random();
            for (int i = 0; i < nums.Length; i++)
            {
                nums[i] = new Pair<int, double>(i, r.NextDouble());
            }

            Array.Sort<Pair<int, double>>(nums, (x, y) => x.Item2.CompareTo(y.Item2));

            foreach (var item in nums)
            {
                Console.Write("{0} ", item.Item1);
            }

            Console.ReadKey(true);
        }
    }
}

Or get LINQy with your bad self:

Random r = new Random();
var nums = from x in Enumerable.Range(0, 1000)
           orderby r.NextDouble()
           select x;

The fact that IComparer requires a zero return at some point (for equal instances of T) makes it mathematically impossible to create a generic IComparer that statistically mimics a Fisher-Yates shuffle. There will always be a bias. For a real shuffle, you'd never want to force it to return any particular value.

How 'bout sorting based on a hidden field, which is pre-assigned a random value?

To follow up on James Curran's idea: let the IComparer maintain the "sorted" values as a list; if a new value occurs, insert it into the list at a random position; compare by list index (a sketch follows below; it could be optimized by maintaining the list as a balanced tree or something similar). Every instance of such an IComparer will maintain a consistent and random sort order, so you have the choice of letting your random sorting be consistently the same random ordering or a different one each time. A minor modification would even allow equal elements to be "sorted" into different positions, if you prefer to read "random" that way.
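
A minimal sketch of that idea (the name RandomPositionComparer is hypothetical, and the linear list scans make each comparison O(n), so this is for illustration rather than performance):

public class RandomPositionComparer<T> : IComparer<T>
{
    private readonly Random rand = new Random();
    private readonly List<T> positions = new List<T>();

    //insert unseen values at a random position in the hidden list
    private void EnsurePresent(T item)
    {
        if (!positions.Contains(item))
            positions.Insert(rand.Next(positions.Count + 1), item);
    }

    public int Compare(T x, T y)
    {
        //insert both before looking up either: adding y could shift x's index
        EnsurePresent(x);
        EnsurePresent(y);
        return positions.IndexOf(x).CompareTo(positions.IndexOf(y));
    }
}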

An interesting endeavor. Most likely a misuse/abuse of IComparer.

You're attempting to do a random weighted sort by using a mechanism that wasn't built for that purpose.

Why not implement your own sorting routine and your own comparer? I have a feeling that even that would be insufficient.

Don't do it.

All of the algorithms proposed thus far introduce some sort of bias into the output (some bigger than others).

@Princess and @Luke propose storing a random number alongside the data. However, because there is a possibility that any two of these random numbers could have the same value, the sort order between those two items will be deterministically biased.

The worst case for this would be if the sorting routine is "stable" (that is, objects that are considered equal are always output in the same order they were input in). Array.Sort doesn't happen to be stable (it uses QuickSort internally), but there is still a bias that occurs whenever two items have the same value, depending on where they are in the input (and specifically where they are relative to the QuickSort's pivot).

As the keyspace for this random number increases, the probability of a collision goes down (with a good source of randomness), but keep in mind that as the number of values you are sorting goes up, the birthday paradox dictates that the likelihood of at least one pair amongst them colliding goes up very quickly.

For an integer key, there are 2^32 unique values, and even assuming a perfectly even distribution of random values, with 75,000 rows there is about a 50% probability that there will be a collision (see the Wikipedia article on the birthday problem).
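
That 50% figure is easy to sanity-check with the standard birthday-problem approximation, P ≈ 1 - e^(-n²/2k) for n draws from k possible values:

double n = 75000.0;          //rows being sorted
double k = Math.Pow(2, 32);  //distinct int keys
double p = 1 - Math.Exp(-n * n / (2 * k));
Console.WriteLine(p);        //prints roughly 0.48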

The cryptographic hash approach that you proposed potentially has a large enough keyspace (160 bits) to make the chance of a collision negligible, but your algorithm decomposes all of that randomness back down to a single int before actually doing the compare, which negates the benefit of the larger keyspace.

Your best approach is to associate a distinct "sortOrder" value with each of your data items, shuffle these values using a proven algorithm, and then order the results by that value.

If you are using Array.Sort, there is an overload that takes an array of "keys" and an array of "values". The keys array is sorted normally, but whenever a value in the keys array is moved, the corresponding entry in the values array is also moved.

Something like:


Something[] data; //populated somewhere
int[] keys = new int[data.Length]; //or long if you might have lots of data
for (int i = 0; i < keys.Length; ++i)
{
    keys[i] = i;
}

Shuffle(keys);

Array.Sort(keys, data);
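
The Shuffle method above isn't spelled out; a minimal Fisher-Yates implementation for it (a sketch; in real code you'd likely reuse a single shared Random instance) might be:

static void Shuffle(int[] keys)
{
    var rand = new Random();
    for (int i = keys.Length - 1; i > 0; --i)
    {
        //pick j uniformly from 0..i and swap
        int j = rand.Next(i + 1);
        int tmp = keys[i];
        keys[i] = keys[j];
        keys[j] = tmp;
    }
}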