Question

CHECK UPDATE 3 below I found out the issue I ran into is related to a known serious problem with c# string comparers for .Net 4.0, 4.0 client and 4.5, that will lead to inconsistent sort order of lists of strings (causing the output to depend on the order in the input and the sort algorithm used). The problem was reported to Microsoft in Dec 2012 and closed as "won't be fixed". A work around is available, but is so much slower that it is hardly practical for large collections.

While implementing an immutable PatriciaTrie, I wanted to compare its performance with a System.Collections.Generic.SortedList. I used the following file https://github.com/rkapsi/patricia-trie/blob/master/src/test/resources/org/ardverk/collection/hamlet.txt to create an input wordlist for testing.

When inserting each of the words in the c# SortedList, using either Comparer<string>.Default or StringComparer.InvariantCulture as the key comparer, a number of the entries that are successfully inserted cannot be retrieved using the normal search methods (e.g. ContainsKey returns false) but the key is present in the list as is observed by iterating the list.

Even more curious, the comparer returns the value '0' when comparing the key retrieved from the sorted list with the search key that could not be found using ContainsKey.

The complete example below demonstrates this issue on my system.

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        // the problem is possibly related to comparison.
        var fail = true;
        var comparer = fail ? StringComparer.InvariantCulture : StringComparer.Ordinal;

        // read hamlet (contains duplicate words)
        var words = File
            .ReadAllLines("hamlet.txt")
            .SelectMany(l => l.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            .Select(w => w.Trim())
            .Where(w => !string.IsNullOrEmpty(w))
            .Distinct(comparer)
            .ToArray();

        // insert hamlet's words in the sorted list.
        var list = new SortedList<string, int>(comparer);
        var ndx = 0;
        foreach (var word in words)
            list[word] = ndx++;

        // search for each of the added words.
        foreach (var keyToSearch in words)
        {
            if (!list.ContainsKey(keyToSearch))
            {
                // was inserted, but cannot be retrieved.
                Console.WriteLine("Error - Key not found: \"{0}\"", keyToSearch);
                // however when we iterate over the list, we see that the entry is present
                var prefix = keyToSearch.Substring(0, Math.Min(keyToSearch.Length, 3));
                foreach (var wordCloseToSearchKey in list.Keys.Where(s => s.StartsWith(prefix)))
                {
                    // and using the SortedList's supplied comparison returns 0, signaling equality
                    var comparisonResult = list.Comparer.Compare(wordCloseToSearchKey, keyToSearch);
                    Console.WriteLine("{0} - comparison result = {1}", wordCloseToSearchKey, comparisonResult);
                }
            }
        }

        // Check that sort order of List.Keys is correct 
        var keys = list.Keys.ToArray();
        BinarySearchAll("list.Keys", keys, list.Comparer);
        CheckCorrectSortOrder("list.Keys", keys, list.Comparer);

        // Check that sort order of Array.Sort(List.Keys) is correct 
        var arraySortedKeys = CopySortSearchAndCheck("Array.Sort(List.Keys)", keys, list.Comparer);

        // Check that sort order of the Array.Sort(input words) is correct 
        var sortedInput = CopySortSearchAndCheck("Array.Sort(input words)", words, list.Comparer);

        Console.ReadLine();
    }

    static string[] CopySortSearchAndCheck(string arrayDesc, string[] input, IComparer<string> comparer)
    {
        // copy input
        var sortedInput = new string[input.Length];
        Array.Copy(input, sortedInput, sortedInput.Length);

        // sort it
        Array.Sort(sortedInput, comparer);

        // check that we can actually find the keys in the array using bin. search
        BinarySearchAll(arrayDesc, sortedInput, comparer);

        // check that sort order is correct
        CheckCorrectSortOrder(arrayDesc, sortedInput, comparer);

        return sortedInput;
    }

    static void BinarySearchAll(string arrayDesc, string[] sortedInput, IComparer<string> comparer)
    {
        // check that each key in the input can be found using bin. search
        foreach (var word in sortedInput)
        {
            var ix = Array.BinarySearch(sortedInput, word, comparer);
            if (ix < 0)
                // and it appears it cannot!
                Console.WriteLine("Error - {0} - Key not found: \"{1}\"", arrayDesc, word);
        }
    }

    static void CheckCorrectSortOrder(string arrayDesc, string[] sortedKeys, IComparer<string> comparer)
    {
        for (int n = 0; n < sortedKeys.Length; n++)
        {
            for (int up = n + 1; up < sortedKeys.Length; up++)
            {
                var cmp = comparer.Compare(sortedKeys[n], sortedKeys[up]);
                if (cmp >= 0)
                {
                    Console.WriteLine(
                        "{0}[{1}] = \"{2}\" not < than {0}[{3}] = \"{4}\"  - cmp = {5}",
                        arrayDesc, n, sortedKeys[n], up, sortedKeys[up], cmp);
                }
            }
            for (int down = n - 1; down > 0; down--)
            {
                var cmp = comparer.Compare(sortedKeys[n], sortedKeys[down]);
                if (cmp <= 0)
                {
                    Console.WriteLine(
                        "{0}[{1}] = \"{2}\" not > than {0}[{3}] = \"{4}\"  - cmp = {5}",
                        arrayDesc, n, sortedKeys[n], down, sortedKeys[down], cmp);
                }
            }
        }
    }
}

Does anyone have an explanation for this unexpected and odd behaviour?

When changing the comparer used by the SortedList to StringComparer.Ordinal (by e.g. changing fail to false in the above example) the problem disappears, which seems to point to a comparison issue, but I don't quite understand why.

UPDATED As noted by Sébastien the issue described here does not show up on .Net 3.5 and 3.5 client profiles. It does on .Net 4.0, 4.0 client and 4.5.

After some more digging, I noticed that if I take the sorted keys from the list and I run Array.BinarySearch on those keys, it also returns negative (not found) values for the same keys that are not found using SortedList.ContainsKey. Thus this would suggest that the sort order of the keys is not correct.

If I take the already sorted keys from the list and sort them using Array.Sort the sort order of the output is different for the keys that were problematic.

So I added a function to check (using the list's comparer) if the sort order of a given array is correct (i.e. a preceding key is always smaller, a succeeding key is always larger) and restricted the input to words that are distinct according to the comparer. I applied this function on 3 different inputs (all using the same comparer):

  1. the SortedList's Keys collection.
  2. the output of Array.Sort on these keys.
  3. the output of Array.Sort on the input from the file.

The output of (2) and (3) is identical and different from (1). However performing Array.BinarySearch on the Array.Sort outputs of (2) and (3) again fails to find the same keys (returning < 0). Also the function that checks for correct sort order indicates that for all 3 cases, the sort order around the involved problematic keys is not correct.

At this point I am just hoping I did something incredibly stupid and there is a simple explanation. Hopefully someone can point that out to me.

The example code is updated with my additional troubleshooting experiments, a screenshot of the output can be found here http://imgur.com/DU8SCsA.

UPDATE 2 Ok, I have narrowed the problem down to what seems to me like a very serious problem with c# string comparers introduced as of .Net 4.0.

In summary, suppose we have 3 values: a1, a2 and a3. For any kind of sorting to work correctly, we would expect that if a1 < a2 and a2 < a3 that in order for comparison to be consistent, as a consequence the following also holds: a1 < a3.

This is however not the case with the c# string comparers (at least for Comparer<string>.Default and StringComparer.InvariantCulture).

The little program below illustrates this exact problem:

class Program
{
    static void Main(string[] args)
    {
        var comparer = StringComparer.InvariantCulture;
        var a1 = "A";
        var a2 = "a\'";
        var a3 = "\'a";
        PrintComparison("a1", a1, "a2", a2, comparer);
        PrintComparison("a2", a2, "a3", a3, comparer);
        PrintComparison("a1", a1, "a3", a3, comparer);
        Console.ReadLine();
    }
    public static void PrintComparison(string firstSymbol, string first, string secondSymbol, string second, IComparer<string> comparer)
    {
        var cmp = comparer.Compare(first, second);
        var result = cmp == 0 ? "=" : cmp < 0 ? "<" : ">";
        Console.WriteLine("{0} {1} {2}   ({3} {1} {4})", firstSymbol, result, secondSymbol, first, second);
    }
}

This is its output:

a1 < a2   (A < a')
a2 < a3   (a' < 'a)
a1 > a3   (A > 'a)

The conclusion would seem to be that it is not safe to rely on sort order determined using c# string comperators, or am I missing something?

UPDATE 3 This issue seems to have been reported to MS in December 2012, and is closed with status "won't be fixed", which is rather disappointing; refer to link posted in comments below (it appears I can't post in here due to my limited reputation points). This also lists a workaround, that I have implemented and used to verify that this indeed resolves the problems observed with the standard comparers.

public class WorkAroundStringComparer : StringComparer
{
    private static readonly Func<CompareInfo, string, CompareOptions, int> _getHashCodeOfString;
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _compareOptions;

    static WorkAroundStringComparer()
    {
        // Need this internal method to compute hashcode
        // as an IEqualityComparer implementation.
        _getHashCodeOfString = BuildGetHashCodeOfStringDelegate();
    }

    static Func<CompareInfo, string, CompareOptions, int> BuildGetHashCodeOfStringDelegate()
    {
        var compareInfoType = typeof(CompareInfo);
        var argTypes = new[] { typeof(string), typeof(CompareOptions) };
        var flags = BindingFlags.NonPublic | BindingFlags.Instance;
        var methods = compareInfoType.GetMethods(flags).ToArray(); ;
        var method = compareInfoType.GetMethod("GetHashCodeOfString", flags, null, argTypes, null);

        var instance = Expression.Parameter(compareInfoType, "instance");
        var stringArg = Expression.Parameter(typeof(string), "string");
        var optionsArg = Expression.Parameter(typeof(CompareOptions), "options");
        var methodCall = Expression.Call(instance, method, stringArg, optionsArg);
        var expr = Expression.Lambda<Func<CompareInfo, string, CompareOptions, int>>(methodCall, instance, stringArg, optionsArg);
        return expr.Compile();
    }

    public WorkAroundStringComparer()
        : this(CultureInfo.InvariantCulture)
    {
    }

    public WorkAroundStringComparer(CultureInfo cultureInfo, CompareOptions compareOptions = CompareOptions.None)
    {
        if (cultureInfo == null)
            throw new ArgumentNullException("cultureInfo");
        this._compareInfo = cultureInfo.CompareInfo;
        this._compareOptions = compareOptions;
    }

    public override int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y))
            return 0;
        if (ReferenceEquals(x, null))
            return -1;
        if (ReferenceEquals(y, null))
            return 1;

        var sortKeyFor_x = _compareInfo.GetSortKey(x, _compareOptions);
        var sortKeyFor_y = _compareInfo.GetSortKey(y, _compareOptions);
        return SortKey.Compare(sortKeyFor_x, sortKeyFor_y);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        return _getHashCodeOfString(_compareInfo, obj, _compareOptions);
    }
}

The problem with this workaround is that it is hardly practical for sizable collections, because it is an order of magnitude slower than e.g. StringComparer.InvariantCulture.

Time taken when sorting the given word list 1000 times using both comparers:

StringComparer.InvariantCulture : 00:00:15.3120013
WorkAroundStringComparer        : 00:01:35.8322409

So I am still hoping that either Microsoft reconsiders or someone knows a viable alternative. Otherwise the only option that remains is falling back on using StringComparer.Ordinal.

No correct solution

OTHER TIPS

Could it be related to .Net Framework 4/4.5? I have adapted your example for .Net 3.5 like this:

var words = ReadFile("hamlet.txt");

//...

private static string[] ReadFile(string path)
{
    List<string> lines = new List<string>();
    using (StreamReader sr = new StreamReader(path))
    {
        string text = sr.ReadToEnd();
        lines.Add(text);
    }

    return lines.SelectMany(l => l.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Select(w => w.Trim()))
        .Where(w => !(w.ToCharArray().All(c => c == ' ')))
        .ToArray();
}

And both comparers work fine on XP using .Net 3.5.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top