Faster way to do a List<T>.Contains()

https://stackoverflow.com/questions/640902

22-07-2019
|

Question

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of EpicGames called it in the old UnrealEd)

// foo and bar have some identical elements (given a case-insensitive match)
List‹string› foo = GetFoo();
List‹string› bar = GetBar();

// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();

Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.

There must be a faster way to do this, because this one is pretty bad-performing with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the one later on in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except() and .Intersect() and/or .Union().

Solution

With intersect it would be done like this:

var matches = ((from f in foo 
                select f)
              .Intersect(
                  from b in bar 
                  select b, StringComparer.InvariantCultureIgnoreCase))

OTHER TIPS

This operation can be called a symmetric difference.

You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.

UPDATE:

I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, from 2 to 10 characters long with the following results:

Original: 79499 ms

Hashset: 33 ms

BTW, there is a method on HashSet called SymmetricExceptWith which I thought would do the work for me, but it actually adds the different elements from both sets to the set the method is called on. Maybe this is what you want, rather than leaving the initial two sets unmodified, and the code would be more elegant.

Here is the code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        var foo = getRandomStrings();
        var bar = getRandomStrings();

        var timer = new Stopwatch();

        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();

        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));

        timer.Reset();

        timer.Start();
        var intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);

        var fSet = new HashSet<String>(foo);
        var bSet = new HashSet<String>(bar);

        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();

        var fCheck = new HashSet<String>(f);
        var bCheck = new HashSet<String>(b);

        Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));

        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck)); //bSet.SetEquals(set));
        Console.ReadKey();
    }

    static Random _rnd = new Random();

    private const int Count = 50000;

    private static List<string> getRandomStrings() 
    {
        var strings = new List<String>(Count);

        var chars = new Char[10];

        for (var i = 0; i < Count; i++)
        {
            var len = _rnd.Next(2, 10);

            for (var j = 0; j < len; j++)
            {
                var c = (Char)_rnd.Next('a', 'z');
                chars[j] = c;
            }

            strings.Add(new String(chars, 0, len));
        }

        return strings;
    }
}

If the elements are unique within each list you should consider using an HashSet

The HashSet(T) class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

With sorted list, you can use binary search.

Contains on a list is an O(N) operation. If you had a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce your time. Accessing a key in a sorted list is usually O(log N) time, and in a hash is usually O(1) time.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow