Question

I have two collections containing the same type of object, and each collection holds approximately 40K objects.

The class of the objects in each collection is basically like a dictionary, except that I've overridden the Equals and GetHashCode methods:

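using System;

public class MyClass : IEquatable<MyClass>
{
    public int ID { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        return obj is MyClass && this.Equals((MyClass)obj);
    }

    public bool Equals(MyClass ot)
    {
        // A null argument is never equal to this instance.
        if (ReferenceEquals(ot, null))
        {
            return false;
        }

        if (ReferenceEquals(this, ot))
        {
            return true;
        }

        return
            ot.ID.Equals(this.ID) &&
            string.Equals(ot.Name, this.Name, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            // Hash Name case-insensitively (and null-safely) so that objects
            // considered equal by Equals always produce the same hash code.
            int nameHash = this.Name == null
                ? 0
                : StringComparer.OrdinalIgnoreCase.GetHashCode(this.Name);

            int result = this.ID.GetHashCode();
            result = (result * 397) ^ nameHash;
            return result;
        }
    }
}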

The code I'm using to compare the collections and get the differences is just a simple LINQ query run through PLINQ:

// Assumes sourceColl and destColl are IEnumerable<MyClass>.
ParallelQuery<MyClass> p1Coll = sourceColl.AsParallel();
ParallelQuery<MyClass> p2Coll = destColl.AsParallel();

List<MyClass> diffs = p2Coll.Where(r => !p1Coll.Any(m => m.Equals(r))).ToList();

Does anybody know of a faster way to compare this many objects? Currently it takes about 40 seconds (±2 seconds) on a quad-core computer. Would grouping the data first and then comparing each group in parallel possibly be faster? If I grouped the data by Name first I would end up with about 490 unique objects, and if I grouped it by ID first I would end up with about 622 unique objects.


Solution

You can use the Except method, which gives you every item from p2Coll that is not in p1Coll.

var diff = p2Coll.Except(p1Coll);
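
Why this tends to be faster: the Where/Any query compares every item of one collection against every item of the other, while Except looks items up by hash code, so it relies on GetHashCode agreeing with Equals. Below is a minimal sketch of that idea on plain IEnumerable sequences rather than PLINQ. The names Item, ItemComparer and DiffSketch are hypothetical stand-ins, not part of the question's code, and the comparer assumes the same equality rule as the question (same ID, Name compared ignoring case).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for MyClass from the question.
public sealed class Item
{
    public int ID { get; set; }
    public string Name { get; set; }
}

// Comparer whose GetHashCode agrees with Equals: both ignore case in Name.
public sealed class ItemComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null)) return false;
        return x.ID == y.ID &&
               string.Equals(x.Name, y.Name, StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(Item obj)
    {
        if (ReferenceEquals(obj, null)) return 0;
        unchecked
        {
            int nameHash = obj.Name == null
                ? 0
                : StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Name);
            return (obj.ID * 397) ^ nameHash;
        }
    }
}

public static class DiffSketch
{
    // Returns the distinct items of dest that have no equal item in source.
    public static List<Item> MissingFromSource(IEnumerable<Item> source, IEnumerable<Item> dest)
    {
        return dest.Except(source, new ItemComparer()).ToList();
    }
}
```

Note that Except has set semantics, so duplicates in the result are collapsed. If the hash code were case-sensitive while Equals ignored case, two items that differ only in the casing of Name could be reported as different.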

UPDATE (some performance testing):

Disclaimer:

Actual time depends on multiple factors (such as the content of the collections, the hardware, what else is running on your machine, the number of hash code collisions, etc.), which is why we have complexity analysis and Big O notation (see Daniel Brückner's comment).

Here are some performance stats for 10 runs on my 4-year-old machine:

Median time for Any(): 6973.97658 ms
Median time for Except(): 9.23025 ms

Source code for my test is available on gist.
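
For readers who want to take a similar measurement themselves, here is a rough sketch of a median-of-N timing harness. It is not the code from the gist: TimingSketch, MedianMilliseconds and Run are hypothetical names, and the data is synthetic integers rather than the objects from the question, so the numbers it prints will not match the ones above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class TimingSketch
{
    // Runs the action `runs` times and returns the median elapsed time in ms.
    public static double MedianMilliseconds(Action action, int runs)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        var samples = new List<double>(runs);
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples.Add(sw.Elapsed.TotalMilliseconds);
        }

        samples.Sort();
        int mid = samples.Count / 2;
        return samples.Count % 2 == 1
            ? samples[mid]
            : (samples[mid - 1] + samples[mid]) / 2.0;
    }

    // Times both approaches on two overlapping ranges of integers.
    public static void Run(int size, int runs)
    {
        List<int> source = Enumerable.Range(0, size).ToList();
        List<int> dest = Enumerable.Range(size / 2, size).ToList();

        double anyMs = MedianMilliseconds(
            () => dest.Where(d => !source.Any(s => s.Equals(d))).ToList(), runs);
        double exceptMs = MedianMilliseconds(
            () => dest.Except(source).ToList(), runs);

        Console.WriteLine("Median time for Any(): " + anyMs + " ms");
        Console.WriteLine("Median time for Except(): " + exceptMs + " ms");
    }
}
```

Calling TimingSketch.Run(40000, 10) would roughly mirror the collection size from the question, but treat the output as indicative only for the reasons given in the disclaimer.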


UPDATE 2:

If you want the items that differ in either collection (those only in the first plus those only in the second), you have to call Except in both directions and then Union the results:

var diff = p2Coll.Except(p1Coll).Union(p1Coll.Except(p2Coll));
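
As an alternative sketch for the same two-way difference, HashSet<T>.SymmetricExceptWith computes it directly. This swaps in a different technique from the Except/Union query above, and SymmetricDiffSketch is a hypothetical helper name. It assumes the element type (or the comparer passed in) implements Equals and GetHashCode consistently.

```csharp
using System;
using System.Collections.Generic;

public static class SymmetricDiffSketch
{
    // Returns the items that occur in exactly one of the two sequences.
    // A null comparer falls back to the type's default equality.
    public static HashSet<T> SymmetricDifference<T>(
        IEnumerable<T> first,
        IEnumerable<T> second,
        IEqualityComparer<T> comparer = null)
    {
        var result = new HashSet<T>(first, comparer ?? EqualityComparer<T>.Default);
        result.SymmetricExceptWith(second);
        return result;
    }
}
```

Like Except and Union, the result has set semantics, so duplicates within either input are collapsed.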

OTHER TIPS

Intersect returns the items that appear in both collections:

int[] id1 = { 44, 26, 92, 30, 71, 38 };
int[] id2 = { 39, 59, 83, 47, 26, 4, 30 };

IEnumerable<int> both = id1.Intersect(id2);

foreach (int id in both)
    Console.WriteLine(id);

/*
This code produces the following output:

26
30
*/
Licensed under: CC-BY-SA with attribution