Question

I have a line like the following in my code:

potentialCollisionsX.Intersect(potentialCollisionsY).Distinct().ToList();

Through profiling, I have determined that it is eating approximately 56 percent of my time. I need to figure out how to provide an efficient implementation. I tried:

        List<Extent> probableCollisions = new List<Extent>();
        for (int j = 0; j < potentialCollisionsX.Count; j++)
        {
            if (potentialCollisionsY.Contains(potentialCollisionsX[j]) && !probableCollisions.Contains(potentialCollisionsX[j]))
            {
                probableCollisions.Add(potentialCollisionsX[j]);
            }
        }

but that only drops it to 42 percent. Optimizations or alternative ideas would be much appreciated.

Edit: Someone requested information about the Extent class, and I can't think of a better way to provide it than the class definition itself.

    private enum ExtentType { Start, End }
    private sealed class Extent
    {
        private ExtentType _type;
        public ExtentType Type
        {
            get
            {
                return _type;
            }
            set
            {
                _type = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }
        private Nucleus _nucleus; //Nucleus is the main body class in my engine
        public Nucleus Nucleus
        {
            get
            {
                return _nucleus;
            }
            set
            {
                _nucleus = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }

        private int _hashcode;

        public Extent(Nucleus nucleus, ExtentType type)
        {
            Nucleus = nucleus;
            Type = type;
            _hashcode = 23;
            _hashcode *= 17 + Nucleus.GetHashCode();
        }

        public override bool Equals(object obj)
        {
            return Equals(obj as Extent);
        }
        public bool Equals(Extent extent)
        {
            if (this.Nucleus == extent.Nucleus) //nucleus.Equals does an int comparison
            {
                return true;
            }
            return false;
        }
        public override int GetHashCode()
        {
            return _hashcode;
        }
    }

Edit 2: It would seem that using hash sets makes this part of my code as performant as I need, so thanks for your help!


Solution

Try this:

HashSet<Extent> result = new HashSet<Extent>();
HashSet<Extent> potentialSetY = new HashSet<Extent>(potentialCollisionsY);
foreach (Extent ex in potentialCollisionsX)
    if (potentialSetY.Contains(ex))
        result.Add(ex);

Hash sets are good at doing Contains quickly, but they don't preserve order.


If you need to preserve order, here's something a little more complicated: An ordered hash set. It uses normal hash set semantics (well, a dictionary, but it's the same thing), but before enumeration it reorders the items according to the insertion order.

// Unchecked code

using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class OrderedHashSet<T> : IEnumerable<T> {
    int currentIndex = 0;
    Dictionary<T, int> items = new Dictionary<T, int>();

    public bool Add(T item) {
        if (Contains(item))
            return false;
        items[item] = currentIndex++;
        return true;
    }

    public bool Contains(T item) {
        return items.ContainsKey(item);
    }

    public IEnumerator<T> GetEnumerator() {
        return items.Keys.OrderBy(key => items[key]).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return GetEnumerator();
    }
}

Now change HashSet to OrderedHashSet in the sample above and it should work. The only wrinkle is that OrderedHashSet has no constructor taking a collection, so the set for potentialCollisionsY must be filled with Add.
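A minimal usage sketch, keeping the collection names from the first snippet:

// Same intersection as before, but insertion order is preserved.
OrderedHashSet<Extent> result = new OrderedHashSet<Extent>();
OrderedHashSet<Extent> potentialSetY = new OrderedHashSet<Extent>();
foreach (Extent ey in potentialCollisionsY)
    potentialSetY.Add(ey);
foreach (Extent ex in potentialCollisionsX)
    if (potentialSetY.Contains(ex))
        result.Add(ex);
// Enumerating result now yields the intersection in insertion order.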

OTHER TIPS

Intersect returns distinct elements anyway, making the call to Distinct() unnecessary. That will be eating up at least some of your time.

Also, do you actually need to call ToList? What are you then doing with the result?
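For instance, a sketch of the simplified call, assuming the result only needs to be enumerated once (HandleProbableCollision is a hypothetical placeholder for whatever you do with each element):

// Distinct() dropped: Intersect already yields distinct elements.
// Skipping ToList() avoids building an intermediate list when the
// result is only enumerated once.
foreach (Extent ex in potentialCollisionsX.Intersect(potentialCollisionsY))
{
    HandleProbableCollision(ex); // hypothetical handler, not from the question
}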

Does the order matter? If not, you should consider using a HashSet<T> instead of a List<T> for your "manual" code. (And probably create a HashSet<T> for potentialCollisionsY as well.) This will make the Contains call faster, at least if the collections are large enough...

By the way, don't believe the documentation for Intersect - it's wrong about the order of operations (at least in .NET 3.5).

OK, I see the definition of the Extent class. First of all, it violates the rule that if obj1.Equals(obj2)==true then obj1.GetHashCode()==obj2.GetHashCode(). But that's beside the point and can be fixed (if you don't fix it, algorithms that depend on hashing, like a HashSet, will fail).
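A minimal sketch of one way to restore the contract, assuming Nucleus exposes the int that its == operator compares; the property name Id is a guess, not taken from the question:

public bool Equals(Extent extent)
{
    // Null check added; equality is still by Nucleus, as in the original.
    return extent != null && this.Nucleus == extent.Nucleus;
}

public override int GetHashCode()
{
    // Hash derives from the same int that == compares, so equal Extents
    // always get equal hash codes. Mirrors the original 23/17 scheme;
    // Nucleus.Id is a hypothetical property.
    return 23 * (17 + Nucleus.Id.GetHashCode());
}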

Now, if the only operation you can perform on Extent objects is to compare them for equality, then it is not possible to get the worst-case performance below O(N*M) (where N is the size of the first collection and M is the size of the second). That's because ultimately you have to compare every element with every element.

This can be improved by using GetHashCode() and the fact that objects with different hash codes must themselves be different. Other people have suggested using the HashSet class, and that would be such a solution. The best-case performance here is O(N+M), and the worst case O(N+N*M). On average you should win, unless GetHashCode() is very poorly implemented and returns the same hash code for many objects.

I myself prefer a more stable solution. If the Extent class could be sorted reliably (that is, if you could compare two Extent objects to see which is bigger and which is smaller), then you could sort both lists and bring the performance down to O(sorting+M+N). The idea is that when the lists are sorted, you can walk through them both simultaneously and look for equal elements; a sketch of the merge follows the next paragraph.

Now the sorting performance is the tricky part. If you only implement the comparison operation (that is, the IComparable interface), you can sort both lists in O(N*logN+M*logM) time; the standard List.Sort() method does that for you. All in all, the total performance would be O(N*logN+M*logM+N+M). Note however that this uses the QuickSort algorithm, which performs poorly on nearly-sorted input; its worst case is a completely sorted list, where it degrades to quadratic time, O(N²) per list. If your lists are already close to sorted, you should consider another sorting algorithm (and implement it yourself).
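A sketch of the sort-then-merge intersection, assuming Extent implements IComparable&lt;Extent&gt; consistently with its Equals (the class in the question does not, so this is an assumption):

// Sort both lists, then walk them simultaneously:
// O(N*logN + M*logM) to sort, O(N + M) to merge.
static List<Extent> SortedIntersect(List<Extent> xs, List<Extent> ys)
{
    xs.Sort();
    ys.Sort();
    List<Extent> result = new List<Extent>();
    int i = 0, j = 0;
    while (i < xs.Count && j < ys.Count)
    {
        int cmp = xs[i].CompareTo(ys[j]);
        if (cmp < 0)
            i++;
        else if (cmp > 0)
            j++;
        else
        {
            Extent match = xs[i];
            result.Add(match);
            // Skip duplicates on both sides so the result stays distinct.
            while (i < xs.Count && xs[i].CompareTo(match) == 0) i++;
            while (j < ys.Count && ys[j].CompareTo(match) == 0) j++;
        }
    }
    return result;
}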

The ultimate in reliable speed would be to convert each Extent to an integer (or, more generally, some string) with the property that equal strings mean equal Extents and unequal strings mean unequal Extents. The thing with strings is that they can be sorted in linear time with algorithms like radix sort, so the sorting would take only O(N+M). In fact, if you built a radix tree, you would only have to sort the first list and could then look strings up in it directly (each lookup taking O(1) time for fixed-length keys). All in all, the total performance would be O(N+M), which is the best available.
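This is not the radix tree itself, but a minimal sketch of the same O(N+M) idea, assuming every Extent reduces to a unique integer key in a known range; Nucleus.Id is again a hypothetical name:

// O(N + M) intersection via integer keys in [0, maxKey).
static List<Extent> KeyedIntersect(List<Extent> xs, List<Extent> ys, int maxKey)
{
    bool[] inX = new bool[maxKey];
    foreach (Extent ex in xs)
        inX[ex.Nucleus.Id] = true;   // Nucleus.Id: hypothetical int key

    bool[] taken = new bool[maxKey]; // keeps the result distinct
    List<Extent> result = new List<Extent>();
    foreach (Extent ey in ys)
    {
        int key = ey.Nucleus.Id;
        if (inX[key] && !taken[key])
        {
            taken[key] = true;
            result.Add(ey);
        }
    }
    return result;
}

Note that the two bool arrays are exactly the kind of allocation the next paragraph warns about; for a large, sparse key range they can dominate the cost.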

One thing you should always keep in mind, though: big algorithms have big constants. The radix approach might look best on paper, but it is tricky to implement and generally slower than the simpler approaches for small amounts of data. Only if your lists have thousands or tens of thousands of elements should you start thinking about it. Also, these algorithms require creating a lot of new objects, and the cost of each new() operation becomes significant as well; you should think carefully about minimizing the number of allocations.

If you can't come up with a better solution, consider using unmanaged code as a last resort.

Two approaches:

Put each item in a hashmap if it is not there already; if it is, mark it in the hashmap as duplicated. This is O(n). You then iterate over all items in the hashmap and see whether each is marked as duplicate or not - O(n) again.
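One reading of that scheme as a sketch, using a Dictionary; it relies on Extent's GetHashCode being consistent with Equals, as discussed above:

// Pass 1: record every item from X, unmarked.
Dictionary<Extent, bool> marks = new Dictionary<Extent, bool>();
foreach (Extent ex in potentialCollisionsX)
    marks[ex] = false;

// Pass 2: anything from Y already present is marked as a duplicate.
foreach (Extent ey in potentialCollisionsY)
    if (marks.ContainsKey(ey))
        marks[ey] = true;

// Pass 3: the marked entries are the intersection.
List<Extent> intersection = new List<Extent>();
foreach (KeyValuePair<Extent, bool> pair in marks)
    if (pair.Value)
        intersection.Add(pair.Key);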

Another approach:

Sort the two lists. This is an O(n lg n) operation, but crucially you may be able to maintain the two lists in sorted order at all times (see the insertion sketch below), so the cost is not paid at the moment you look for the intersection.

Then go through the two lists in order, finding the distinct and duplicate entries. This is O(n).
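If you do maintain the lists sorted at all times, insertion can use List&lt;T&gt;.BinarySearch, which returns the bitwise complement of the insertion point when the item is absent. A sketch, again assuming Extent implements IComparable&lt;Extent&gt;:

// Keeps a list sorted as items arrive, so no sort is needed later.
static void InsertSorted(List<Extent> list, Extent item)
{
    int index = list.BinarySearch(item);
    if (index < 0)
        index = ~index; // complement gives the insertion point
    list.Insert(index, item);
}

The in-order walk over the two sorted lists is then the same merge sketched earlier.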

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow