Determine if an IEnumerable<T> contains any object of another IEnumerable<T>

https://stackoverflow.com/questions/691145

22-08-2019
|

Question

I have 2 IEnumerable<int>

IEnumerable<int> x;
IEnumerable<int> y;

What is the best best way to determine if any int in y is present in the x?
Currently I'm using:

return x.Intersect<int>(y).Count() > 0;

Would it be significantly faster to loop through and test each individually?

foreach (int i in x)
{
    foreach (int j in y)
    {
        if (i == j) return true;
    }
}
return false;

The lists are relatively light, with not more than 50 ints in x and 4 in y if that matters in the consideration.

Solution

It would be fastest to use the Any method instead of the Count method:

return x.Intersect<int>(y).Any();

This assumes that the IEnumerable<int> implementation doesn't also implement ICollection<int>. In that case, Count (in the case where IEnumerable<T> implements ICollection<T>) is an O(N) operation while Any is always an O(1) operation. (as it only checks for a single element). However, the behavior of Count is an implementation detail, and you shouldn't rely on that.

I've written about this more in-depth in a blog post that goes into detail about when to use Count() vs. Any(). In summary:

DO use Enumerable.Any extension method to check for the existence of elements in the sequence.
DO NOT use Enumerable.Count extension method in comparisons against zero, as the following are semantically equivalent:
- sequence.Count() == 0
- !sequence.Any()
DO NOT use the Enumerable.Count extension method in comparisons against the “not zero” condition, as the following are semantically equivalent:
- sequence.Count != 0
- sequence.Any()

OTHER TIPS

EDIT: The original answer below is really dealing in terms of complexity. If the sequences are short enough, all the calls through GetNext(), building a HashSet etc will actually be more expensive than using Intersect(y).Any(). However, in that case both calls will be really pretty quick anyway.

In other words, Intersect(y).Any() definitely scales better as the two sequences get longer, but for if you're absolutely sure that the sequences are short, the nested loop solution will be faster.

Original answer

No, Intersect() will be faster than the two-loop solution - but as CasperOne wrote, Any() will be faster than Count() because it will exit as soon as it sees an element.

Assuming sequences of length N and M, Intersect will be O(N + M) whereas the two-loop solution is O(N * M).

Intersect builds up a HashSet from the "inner" sequence (this takes O(M) complexity) and then loops through the outer sequence (which takes O(N) complexity), yielding any element which is in the set. These results are streamed - the inner sequence will be evaluated when the first element is requested from Intersect(), but this only goes as far as finding the first match (if any). Using Any() you'll have an "early out" if there are any matches, so we don't need to take the overall number of matches into consideration when working out the complexity.

Streaming results from LINQ rocks - it's well worth getting your head round (as well as deferred execution).

Intersect is okay, but as others have said I wouldn't call .Count() on the result.

The reason is that Intersect does not create the intersection of the two lists. It creates an IEnumerable that is capable of enumerating that intersection, but it doesn't actually enumerate those results yet. Most of the work is deferred until the time when you finally iterate over this enumeration.

The problem with Count is that it does iterate over the entire enumeration. So not only does it always count all the results, but it causes all of the work involved in computing those results to run as well.

Calling Any instead will be very fast by comparison, because you will compute at most one intersection result before returning. Of course, in the case where there are no matches it will still need to iterate the entire list. However, that's no worse off than you were before. In fact, it's still faster because as Jon Skeet mentioned, the Intersect function uses a HashSet to compute the results rather than iterating over everything. Your best and average cases are improved tremendously.

It's like the difference between these two snippets:

int count = 0;
foreach (int i in x)
{
   foreach (int j in y)
   {
      if (i==j) count++;
   }
}
return (count > 0);

// this one should look familiar
foreach (int i in x)
{
    foreach (int j in y)
    {
       if (i==j) return true;
    }
}
return false;

Obviously the 2nd is much faster on average. The performance of Any() would be analogous (not the same as, thanks to the HashSet) the 2nd snippet, while Count() would be analogous to the first.

You're better of doing this:

return x.Intersect(y).Any();

This will abort as soon as it finds a single match, and stop enumerating through the collections.

This question and last answer are more than 1 yr old as of my answer; however, my findings differ from the accepted answer. I find that looping is significantly faster than using Intersect.Any(). Maybe my benchmark code isn't right?

Here's the code I used to find the number of iterations per second between Intersect, nested loops, and a loop with IndexOf. Please excuse the VB. ;)

Dim ArrayLength As Integer = 50
Dim doesContain As Boolean = False
Dim counter As Integer = 0
Dim sw As New System.Diagnostics.Stopwatch()
Dim BigArray1 As String() = New String(ArrayLength) {}
Dim BigArray2 As String() = New String(ArrayLength) {}
Dim rand As New Random(DateTime.Now.Millisecond)
For i As Integer = 0 To ArrayLength
    BigArray1(i) = Convert.ToChar(rand.Next(0, 255)).ToString()
    BigArray2(i) = Convert.ToChar(rand.Next(0, 255)).ToString()
Next
Dim AnEnumerable1 As IEnumerable(Of String) = BigArray1
Dim AnEnumerable2 As IEnumerable(Of String) = BigArray2

counter = 0
sw.Restart()
Do
    doesContain = False
    For Each x As String In AnEnumerable1
        For Each y As String In AnEnumerable2
            If x.Equals(y) Then
                doesContain = True
                Exit For
            End If
        Next
        If doesContain Then Exit For
    Next
    counter += 1
Loop While sw.ElapsedMilliseconds < 1000
Console.WriteLine("InnerLoop iterated:  " & counter.ToString() & " times in " & sw.ElapsedMilliseconds.ToString() & "ms.")

counter = 0
sw.Restart()
Do
    doesContain = AnEnumerable1.Intersect(AnEnumerable2).Any()
    counter += 1
Loop While sw.ElapsedMilliseconds < 1000
Console.WriteLine("Intersect iterated:  " & counter.ToString() & " times in " & sw.ElapsedMilliseconds.ToString() & "ms.")

counter = 0
sw.Restart()
Do
    For Each x As String In AnEnumerable1
        If Array.IndexOf(Of String)(BigArray2, x) <> -1 Then
            Exit For
        End If
    Next
    counter += 1
Loop While sw.ElapsedMilliseconds < 1000
Console.WriteLine("IndexOf iterated:    " & counter.ToString() & " times in " & sw.ElapsedMilliseconds.ToString() & "ms.")

My results over two 5,000,000 item arrays. Higher iterations are better:

InnerLoop iterated: 2974553 times in 1000ms.
Intersect iterated: 4 times in 1168ms.
IndexOf iterated: 4194423 times in 1000ms.

My results over two 50 item arrays. Higher iterations are better:

InnerLoop iterated: 375712 times in 1000ms.
Intersect iterated: 203721 times in 1000ms.
IndexOf iterated: 668421 times in 1000ms.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow