Question

I need to count the number of elements corresponding to the intersection of two big arrays of strings and do it very fast.

I am using the following code:

arr1[i].Intersect(arr2[j]).Count()

For CPU Time, VS Profiler indicates

  • 85.1% in System.Linq.Enumerable.Count()
  • 0.3% in System.Linq.Enumerable.Intersect()

Unfortunately it might to take hours to do all work.

How to do it faster?

Was it helpful?

Solution

You can use HashSet with arr2

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

Not considering the conversion from arr2 to arr2Set ,this should be O(n)

OTHER TIPS

I suspect the reason why the profiler shows the time being consumed in Count, is that this is where the collection is actually enumerated (the Intersect is lazily evaluated and does not run before you need the result).

I believe Intersect should have some internal optimizations to make this reasonably fast, but you could try using a HashSet<string> so you are sure the intersect can be made without searching through the inner array for each element:

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;

Hmmm Intersect is probably N^2

to make it faster quicksort both arrays. and than traverse both arrays. counting intersections.

too lazy to test how fast it would be but should O(nlogn +n)

    using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

        {

            HashSet<string> set = new HashSet<string>(arr1);
            var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
            set.IntersectWith(arr2);
            int count = set.Count;
            Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
        }
            {
               var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {

                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

results

array 914,637

HashSet 816,119

HashSet +new 1,150,978

SortedSet +new 16,173,836

SortedSet without new 7,946,709

so seems that best way is to keep a ready hash set.

when you are working with sets, your complexity will be O((n log n)*(m log m)) or so,

i think this here should be faster, but i'm not sure if it is now O((n log n)+(m log m))

possible would be 
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();  

nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();

Build a HashSet using the smaller array and then loop through the bigger one, incrementing a counter if the item it exists in the hashset.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top