Question

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.

I know LINQ (or searching/filtering extension methods) works on objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query at least O(n) in complexity?

For example:

var result = list.FirstOrDefault(o => o.something > n);

In this case, every algorithm will take at least O(n) unless the list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if the list was previously ordered.

  • Is there something I can do to solve a query in O(log(n))?
  • If I want performance, should I use Array.Sort and Array.BinarySearch? (Something like the sketch after this list.)
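
Just to make the second point concrete, this is roughly what I have in mind (only a sketch; it copies the 'something' keys into a separate array and ignores duplicates of n):

// Sketch: copy out and sort the keys once (O(n log n)), then each search is O(log n).
int[] keys = list.Select(o => o.something).ToArray();
Array.Sort(keys);

// Array.BinarySearch returns the index of a match, or the bitwise complement
// of the index of the first larger element when there is no exact match.
int index = Array.BinarySearch(keys, n);
int firstGreater = index >= 0 ? index + 1 : ~index;   // ignoring duplicates of n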

Solution

Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
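
For example, a parallel version of the original query still has to be able to examine every element in the worst case; a sketch like the one below just spreads that linear work across cores:

// Still O(n) overall - PLINQ only divides the scan among the available cores.
var result = list.AsParallel().FirstOrDefault(o => o.something > n);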

Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)

To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
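
As a very rough sketch of what that examination might involve (the method name here is invented), you'd accept an Expression<Func<T, bool>> rather than a Func<T, bool>, and pattern-match its body before deciding whether an optimised search applies:

// Sketch only: recognises just the pattern "x => x.Member <op> constant".
static bool PredicateUsesMember<T>(Expression<Func<T, bool>> predicate, string memberName)
{
    return predicate.Body is BinaryExpression binary
        && binary.Left is MemberExpression member
        && member.Member.Name == memberName;
}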

I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
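
For instance, something along these lines (names invented; it assumes the list really is sorted by Age, ascending) makes the reliance on ordering explicit and runs in O(log n):

static class SortedListExtensions
{
    // Sketch: binary search for the first person whose Age exceeds the threshold.
    // Only valid if 'sortedByAge' is actually sorted by Age.
    public static Person FirstWithAgeGreaterThan(this List<Person> sortedByAge, int threshold)
    {
        int lo = 0, hi = sortedByAge.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sortedByAge[mid].Age > threshold) hi = mid;
            else lo = mid + 1;
        }
        return lo < sortedByAge.Count ? sortedByAge[lo] : null;
    }
}

You'd call it as people.FirstWithAgeGreaterThan(30) in dot notation, so it still composes with the rest of LINQ.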

OTHER TIPS

Yes, the generic case is always O(n), as Sklivvz said.

However, many LINQ methods special-case the situation where the object implementing IEnumerable also implements e.g. ICollection. (I've seen this for IEnumerable.Contains at least.)

In practice this means that, for example, LINQ's IEnumerable.Contains calls the fast HashSet.Contains if the IEnumerable actually is a HashSet.

IEnumerable<int> mySet = new HashSet<int>();

// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }

You can use Reflector to check exactly how the LINQ methods are defined; that is how I figured this out.

Oh, and LINQ also contains the methods IEnumerable.ToDictionary (maps a key to a single value) and IEnumerable.ToLookup (maps a key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
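
For example (a sketch; the Person type and City property are placeholders), the lookup is built once in O(n) and every subsequent query is a cheap hash lookup rather than another scan:

// Built once: O(n).
ILookup<string, Person> peopleByCity = people.ToLookup(p => p.City);

// Each of these is then roughly O(1) plus the size of the result,
// instead of a full enumeration.
foreach (var person in peopleByCity["London"]) { /* code */ }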

Yes, it has to be, because the only way of accessing any element of an IEnumerable is by using its methods, which means O(n).

It seems like a classic case in which the language designers decided to trade performance for generality.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow