Question

This is a known pitfall for people who are getting their feet wet using LINQ:

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        IEnumerable<Record> originalCollection = GenerateRecords(new[] {"Jesse"});
        var newCollection = new List<Record>(originalCollection);

        Console.WriteLine(ContainTheSameSingleObject(originalCollection, newCollection));
    }

    private static IEnumerable<Record> GenerateRecords(string[] listOfNames)
    {
        return listOfNames.Select(x => new Record(Guid.NewGuid(), x));
    }

    private static bool ContainTheSameSingleObject(
            IEnumerable<Record> originalCollection, List<Record> newCollection)
    {
        return originalCollection.Count() == 1 && newCollection.Count() == 1 &&
                originalCollection.Single().Id == newCollection.Single().Id;
    }

    private class Record
    {
        public Guid Id { get; }
        public string SomeValue { get; }

        public Record(Guid id, string someValue)
        {
            Id = id;
            SomeValue = someValue;
        }
    }
}

This will print "False", because every time the original collection is enumerated, the Select projection is re-evaluated and a new Record object is created for each name. To fix this, a simple call to ToList could be added at the end of GenerateRecords.
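For readers outside C#, here is a rough Python sketch of the same pitfall. LazySelect is an invention for illustration, not part of any library; it simply mimics a re-iterable lazy projection the way LINQ's Select behaves:

```python
import uuid

class LazySelect:
    """A re-iterable lazy projection, loosely mimicking LINQ's Select."""
    def __init__(self, source, projection):
        self.source = source
        self.projection = projection

    def __iter__(self):
        # The projection runs again on every fresh enumeration.
        return (self.projection(x) for x in self.source)

records = LazySelect(["Jesse"], lambda name: (uuid.uuid4(), name))
first_pass = list(records)
second_pass = list(records)

# Two enumerations produce two different ids, which is the same reason
# the C# program prints False.
assert first_pass[0][0] != second_pass[0][0]
```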

What advantage did Microsoft hope to gain by implementing it this way?

Why wouldn't the implementation simply cache the results in an internal array? Part of what is happening is deferred execution, but that could still be implemented without this behavior.

Once a given member of a collection returned by LINQ has been evaluated, what advantage is provided by not keeping an internal reference/copy, but instead recalculating the same result, as a default behavior?

In situations where the logic genuinely needs the same member of a collection recalculated over and over, it seems like that could be specified through an optional parameter, with the default behavior doing otherwise. In addition, the speed advantage gained by deferred execution is ultimately undercut by the time it takes to continually recalculate the same results. Finally, this is a confusing stumbling block for those who are new to LINQ, and it can lead to subtle bugs in anyone's program.

What advantage is there to this, and why did Microsoft make this seemingly very deliberate decision?


Solution

What advantage was gained by implementing LINQ in a way that does not cache the results?

Caching the results would simply not work for everybody. As long as you have tiny amounts of data, great. Good for you. But what if your data is larger than your RAM?

It has nothing to do with LINQ, but with the IEnumerable<T> interface in general.

It is the difference between File.ReadAllLines and File.ReadLines. One will read the whole file into RAM, and the other will give it to you line by line, so you can work with large files (as long as they have line-breaks).
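That difference can be sketched in Python terms. The function names below mirror the C# APIs purely for illustration; the temp file is created just so the sketch is runnable:

```python
import os
import tempfile

def read_all_lines(path):
    # Analogue of File.ReadAllLines: the whole file is loaded into memory.
    with open(path) as f:
        return f.readlines()

def read_lines(path):
    # Analogue of File.ReadLines: lines are yielded one at a time,
    # so memory use stays constant regardless of file size.
    with open(path) as f:
        for line in f:
            yield line

# Tiny demo file, created only for this sketch.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("one\ntwo\nthree\n")

assert read_all_lines(path) == list(read_lines(path)) == ["one\n", "two\n", "three\n"]
os.remove(path)
```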

You can easily cache everything you want to cache by materializing your sequence, calling either .ToList() or .ToArray() on it. And those of us who do not want to cache it have the option not to.

And on a related note: how do you cache the following?

IEnumerable<int> AllTheZeroes()
{
    while(true) yield return 0;
}

You cannot. That's why IEnumerable<T> exists as it does.
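The same impossible-to-cache sequence, sketched in Python, where a generator plays the role of the lazy IEnumerable&lt;T&gt;. Materializing it would never terminate, but a lazy consumer can still take a finite prefix:

```python
from itertools import islice

def all_the_zeroes():
    # An infinite lazy sequence: impossible to cache in full.
    while True:
        yield 0

# list(all_the_zeroes()) would never return; a lazy prefix is fine.
first_five = list(islice(all_the_zeroes(), 5))
assert first_five == [0, 0, 0, 0, 0]
```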

Other tips

What advantage did Microsoft hope to gain by implementing it this way?

Correctness? I mean, the core enumerable can change in between calls. Caching it would produce incorrect results and open the entire “when/how do I invalidate that cache?” can of worms.

And if you consider that LINQ was originally designed as a means to do LINQ to data sources (like Entity Framework, or SQL directly), the enumerable was going to change, since that's what databases do.

On top of that, there are Single Responsibility Principle concerns. It is far easier to write query code that works and build caching on top of it than to build code that queries and caches and then remove the caching.

Because LINQ is, and was intended from the beginning to be, a generic implementation of the Monad pattern popular in functional programming languages, and a Monad is not constrained to always yield the same values given the same sequence of calls (in fact, its use in functional programming is popular precisely because of this property, which allows for escaping the deterministic behaviour of pure functions).

Another reason that hasn't been mentioned is the possibility of chaining different filters and transformations without creating garbage intermediate results.

Take this for example:

cars.Where(c => c.Year > 2010)
    .Select(c => new { c.Model, c.Year, c.Color })
    .GroupBy(c => c.Year);

If the LINQ methods computed the results immediately, we would have 3 collections:

  • Where result
  • Select result
  • GroupBy result

We only care about the last one. There's no point in saving the intermediate results, because we don't have access to them; we only want to know about the cars already filtered and grouped by year.

If there were a need to save any of these results, the solution is simple: break the calls apart, call .ToList() on them, and save the results in variables.
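As an illustration of the pipeline running without intermediate collections, here is a rough Python equivalent built from generators. The car data and field names are made up for the sketch; each stage is lazy, so no intermediate list is ever materialized:

```python
from collections import defaultdict

cars = [
    {"model": "Civic",  "year": 2012, "color": "red"},
    {"model": "Beetle", "year": 2008, "color": "blue"},
    {"model": "Model3", "year": 2019, "color": "white"},
]

# Each stage is a lazy generator; nothing is computed yet.
recent = (c for c in cars if c["year"] > 2010)         # Where
projected = ((c["model"], c["year"]) for c in recent)  # Select

by_year = defaultdict(list)                            # GroupBy
for model, year in projected:
    # The whole pipeline runs element by element right here,
    # with no intermediate collections in between.
    by_year[year].append(model)

assert dict(by_year) == {2012: ["Civic"], 2019: ["Model3"]}
```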


Just as a side note, in JavaScript, the Array methods actually return the results immediately, which can lead to more memory consumption if one isn't careful.

Fundamentally, this code (putting a Guid.NewGuid() inside a Select statement) is highly suspicious. This is surely a code smell of some kind!

In theory, we would not necessarily expect a Select statement to create new data, but rather to retrieve existing data. While it is reasonable for Select to join data from multiple sources to produce joined content of a different shape, or even compute additional columns, we might still expect it to be functional and pure. Putting the NewGuid() inside makes it non-functional and non-pure.

The creation of the data could be teased apart from the select and put into a create operation of some sort, so that the select can remain pure and reusable; or else the select should be done only once and wrapped/protected, which is the .ToList() suggestion.

However, to be clear, the issue seems to me to be the mixing of creation inside selection rather than a lack of caching. Putting the NewGuid() inside the select appears to me to be an inappropriate mixing of programming models.
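A small Python sketch of that separation (all names here are illustrative): creation runs once, eagerly, and the subsequent projection stays pure, so it can be re-enumerated safely:

```python
import uuid

# Creation happens once, eagerly: each record's id is fixed at this point.
names = ["Jesse"]
records = [{"id": uuid.uuid4(), "name": n} for n in names]

def select_ids_and_names():
    # A pure projection over already-created data: re-enumerating
    # it yields the same values every time.
    return ((r["id"], r["name"]) for r in records)

first = list(select_ids_and_names())
second = list(select_ids_and_names())
assert first[0][0] == second[0][0]  # same id on every pass
```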

Deferred execution allows those writing LINQ code (to be precise, using IEnumerable<T>) to explicitly choose whether the result is immediately computed and stored in memory, or not. In other words, it allows programmers to choose the calculation time versus storage space tradeoff that is most appropriate to their application.
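That tradeoff can be made concrete in Python terms. The call counter below is purely for illustration; it shows the cost of each choice directly:

```python
calls = 0

def square(n):
    # Instrumented projection so we can count how often it runs.
    global calls
    calls += 1
    return n * n

class LazySquares:
    """A re-iterable lazy sequence, like an un-materialized IEnumerable<T>."""
    def __init__(self, nums):
        self.nums = nums

    def __iter__(self):
        return (square(n) for n in self.nums)

lazy = LazySquares([1, 2, 3])
list(lazy)
list(lazy)
assert calls == 6  # recomputed on every pass: time cost, no storage cost

calls = 0
materialized = list(LazySquares([1, 2, 3]))  # the ".ToList()" choice
list(materialized)
list(materialized)
assert calls == 3  # computed once and stored: storage cost, no recompute
```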

It could be argued that the majority of applications want the results immediately, so that should have been the default behaviour of LINQ. But there are numerous other APIs (e.g. List<T>.ConvertAll) that offer this behaviour and have done so since the Framework was created, whereas until LINQ was introduced there was no way to have deferred execution. Which, as other answers have demonstrated, is a prerequisite for certain types of computation that would otherwise be impossible (by exhausting all available storage) when using immediate execution.

Licensed under: CC-BY-SA with attribution