I need to speed up the following Linq query

Question 1

Step 1: Establish a business case

The first thing we need to do is ask "How fast does it need to be?", because if we don't know how fast it needs to be we can't know when we're done. This isn't a technical decision, it's a business one. You need a stakeholder-centric measure of "Fast Enough" to aim for, and you need to bear in mind that Fast Enough is fast enough. We aren't looking for "As Fast As Possible" unless there's a business reason for it. Even then, we're normally looking for "As Fast As Possible Within Budget".

Since you're my stakeholder, and you don't seem to be too upset about the performance of your stored procedure, let's use that as a benchmark!

Step 2: Measure

The next thing we need to do is measure our system to see if we're Fast Enough.

Thankfully you've already measured (though we'll talk more about this later). Your stored procedure runs in 0.5 seconds! Is that Fast Enough? Yes it is! Job done!

There is no justification for continuing to spend your time (and your boss' money) fixing something that isn't broken. You probably have something better to be doing, so go do that! :D

Still here? Ok then. I'm not on the clock, people are badmouthing tech I like, and optimising Entity Framework queries is fun. Challenge Accepted!

Step 3: Inspect

So what's going on? Why is our query so slow?

To answer that question, I'm going to need to make some assumptions about your model:-

public class Foo
{
    public int Id { get; set; }

    public int BarId { get; set; }

    public virtual Bar Bar { get; set; }
}

public class Bar
{
    public int Id { get; set; }

    public string Value { get; set; }

    public virtual ICollection<Foo> Foos { get; set; }
}

Now that we've done that, we can have a look at the horrible query that Entity Framework is making for us:-

using (var context = new FooContext())
{
    context.Database.Log = s => Console.WriteLine(s);

    var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;
}

I can see from the log that TWO queries are being run:-

SELECT TOP (1)
[Extent1].[Id] AS [Id],
[Extent1].[BarId] AS [BarId]
FROM [dbo].[Foos] AS [Extent1]
WHERE 1 = [Extent1].[Id]

SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE [Extent1].[Id] = @EntityKeyValue1

Wait, what? Why is stupid Entity Framework making two round-trips to the database when all we need is one string?

Step 4: Analyze

Let's take a step back and look at our query again:-

var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;

Given what we know about Deferred Execution what can we deduce is going on here?

What deferred execution basically means is that as long as you're working with an IQueryable, nothing actually happens - the query is built up in memory and not actually executed until later. This is useful for a number of reasons - in particular it lets us build up our queries in a modular fashion then run the composed query once. Entity Framework would be pretty useless if context.Foos loaded the entire Foo table into memory immediately!

Our queries only get run when we ask for something other than an IQueryable, e.g. with .AsEnumerable(), .ToList(), or especially .GetEnumerator() etc. In this case .FirstOrDefault() doesn't return an IQueryable, so this triggers the database call much earlier than we presumably intended.

The query we've made is basically saying:-

Get the first Foo with Id == 1 (or null if there aren't any)
Now Lazy Load that Foo's Bar
Now tell me that Bar's Value

Wow! So not only are we making two round-trips to the database, we're also sending the entire Foo and Bar across the wire! That's not so bad when our entities are tiny like the contrived ones here, but what if they were larger realistic ones?

Step 5: Optimize

As you've hopefully gleaned from the above, the first two rules of optimisation are 1) "Don't" and 2) "Measure first" The third rule of optimisation is "Avoid unnecessary work". An extra round-trip and a whole bunch of spurious data definitely counts as "unnecessary", so let's do something about that:-

Attempt 1

The first thing we want to do is try the declarative approach. "Find me the value of the first Bar that has a Foo with Id == 1".

This is usually the clearest option from a maintainability point of view; the intent of the programmer is obviously captured. However, remembering that we want to delay execution as long as possible, let's pop the .FirstOrDefault() after the .Select():-

var query = context.Bars.Where(x => x.Foos.Any(y => y.Id == 1))
                        .Select(x => x.Value)
                        .FirstOrDefault();

SELECT TOP (1)
[Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE  EXISTS (SELECT
    1 AS [C1]
    FROM [dbo].[Foos] AS [Extent2]
    WHERE ([Extent1].[Id] = [Extent2].[BarId]) AND (1 = [Extent2].[Id])
)

Attempt 2

In both SQL and most O/RMs, a useful trick is to make sure you're querying from the correct "end" of any given relationship. Sure, we're looking for a Bar, but we've got the Id of a Foo, so we can rewrite the query with that as a starting point: "Find me the Value of the Bar of the Foo with Id == 1":-

var query = context.Foos.Where(x => x.Id == 1)
                        .Select(x => x.Bar.Value)
                        .FirstOrDefault();

SELECT TOP (1)
[Extent2].[Value] AS [Value]
FROM  [dbo].[Foos] AS [Extent1]
INNER JOIN [dbo].[Bars] AS [Extent2] ON [Extent1].[BarId] = [Extent2].[Id]
WHERE 1 = [Extent1].[Id]

Much better. Prima Facie these look preferable to both the original Entity-Framework-generated mess and the original stored procedure. Done!

Step 6: Measure

No! Just wait a minute! How do we know if we're Fast Enough? How do we even know if we're faster?

We measure!

And unfortunately you'll have to do this bit on your own. I can tell you that on my machine, on my network, simulating a realistic load for my application, the INNER JOIN is the fastest, followed by the two round-trips version (!!), followed by the WHERE EXISTS version, followed by the stored procedure. I can't tell you which will be fastest on your hardware, on your network, under a realistic load for your application.

I can tell you that I've made this exact performance optimization over a dozen times and depending on the characteristics of the network, database server, and schema I've seen all three of INNER JOIN, WHERE EXISTS, and two round-trips give the best performance.

However, I can't even tell you if any of these are Fast Enough. Depending on your needs you might need to hand-roll some hyper-optimized SQL and invoke a stored procedure. You might even need to go further and use a denormalised read-optimized read store. What about using an in-memory cache for your database results? What about using an output cache for your webserver? What if this query isn't even the bottleneck?

Good performance isn't about speeding up Entity Framework queries. Good performance, like just about anything in our industry, is about knowing what's important to your customer, and figuring out the best way to get it.

Question 2

The first thing I would recommend doing is calling ToString() on your linq queries to see the SQL being generated. Based on your query and your configuration, it is possible you are making two trips to the database, Once to get Table2, then again to get the associated Table1 entity via lazy loading. You should try to verify if this is the case either with SQL profiler or stepping through with the debugger. See if rewriting your query like the following adds any performance enhancement which eagerly loads the related entity:

var result = dataContext.Table2.
             .include("Table1")
             .FirstOrDefault(x => x.Table2ID == ID);

if(result != null){
    return result..Table1.StringValue;
}else{....}

Notice I also added in some logic checking if result is null. You are using FirstOrDefault, which will cause .Table1 to throw an exception if the result is not found. I would either change the call to First() if you never expect the result to be null, or handle the null case.

Another thing you should look at is how EF is configured to match against a NULL case, that could slow down your query. Check out this post (not to link to my own post, but its relevant): EntityFramework LINQToEntities generate weird slow TSQL Where-Clause

Question 3

This should yield the correct result but I can not tell how efficient it is; you will have to profile it. Note that the query will really only fetch a single string from the database and not require any client side processing by the Entity Framework.

dataContext.Table2
           .Where(x => (x.Table2ID == ID) || (x.Table2ID == null))
           .OrderByDescending(x => x.Table2ID) // This will place ID before NULL.
           .Select(x => x.Table1.StringValue)
           .First()

Using LINQPad I got more or less the expected SQL statement but I did not try if the Entity Framework will produce the same query. But because this is a single query there is even a slight chance that the Entity Framework can outperform the stored procedure with its conditional second query but obviously only because of the reformulated query.

 SELECT TOP (1) [t1].[StringValue]
           FROM [Table2] AS [t2]
LEFT OUTER JOIN [Table1] AS [t1]
             ON [t1].[Table1ID] = [t2].[Table1ID]
          WHERE ([t2].[Table2ID] = @ID) OR ([t2].[Table2ID] IS NULL)
       ORDER BY [t2].[Table2ID] DESC