Question

I'm trying to build a Stack Overflow clone in my own time to learn EF6 and MVC5; I'm currently using OWIN for authentication.

Everything works fine when I have around 50-60 questions. I then used the Red Gate data generator to ramp it up to 1 million questions, with a couple of thousand child table rows without relationships, just to 'stress' the ORM a bit. Here's what the LINQ looks like:

var query = ctx.Questions
               .AsNoTracking()     //read-only performance boost.. http://visualstudiomagazine.com/articles/2010/06/24/five-tips-linq-to-sql.aspx
               .Include("Attachments")                                
               .Include("Location")
               .Include("CreatedBy") //IdentityUser
               .Include("Tags")
               .Include("Upvotes")
               .Include("Upvotes.CreatedBy")
               .Include("Downvotes")
               .Include("Downvotes.CreatedBy")
               .AsQueryable();

if (string.IsNullOrEmpty(sort)) //default
{
    query = query.OrderByDescending(x => x.CreatedDate);
}
else
{
    sort = sort.ToLower();
    if (sort == "latest")
    {
        query = query.OrderByDescending(x => x.CreatedDate);
    }
    else if (sort == "popular")
    {
        //most viewed
        query = query.OrderByDescending(x => x.ViewCount);
    }
}

var complaints = query.Skip(skipCount)
                      .Take(pageSize)
                      .ToList(); //makes an evaluation..

Needless to say, I'm getting SQL timeouts. After installing MiniProfiler and looking at the generated SQL statement, it's a monstrous few hundred lines long.

I know I'm joining/including too many tables, but in how many real-life projects do we only have to join 1 or 2 tables? There might be situations where we have to do this many joins against multi-million-row tables; is using stored procedures the only way?

If that's the case, would EF itself only be suitable for small-scale projects?


OTHER TIPS

Most likely the problem you are experiencing is a Cartesian product.

Based on just some sample data:

var query = ctx.Questions // 50 
  .Include("Attachments") // 20                                
  .Include("Location") // 10
  .Include("CreatedBy") // 5
  .Include("Tags") // 5
  .Include("Upvotes") // 5
  .Include("Upvotes.CreatedBy") // 5
  .Include("Downvotes") // 5
  .Include("Downvotes.CreatedBy") // 5

  // Where Blah
  // Order By Blah

This returns a number of rows upwards of

50 x 20 x 10 x 5 x 5 x 5 x 5 x 5 x 5 = 156,250,000

Seriously... that is an INSANE number of rows to return.

You really have two options if you are having this issue:

First: The easy way. Rely on Entity Framework to wire up the models automagically as they enter the context, then afterwards use the entities (AsNoTracking()) and dispose of the context.

// Continuing with the query above:

var questions   = query.ToList();                             // executes: 1 query for Questions
var attachments = query.Select(q => q.Attachments).ToList();  // 1 query for Attachments
var locations   = query.Select(q => q.Location).ToList();     // 1 query for Location

This makes one request per table, but instead of 156 MILLION rows you only download around 110 rows. The cool part is that they are all wired up in the EF context cache, so the questions variable now has its navigation properties completely populated.
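The remaining navigation properties from the original query can be pulled in with the same pattern. A hedged sketch (property names are taken from the question; the shape of the vote entities is assumed):

// Sketch only: one round trip per navigation property, same pattern as above.
var tags       = query.Select(q => q.Tags).ToList();
var upvotes    = query.Select(q => q.Upvotes).ToList();
var upvoters   = query.SelectMany(q => q.Upvotes).Select(v => v.CreatedBy).ToList();
var downvotes  = query.Select(q => q.Downvotes).ToList();
var downvoters = query.SelectMany(q => q.Downvotes).Select(v => v.CreatedBy).ToList();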

Second: Create a stored procedure that returns multiple result sets and have EF materialize the classes.
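A minimal sketch of that approach in EF6, assuming a hypothetical [dbo].[GetQuestionsPage] procedure that returns two result sets (questions, then attachments); the procedure name, context class, and entity set names are assumptions:

using (var ctx = new QuestionsContext())
{
    var cmd = ctx.Database.Connection.CreateCommand();
    cmd.CommandText = "[dbo].[GetQuestionsPage]";
    cmd.CommandType = System.Data.CommandType.StoredProcedure;

    ctx.Database.Connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        var objectContext = ((System.Data.Entity.Infrastructure.IObjectContextAdapter)ctx).ObjectContext;

        // Materialize the first result set as tracked Question entities.
        var questions = objectContext
            .Translate<Question>(reader, "Questions", System.Data.Entity.Core.Objects.MergeOption.AppendOnly)
            .ToList();

        // Move to the second result set and materialize the attachments;
        // the context fixes up the Question.Attachments associations.
        reader.NextResult();
        var attachments = objectContext
            .Translate<Attachment>(reader, "Attachments", System.Data.Entity.Core.Objects.MergeOption.AppendOnly)
            .ToList();
    }
}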

New Third: EF Core now supports splitting queries like the above while keeping the nice .Include() methods. Split queries do have a few gotchas, so I recommend reading all of the documentation.

Example from that documentation:

If a typical blog has multiple related posts, rows for these posts will duplicate the blog's information. This duplication leads to the so-called "cartesian explosion" problem.

using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog => blog.Posts)
        .AsSplitQuery()
        .ToList();
}

It will produce the following SQL:

SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]

SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Post] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
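If split queries turn out to be the right default for most of your queries, EF Core (5.0+) also lets you configure them globally at context setup instead of calling AsSplitQuery() per query. A sketch, with the connection string as a placeholder:

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    // Make split queries the default for this context.
    optionsBuilder.UseSqlServer(
        "<your connection string>",
        o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));
}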

I don't see anything obviously wrong with your LINQ query (the .AsQueryable() call isn't necessary, but removing it won't change anything). Of course, don't include unnecessary navigation properties (each one adds a SQL JOIN), but if everything is required, it should be OK.

Now that the C# code looks OK, it's time to look at the generated SQL. As you already did, the first step is to retrieve the SQL query that is actually executed. There are .NET ways of doing it; for SQL Server I personally always start a SQL Server profiling session.
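For reference, two of the ".NET ways" of getting the generated SQL in EF6, both built in:

// Log every command the context sends to the database (here, to the debug output).
ctx.Database.Log = sql => System.Diagnostics.Debug.Write(sql);

// Or, for a single LINQ query, ToString() on the IQueryable returns the SQL text.
var sqlText = query.ToString();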

Once you have the SQL query, try to execute it directly against your database, and don't forget to include the actual execution plan. It will show you exactly which part of your query takes the majority of the time, and will even indicate obvious missing indexes.

Now the question is: should you add all the indexes SQL Server tells you are missing? Not necessarily. See, for example, "Don't just blindly create those missing indexes". You'll have to choose which indexes are worth adding and which aren't.

As the code-first approach created indexes for you, I'm assuming those are indexes on the primary and foreign keys only. That's a good start, but it's not enough. I don't know the number of rows in your tables, but an obvious index that only you can add (no code-generation tool can, because it depends on your business queries) is, for example, an index on the CreatedDate column, since you order your items by that value. Without it, SQL Server has to do a table scan over 1M rows, which will of course be disastrous in terms of performance.
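With EF 6.1+ code-first, such an index can be declared directly on the model. A sketch only; the entity shape is assumed from the question, and the attribute requires the System.ComponentModel.DataAnnotations.Schema namespace from the EntityFramework package:

public class Question
{
    public int Id { get; set; }

    // Supports the default ORDER BY CreatedDate DESC paging without a full table scan.
    [Index("IX_Questions_CreatedDate")]
    public DateTime CreatedDate { get; set; }

    public int ViewCount { get; set; }

    // ... other scalar and navigation properties
}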

So :

  • try to remove some of the Include calls if you can
  • look at the actual execution plan to see where the performance issue in your query actually is
  • add only the missing indexes that make sense, depending on how you're ordering/filtering the data you're getting from the DB

As you already know, the Include method generates monstrous SQL.

Disclaimer: I'm the owner of the project Entity Framework Plus (EF+)

The EF+ Query IncludeOptimized method allows optimizing the SQL generated exactly like EF Core does.

Instead of one monstrous SQL statement, multiple SQL statements are generated (one for each include). As a bonus, this feature also allows filtering the related entities (see the sketch after the example below).

Docs: EF+ Query IncludeOptimized

var query = ctx.Questions
               .AsNoTracking()
               .IncludeOptimized(x => x.Attachments)                                
               .IncludeOptimized(x => x.Location)
               .IncludeOptimized(x => x.CreatedBy) //IdentityUser
               .IncludeOptimized(x => x.Tags)
               .IncludeOptimized(x => x.Upvotes)
               .IncludeOptimized(x => x.Upvotes.Select(y => y.CreatedBy))
               .IncludeOptimized(x => x.Downvotes)
               .IncludeOptimized(x => x.Downvotes.Select(y => y.CreatedBy))
               .AsQueryable();
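A sketch of the related-entity filtering mentioned above; the IsDeleted flag is a hypothetical property, not something from the question:

var filtered = ctx.Questions
                  .AsNoTracking()
                  .IncludeOptimized(x => x.Upvotes.Where(v => !v.IsDeleted))
                  .IncludeOptimized(x => x.Downvotes.Where(v => !v.IsDeleted));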

Take a look at section 8.2.2 of this document from Microsoft:

8.2.2 Performance concerns with multiple Includes

When we hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements. While including related entities in a query is powerful, it's important to understand what's happening under the covers.

It takes a relatively long time for a query with multiple Include statements in it to go through our internal plan compiler to produce the store command. The majority of this time is spent trying to optimize the resulting query. The generated store command will contain an Outer Join or Union for each Include, depending on your mapping. Queries like this will bring in large connected graphs from your database in a single payload, which will exacerbate any bandwidth issues, especially when there is a lot of redundancy in the payload (i.e. with multiple levels of Include to traverse associations in the one-to-many direction).

You can check for cases where your queries are returning excessively large payloads by accessing the underlying TSQL for the query by using ToTraceString and executing the store command in SQL Server Management Studio to see the payload size. In such cases you can try to reduce the number of Include statements in your query to just bring in the data you need. Or you may be able to break your query into a smaller sequence of subqueries, for example:

Before breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var customers = from c in context.Customers.Include(c => c.Orders)
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

After breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var orders = from o in context.Orders
                 where o.Customer.LastName.StartsWith(lastNameParameter)
                 select o;

    orders.Load();

    var customers = from c in context.Customers
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

This will work only on tracked queries, as we are making use of the ability the context has to perform identity resolution and association fixup automatically.

As with lazy loading, the tradeoff will be more queries for smaller payloads. You can also use projections of individual properties to explicitly select only the data you need from each entity, but you will not be loading entities in this case, and updates will not be supported.
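The projection approach mentioned there might look like the following sketch for the original paging query; QuestionSummary and the Title property are assumptions, not taken from the question:

// Read-only paging projection: only the columns the list page needs.
var summaries = ctx.Questions
    .AsNoTracking()
    .OrderByDescending(q => q.CreatedDate)
    .Select(q => new QuestionSummary
    {
        Id = q.Id,
        Title = q.Title,
        ViewCount = q.ViewCount,
        TagNames = q.Tags.Select(t => t.Name),
        UpvoteCount = q.Upvotes.Count(),
        DownvoteCount = q.Downvotes.Count()
    })
    .Skip(skipCount)
    .Take(pageSize)
    .ToList();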

I disagree with Ken2k's answer and am surprised that it has as many upvotes as it does.

The code may be fine in the sense that it compiles, but having that many includes is definitely not OK if you care about your queries being performant. See 8.2.2 of MSFT's EF6 Performance Whitepaper:

When we hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements.

Taking a look at the TSQL that EF generates from eagerly loading that many navigation properties in one query (via the numerous .Include() statements) will make it obvious why this is no good. You're going to end up with way too many EF generated joins in one query.

Break up your query so that there are no more than 2 .Include() statements per table fetch. You can do a separate .Load() per dataset but you most likely don't need to go that far, YMMV.

var query = ctx.Questions.Where(...);
// Loads Questions, Attachments, Location tables
query.Include(q => q.Attachments)
     .Include(q => q.Location)
     .Load();

// Loads IdentityUsers Table
query.Select(q => q.CreatedBy).Load();
// Loads Tags
query.Select(q => q.Tags).Load();

// Loads Upvotes and Downvotes
query.Include(q => q.Upvotes)
     .Include(q => q.Downvotes)
     .Load();

// Assuming Upvotes.CreatedBy and Downvotes.CreatedBy are also an IdentityUser,
// then you don't need to do anything further as the IdentityUser table is loaded
// from query.Select(q => q.CreatedBy).Load(); and EF will make this association for you

Erik mentions that you can use .AsNoTracking(), and I'm not totally sure at what point he is recommending using it, but if you need to consume the resulting entity set with populated navigation properties (for example, the query above) you cannot use .AsNoTracking(), as this invalidates the association between entities in EF's cache (once again, from 8.2.2 of MSFT's doc):

This [breaking up the EF query] will work only on tracked queries, as we are making use of the ability the context has to perform identity resolution and association fixup automatically.

For added performance, if your query is read-only (i.e. you are not updating values), you can set the following properties on your DbContext (assuming you eagerly load all required data):

Configuration.LazyLoadingEnabled = false;
Configuration.AutoDetectChangesEnabled = false;
Configuration.ProxyCreationEnabled = false;
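A sketch of where those flags might live, assuming a hypothetical context subclass reserved for read-only query paths:

public class ReadOnlyQuestionsContext : QuestionsContext
{
    public ReadOnlyQuestionsContext()
    {
        // Read-only usage: skip lazy loading, change tracking, and proxy creation.
        Configuration.LazyLoadingEnabled = false;
        Configuration.AutoDetectChangesEnabled = false;
        Configuration.ProxyCreationEnabled = false;
    }
}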

Finally, your DbContext should have a per-request lifetime/scope.

To Ken's point: if your database architecture is a mess, running the profiler and viewing the execution plan can certainly help you tweak indexes and identify other problems. But before even thinking of opening the profiler, break up your query, limiting the number of .Include() calls per .Load(), and you should see a tremendous speed improvement from that alone.

Licensed under: CC-BY-SA with attribution