What is the “N+1 selects problem” in ORM (Object-Relational Mapping)?

https://stackoverflow.com/questions/97197

01-07-2019
|

Question

The "N+1 selects problem" is generally stated as a problem in Object-Relational mapping (ORM) discussions, and I understand that it has something do to with having to make a lot of database queries for something that seems simple in the object world.

Does anybody have a more detailed explanation of the problem?

Solution

Let's say you have a collection of Car objects (database rows), and each Car has a collection of Wheel objects (also rows). In other words, Car -> Wheel is a 1-to-many relationship.

Now, let's say you need to iterate through all the cars, and for each one, print out a list of the wheels. The naive O/R implementation would do the following:

SELECT * FROM Cars;

And then for each Car:

SELECT * FROM Wheel WHERE CarId = ?

In other words, you have one select for the Cars, and then N additional selects, where N is the total number of cars.

Alternatively, one could get all wheels and perform the lookups in memory:

SELECT * FROM Wheel

This reduces the number of round-trips to the database from N+1 to 2. Most ORM tools give you several ways to prevent N+1 selects.

Reference: Java Persistence with Hibernate, chapter 13.

OTHER TIPS

SELECT 
table1.*
, table2.*
INNER JOIN table2 ON table2.SomeFkId = table1.SomeId

That gets you a result set where child rows in table2 cause duplication by returning the table1 results for each child row in table2. O/R mappers should differentiate table1 instances based on a unique key field, then use all the table2 columns to populate child instances.

SELECT table1.*

SELECT table2.* WHERE SomeFkId = #

The N+1 is where the first query populates the primary object and the second query populates all the child objects for each of the unique primary objects returned.

Consider:

class House
{
    int Id { get; set; }
    string Address { get; set; }
    Person[] Inhabitants { get; set; }
}

class Person
{
    string Name { get; set; }
    int HouseId { get; set; }
}

and tables with a similar structure. A single query for the address "22 Valley St" may return:

Id Address      Name HouseId
1  22 Valley St Dave 1
1  22 Valley St John 1
1  22 Valley St Mike 1

The O/RM should fill an instance of Home with ID=1, Address="22 Valley St" and then populate the Inhabitants array with People instances for Dave, John, and Mike with just one query.

A N+1 query for the same address used above would result in:

Id Address
1  22 Valley St

with a separate query like

SELECT * FROM Person WHERE HouseId = 1

and resulting in a separate data set like

Name    HouseId
Dave    1
John    1
Mike    1

and the final result being the same as above with the single query.

The advantages to single select is that you get all the data up front which may be what you ultimately desire. The advantages to N+1 is query complexity is reduced and you can use lazy loading where the child result sets are only loaded upon first request.

Supplier with a one-to-many relationship with Product. One Supplier has (supplies) many Products.

***** Table: Supplier *****
+-----+-------------------+
| ID  |       NAME        |
+-----+-------------------+
|  1  |  Supplier Name 1  |
|  2  |  Supplier Name 2  |
|  3  |  Supplier Name 3  |
|  4  |  Supplier Name 4  |
+-----+-------------------+

***** Table: Product *****
+-----+-----------+--------------------+-------+------------+
| ID  |   NAME    |     DESCRIPTION    | PRICE | SUPPLIERID |
+-----+-----------+--------------------+-------+------------+
|1    | Product 1 | Name for Product 1 |  2.0  |     1      |
|2    | Product 2 | Name for Product 2 | 22.0  |     1      |
|3    | Product 3 | Name for Product 3 | 30.0  |     2      |
|4    | Product 4 | Name for Product 4 |  7.0  |     3      |
+-----+-----------+--------------------+-------+------------+

Factors:

Lazy mode for Supplier set to “true” (default)
Fetch mode used for querying on Product is Select
Fetch mode (default): Supplier information is accessed
Caching does not play a role for the first time the
Supplier is accessed

Fetch mode is Select Fetch (default)

// It takes Select fetch mode as a default
Query query = session.createQuery( "from Product p");
List list = query.list();
// Supplier is being accessed
displayProductsListWithSupplierName(results);

select ... various field names ... from PRODUCT
select ... various field names ... from SUPPLIER where SUPPLIER.id=?
select ... various field names ... from SUPPLIER where SUPPLIER.id=?
select ... various field names ... from SUPPLIER where SUPPLIER.id=?

Result:

1 select statement for Product
N select statements for Supplier

This is N+1 select problem!

I can't comment directly on other answers, because I don't have enough reputation. But it's worth noting that the problem essentially only arises because, historically, a lot of dbms have been quite poor when it comes to handling joins (MySQL being a particularly noteworthy example). So n+1 has, often, been notably faster than a join. And then there are ways to improve on n+1 but still without needing a join, which is what the original problem relates to.

However, MySQL is now a lot better than it used to be when it comes to joins. When I first learned MySQL, I used joins a lot. Then I discovered how slow they are, and switched to n+1 in the code instead. But, recently, I've been moving back to joins, because MySQL is now a heck of a lot better at handling them than it was when I first started using it.

These days, a simple join on a properly indexed set of tables is rarely a problem, in performance terms. And if it does give a performance hit, then the use of index hints often solves them.

This is discussed here by one of the MySQL development team:

http://jorgenloland.blogspot.co.uk/2013/02/dbt-3-q3-6-x-performance-in-mysql-5610.html

So the summary is: If you've been avoiding joins in the past because of MySQL's abysmal performance with them, then try again on the latest versions. You'll probably be pleasantly surprised.

We moved away from the ORM in Django because of this problem. Basically, if you try and do

for p in person:
    print p.car.colour

The ORM will happily return all people (typically as instances of a Person object), but then it will need to query the car table for each Person.

A simple and very effective approach to this is something I call "fanfolding", which avoids the nonsensical idea that query results from a relational database should map back to the original tables from which the query is composed.

Step 1: Wide select

  select * from people_car_colour; # this is a view or sql function

This will return something like

  p.id | p.name | p.telno | car.id | car.type | car.colour
  -----+--------+---------+--------+----------+-----------
  2    | jones  | 2145    | 77     | ford     | red
  2    | jones  | 2145    | 1012   | toyota   | blue
  16   | ashby  | 124     | 99     | bmw      | yellow

Step 2: Objectify

Suck the results into a generic object creator with an argument to split after the third item. This means that "jones" object won't be made more than once.

Step 3: Render

for p in people:
    print p.car.colour # no more car queries

See this web page for an implementation of fanfolding for python.

Suppose you have COMPANY and EMPLOYEE. COMPANY has many EMPLOYEES (i.e. EMPLOYEE has a field COMPANY_ID).

In some O/R configurations, when you have a mapped Company object and go to access its Employee objects, the O/R tool will do one select for every employee, wheras if you were just doing things in straight SQL, you could select * from employees where company_id = XX. Thus N (# of employees) plus 1 (company)

This is how the initial versions of EJB Entity Beans worked. I believe things like Hibernate have done away with this, but I'm not too sure. Most tools usually include info as to their strategy for mapping.

Here's a good description of the problem - https://web.archive.org/web/20160310145416/http://www.realsolve.co.uk/site/tech/hib-tip-pitfall.php?name=why-lazy

Now that you understand the problem it can typically be avoided by doing a join fetch in your query. This basically forces the fetch of the lazy loaded object so the data is retrieved in one query instead of n+1 queries. Hope this helps.

In my opinion the article written in Hibernate Pitfall: Why Relationships Should Be Lazy is exactly opposite of real N+1 issue is.

If you need correct explanation please refer Hibernate - Chapter 19: Improving Performance - Fetching Strategies

Select fetching (the default) is extremely vulnerable to N+1 selects problems, so we might want to enable join fetching

Check Ayende post on the topic: Combating the Select N + 1 Problem In NHibernate

Basically, when using an ORM like NHibernate or EntityFramework, if you have a one-to-many (master-detail) relationship, and want to list all the details per each master record, you have to make N + 1 query calls to the database, "N" being the number of master records: 1 query to get all the master records, and N queries, one per master record, to get all the details per master record.

More database query calls --> more latency time --> decreased application/database performance.

However, ORM's have options to avoid this problem, mainly using "joins".

The N+1 query issue happens when you forget to fetch an association and then you need to access it:

List<PostComment> comments = entityManager.createQuery(
    "select pc " +
    "from PostComment pc " +
    "where pc.review = :review", PostComment.class)
.setParameter("review", review)
.getResultList();

LOGGER.info("Loaded {} comments", comments.size());

for(PostComment comment : comments) {
    LOGGER.info("The post title is '{}'", comment.getPost().getTitle());
}

Which generates the following SQL statements:

SELECT pc.id AS id1_1_, pc.post_id AS post_id3_1_, pc.review AS review2_1_
FROM   post_comment pc
WHERE  pc.review = 'Excellent!'

INFO - Loaded 3 comments

SELECT pc.id AS id1_0_0_, pc.title AS title2_0_0_
FROM   post pc
WHERE  pc.id = 1

INFO - The post title is 'Post nr. 1'

SELECT pc.id AS id1_0_0_, pc.title AS title2_0_0_
FROM   post pc
WHERE  pc.id = 2

INFO - The post title is 'Post nr. 2'

SELECT pc.id AS id1_0_0_, pc.title AS title2_0_0_
FROM   post pc
WHERE  pc.id = 3

INFO - The post title is 'Post nr. 3'

First, Hibernate executes the JPQL query, and a list of PostComment entities is fetched.

Then, for each PostComment, the associated post property is used to generate a log message containing the Post title.

Because the post association is not initialized, Hibernate must fetch the Post entity with a secondary query, and for N PostComment entities, N more queries are going to be executed (hence the N+1 query problem).

First, you need proper SQL logging and monitoring so that you can spot this issue.

Second, this kind of issue is better to be caught by integration tests. You can use an automatic JUnit assert to validate the expected count of generated SQL statements. The db-unit project already provides this functionality, and it's open source.

When you identified the N+1 query issue, you need to use a JOIN FETCH so that child associations are fetched in one query, instead of N. If you need to fetch multiple child associations, it's better to fetch one collection in the initial query and the second one with a secondary SQL query.

The supplied link has a very simply example of the n + 1 problem. If you apply it to Hibernate it's basically talking about the same thing. When you query for an object, the entity is loaded but any associations (unless configured otherwise) will be lazy loaded. Hence one query for the root objects and another query to load the associations for each of these. 100 objects returned means one initial query and then 100 additional queries to get the association for each, n + 1.

http://pramatr.com/2009/02/05/sql-n-1-selects-explained/

One millionaire has N cars. You want to get all (4) wheels.

One (1) query loads all the cars, but for each (N) car a separate query is submitted for loading wheels.

Costs:

Assume indexes fit into ram.

1 + N query parsing and planing + index searching AND 1 + N + (N * 4) plate access for loading payload.

Assume indexes don't fit into ram.

Additional costs in worst case 1 + N plate accesses for loading index.

Summary

Bottle neck is plate access (ca. 70 times per second random access on hdd) An eager join select would also access the plate 1 + N + (N * 4) times for payload. So if the indexes fit into ram - no problem, its fast enough because only ram operations involved.

It is much faster to issue 1 query which returns 100 results than to issue 100 queries which each return 1 result.

N+1 select issue is a pain, and it makes sense to detect such cases in unit tests. I have developed a small library for verifying the number of queries executed by a given test method or just an arbitrary block of code - JDBC Sniffer

Just add a special JUnit rule to your test class and place annotation with expected number of queries on your test methods:

@Rule
public final QueryCounter queryCounter = new QueryCounter();

@Expectation(atMost = 3)
@Test
public void testInvokingDatabase() {
    // your JDBC or JPA code
}

The issue as others have stated more elegantly is that you either have a Cartesian product of the OneToMany columns or you're doing N+1 Selects. Either possible gigantic resultset or chatty with the database, respectively.

I'm surprised this isn't mentioned but this how I have gotten around this issue... I make a semi-temporary ids table. I also do this when you have the IN () clause limitation.

This doesn't work for all cases (probably not even a majority) but it works particularly well if you have a lot of child objects such that the Cartesian product will get out of hand (ie lots of OneToMany columns the number of results will be a multiplication of the columns) and its more of a batch like job.

First you insert your parent object ids as batch into an ids table. This batch_id is something we generate in our app and hold onto.

INSERT INTO temp_ids 
    (product_id, batch_id)
    (SELECT p.product_id, ? 
    FROM product p ORDER BY p.product_id
    LIMIT ? OFFSET ?);

Now for each OneToMany column you just do a SELECT on the ids table INNER JOINing the child table with a WHERE batch_id= (or vice versa). You just want to make sure you order by the id column as it will make merging result columns easier (otherwise you will need a HashMap/Table for the entire result set which may not be that bad).

Then you just periodically clean the ids table.

This also works particularly well if the user selects say 100 or so distinct items for some sort of bulk processing. Put the 100 distinct ids in the temporary table.

Now the number of queries you are doing is by the number of OneToMany columns.

Take Matt Solnit example, imagine that you define an association between Car and Wheels as LAZY and you need some Wheels fields. This means that after the first select, hibernate is going to do "Select * from Wheels where car_id = :id" FOR EACH Car.

This makes the first select and more 1 select by each N car, that's why it's called n+1 problem.

To avoid this, make the association fetch as eager, so that hibernate loads data with a join.

But attention, if many times you don't access associated Wheels, it's better to keep it LAZY or change fetch type with Criteria.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow