How to avoid many database round trips and a lot of irrelevant data?

https://stackoverflow.com/questions/7103230

17-12-2020
|

题

I have worked with various applications and encountered this situation many times. Until now I have not figured out what is the best approach.

Here's the scenario:

I have an application either desktop or web
I need to retrieve simple documents from the database. The document has a general details and item details so the database:

GeneralDetails table:

| DocumentID | DateCreated | Owner     |
| 1          | 07/07/07    | Naruto    |
| 2          | 08/08/08    | Goku      |
| 3          | 09/09/09    | Taguro    |

ItemDetails table

| DocumentID | Item        | Quantity  |
| 1          | Marbles     | 20        |
| 1          | Cards       | 56        |
| 2          | Yo-yo       | 1         |
| 2          | Chess board | 3         |
| 2          | GI Joe      | 12        |
| 3          | Rubber Duck | 1         |

As you can see, the tables have a one-to-many relationship. Now, in order to retrieve all the documents and their respective items, I always do either of the two:

Method 1 - Many round trips (pseudo-code):

 Documents = GetFromDB("select DocumentID, Owner " +
                       "from GeneralDetails") 
 For Each Document in Documents
{
    Display(Document["CreatedBy"])
    DocumentItems = GetFromDB("select Item, Quantity " + 
                              "from ItemDetails " + 
                              "where DocumentID = " + Document["DocumentID"] + "")
    For Each DocumentItem in DocumentItems
    {
        Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
    }
}

Method 2 - Much irrelevant data (pseudo-code):

DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " + 
                              "from GeneralDetails as g " +
                              "inner join ItemDetails as i " +
                              "on g.DocumentID = i.DocumentID")
//Display...

I used the first method when I was in college for desktop applications, the performance was not bad so I realized it was okay.

Until one day, I saw an article "Make the web faster", it says that many round trips to the database is bad; so ever since then I have used the second method.

On the second method, I avoided round trips by using inner join to retrieve the first and the second table at once, but it produces unecessary or redundant data. See the result set.

| DocumentID | Owner     | Item        | Quantity  |
| 1          | Naruto    | Marbles     | 20        |
| 1          | Naruto    | Cards       | 56        |
| 2          | Goku      | Yo-yo       | 1         |
| 2          | Goku      | Chess board | 3         |
| 2          | Goku      | GI Joe      | 12        |
| 3          | Taguro    | Rubber Duck | 1         |

The result set has redundant DocumentID and Owner. It looks like an unnormalized database.

Now, the question is, how do I avoid round trips and at the same time avoid redundant data?

解决方案

The method used by ActiveRecord and other ORMs is to select the first table, batch together the IDs and then use those IDs in an IN clause for the second select.

SELECT * FROM ItemDetails WHERE DocumentId IN ( [Comma Separated List of IDs here] )

Advantages:

No redundant data

Disadvantages:

Two queries

Generally speaking, the first method is referred to as the "N+1 query problem" and the solutions are referred to as "eager loading". I tend to see your "Method 2" as preferable as the latency to the database generally trumps the size of the redundant data over the data transfer rate, but YRMV. As with almost everything in software, it's a tradeoff.

其他提示

The inner join is better because the database has more possibilities to optimize.

In general you can not create a query like this which does not produce redundant results. For that, the relational model is too restrictive. I would just live with it: the database is responsible to optimize these cases.

If you really encounter performance problems (mainly because of a network bottleneck) you could write a stored procedure, that makes the query and denormalizes it. In your example you create create a result like:

| DocumentID | Owner     | Items                                   | Quantity    |
| 1          | Naruto    | Marbles, Cards                          | 20, 56      |
| 2          | Goku      | Yo-yo, Chess board, GI Joe, Rubber Duck | 1, 3, 12, 1 |

But this of course does not conform to the first normal form - so you will need to parse it on the client. If you use a database with XML support (like Oracle or MS SQL Server) you could even create an XML file on the server and send this to the client.

But whatever you do, remember: premature optimization is the root of all evil. Don't do this kind of stuff before you are not 100% sure, that you are really facing a problem you can solve like this.

You can read first table, extract keys of rows you want from second table and retrieve them via second select.

Something like

DocumentItems = GetFromDB("select Item, Quantity " + 
                          "from ItemDetails " + 
                          "where DocumentID in (" + LISTING_OF_KEYS + ")")

Your second method is definitely a way to go. But you do not have to select columns you are not going to use. So if you only need Item and Quantity, do this:

DocumentsAndItems = GetFromDB("select i.Item, i.Quantity " + 
                          "from GeneralDetails as g " +
                          "inner join ItemDetails as i " +
                          "on g.DocumentID = i.DocumentID")

(I suppose you have other conditions that you would put in where part of the query, otherwise the join is not necessary.)

If you are using .NET and MS SQL Server, the simple solution here would be to look into using MARS (Multiple Active Resultsets). Here's a sample code block pulled straight from the Visual Studio 2015 Help on a MARS demo:

using System;
using System.Data;
using System.Data.SqlClient;

class Class1
{
  static void Main()
  {
     // By default, MARS is disabled when connecting
     // to a MARS-enabled host.
     // It must be enabled in the connection string.
     string connectionString = GetConnectionString();

     int vendorID;
     SqlDataReader productReader = null;
     string vendorSQL = 
       "SELECT VendorId, Name FROM Purchasing.Vendor";
     string productSQL = 
       "SELECT Production.Product.Name FROM Production.Product " +
       "INNER JOIN Purchasing.ProductVendor " +
       "ON Production.Product.ProductID = " + 
       "Purchasing.ProductVendor.ProductID " +
       "WHERE Purchasing.ProductVendor.VendorID = @VendorId";

   using (SqlConnection awConnection = 
      new SqlConnection(connectionString))
   {
      SqlCommand vendorCmd = new SqlCommand(vendorSQL, awConnection);
      SqlCommand productCmd = 
        new SqlCommand(productSQL, awConnection);

      productCmd.Parameters.Add("@VendorId", SqlDbType.Int);

      awConnection.Open();
      using (SqlDataReader vendorReader = vendorCmd.ExecuteReader())
      {
        while (vendorReader.Read())
        {
          Console.WriteLine(vendorReader["Name"]);

          vendorID = (int)vendorReader["VendorId"];

          productCmd.Parameters["@VendorId"].Value = vendorID;
          // The following line of code requires
          // a MARS-enabled connection.
          productReader = productCmd.ExecuteReader();
          using (productReader)
          {
            while (productReader.Read())
            {
              Console.WriteLine("  " +
                productReader["Name"].ToString());
            }
          }
        }
      }
      Console.WriteLine("Press any key to continue");
      Console.ReadLine();
    }
  }
  private static string GetConnectionString()
  {
    // To avoid storing the connection string in your code,
    // you can retrive it from a configuration file.
    return "Data Source=(local);Integrated Security=SSPI;" + 
      "Initial Catalog=AdventureWorks;MultipleActiveResultSets=True";
  }
 }

Hopefully this puts you on a path to understanding. There are many different philosophies on the subject of round-tripping, and much of it depends on the type of application you're writing and the data store you're connecting to. If this is an intranet project and there aren't a huge number of simultaneous users then a large number of round-trips to the database is not the issue or concern you think it is, except for how it looks to your reputation not to have more streamlined code! (grin) If this is a web application then that's a different story, and you should try to ensure you aren't going back to the well too frequently if at all avoidable. MARS is a good answer for resolving this, since everything comes back from the server in one shot, and it's then up to you to iterate through the returned data. Hope this is useful to you!

The answer depends on your task.

1. If you want to generate List/Report then you need Method-2 with redundant data. You transfer more data over the network, but save time on generating the content.

2. If you want to display General list first and then display details by user's click then it is better to use Method-1. To generate and send limited data set will be very fast.

3. If you want to pre-load all data into the app then you can use XML. It will provide ALL not redundant data. However, there is an additional programming with XML coding in SQL and decoding on the client.

I'd do something like this to generate XML on SQL side:

;WITH t AS (
    SELECT g.DocumentID, g.Owner, i.Item, i.Quantity
    FROM GeneralDetails AS g
    INNER JOIN ItemDetails AS i 
    ON g.DocumentID = i.DocumentID
)
SELECT 1 as Tag, Null as Parent, 
    DocumentID as [Document!1!DocumentID],
    Owner as [Document!1!Owner],
    NULL as [ItemDetais!2!Item],
    NULL as [ItemDetais!2!Quantity]
FROM t GROUP BY DocumentID, Owner
UNION ALL
SELECT 2 as Tag, 1 as Parent, DocumentID, Owner, Item, Quantity
FROM t 
ORDER BY [Document!1!DocumentID], [Document!1!Owner], [ItemDetais!2!Item], [ItemDetais!2!Quantity]
FOR XML EXPLICIT;

As far as I see it you have a number of options

Concat your strings so that all of your items will appear without redundant data. i.e. "Marbles, Cards"
Return your query as a compressed XML file that your program can then parse as if it were the database.
- This gives you the advantage of only one trip, but you also get all of the data in one file that could be massive.
This item would be my person preference, implement a form of lazy loading.
- This means that the "additional" data is only loaded when required. So while this does have multiple trips, the trips are only to get the required data.

Somehow in my application with ~200 forms/screens and a database with ~300 tables I never had a need for neither first nor second method.

In my application quite often a user sees on screen two grids (tables), next to each other:

main GeneralDetails table with the list of documents (usually there is search function that limits results using a bunch various filters).
data from the ItemDetails table for the selected document. Not for all documents. Just for one current document. When a user selects a different document in the first grid, I (re)run a query to retrieve details of the selected document. Just for one selected document.

So, there is no join between master and details table. And, there is no loop to retrieve details for all master documents.

Why would you need to have details for all documents on the client?

I would say, that best practices boil down to common sense:

It is always good to transmit over the network only the data that you need, without redundancy. And it is always good to keep the number of queries/requests as low as possible. Instead of sending many requests in a loop, send one request that would return all necessary rows. Then slice and dice it on the client if really needed.

If there is a need to process somehow a batch of documents together with their details, that's a different story and so far I've always managed to do it on the server side, without transferring all this data to the client.

If for some reason there is a need to get a list of all master documents together with details for all documents to the client, I'd make two queries without any loops:

SELECT ... FROM GeneralDetails

SELECT ... FROM ItemDetails

These two queries would return two arrays of data and if needed, I would combine the master-detail data in the internal structures in memory on the client.

You can further optimize this process by retrieve the data you need from the two tables separately. Afterwards you can either loop through the records or join the tables to produce the same result set as it was coming from the SQL server.

With an ORM you can retrieve the entities separately in two round trips - one to retrieve GeneralDetails and another one to retrieve ItemDetails after examination of GeneralDetails.DocumentId. Altough there are two round trips to the DB it's way optimized than any of the other two methods.

Here is an NHibernate example:

void XXX()
{
    var queryGeneral = uow.Session.QueryOver<GeneralDetails>();
    var theDate = DateTime.Now.Subtract(5);
    queryGeneral.AndRestrictionOn(c => c.SubmittedOn).IsBetween(theDate).And(theDate.AddDays(3));

    // Whatever other criteria applies.

    var generalDetails = queryGeneral.List();

    var neededDocIds = generalDetails.Select(gd => gd.DocumentId).Distinct().ToArray();

    var queryItems = uow.Session.QueryOver<ItemDetails>();
    queryItem.AndRestrictionOn(id => id.DocumentId).IsIn(neededDocs);

    var itemDetails = queryItems.List();

    // The records from both tables are now in the generalDetails and itemDetails lists so you can manipulate them in memory...
}

I believe (don't have live example) with an ADO.NET data set you can actually save the second round trip to the DB. You don't even need to join the results; it's a matter of coding style and a workflow, but generally you could update your UI by working with the two result sets simultaneously,

void YYY()
{
    var sql = "SELECT *  FROM GeneralDetails WHERE DateCreated BETWEEN '2015-06-01' AND '2015-06-20';";
    sql += @"
            WITH cte AS (
                SELECT DocumentId FROM GeneralDetails WHERE DateCreated BETWEEN '2015-06-01' AND '2015-06-20'
            )
            SELECT * FROM ItemDetails INNER JOIN cte ON ItemDetails.DocumentId = cte.DocumentId";

    var ds = new DataSet();

    using (var conn = new SqlConnection("a conn string"))
    using (var da = new SqlDataAdapter())
    {
        conn.Open();
        da.SelectCommand = conn.CreateCommand();
        da.SelectCommand.CommandText = sql;
        da.Fill(ds);
    }

    // Now the two table are in the dataset so you can loop through them and do your stuff...
}

Note: I wrote the above code solely in the example sake and is not tested!

Since the time I asked this question, I realized there are other areas I can optimize my application in retrieving data. In such a case, I will do the following:

Ask myself, do I really need to retrieve many documents together with their subitems? Usually in a UI, I display records in a list, only when the user needs the subitems (if the user clicks the record) I will retrieve them.
If it is really necessary to display many records with subitems, post/comments for example, I will only provide some posts, think of pagination, or provide a "load more" function.

To summarize, I may end up doing lazy loading, retrieve data only when the user needs it.

A solution to avoid roundtrips to database server, although does not guarantee a performance boost since it requires more processing in the database server and in the application, is to retrieve multiple recordsets, one results to parent documents, and one results to subitems, see pseudocode:

 recordSets = GetData
     ("select * from parentDocs where [condition] ;
        select * from subItems where [condition]")

 //join the parent documents and subitems here

I might need a temporary table here for parent documents so I can use it for the condition in the second query, since I only need to retrieve subitems of the selected parent documents.

I should also point out that doing a benchmark is better than just applying principles right away since it's really a case to case basis.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow