I have a monolithic application in .NET Core 3.0 with Entity Framework Core 3.0, using:

  • a table with ~3 million records. Its structure is BusinessUnitId | ProfileId | Amount (it has more fields, but these are the important ones);
  • another table that looks like this: ProfileId | Price.

When I need to generate the report for all Business Units, I have to read ~3 million records from one table, then match profiles with their prices, then generate the view model and send it.

The problem is that it takes more than one minute, and this report should appear on a web page as a table, i.e. it should take no more than ~5 seconds.

I tried creating a separate table just for this report, but populating/recalculating it takes about 10 minutes. Currently I'm thinking of moving the report data to a separate table and building a service that updates it when needed.

What is the best practice or standard approach for this kind of problem, where you need to calculate a huge amount of data at run time, or run queries that take a long time to execute, but still need the result quickly?


Solution

If the purpose is to display the data on a page, the first step is to limit the amount of data returned. No one can process 3 million rows at once, so find ways to limit the result set: requiring search criteria, or returning only X rows per page, can drastically speed up performance with no real impact on usability.

Ensure you are using SQL to create the result set, rather than using Entity Framework to fetch all of table A and all of table B and then combining them in application memory; that is massively wasteful of resources. With proper indexes, a simple join even over 3 million rows (which you should never really have to do) shouldn't take that long. A stored procedure may be a better option than relying on Entity Framework to generate the SQL; the cached execution plan may also lead to some performance gains.

Remember that a database is far more than simple data storage: it is really good at taking that data and transforming it into all sorts of result sets. Use that power rather than re-implementing it poorly in application code.
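As a sketch of that approach: a stored procedure that keeps the join, filtering, and paging entirely in the database and returns one page at a time. Table and column names follow the question; the procedure name and parameters are assumptions for illustration.

```sql
-- Sketch of a server-side, paged report query (SQL Server syntax).
-- Table/column names follow the question; procedure and parameter
-- names are assumptions.
CREATE PROCEDURE dbo.GetBusinessUnitReportPage
    @BusinessUnitId int,   -- require a search criterion
    @PageNumber     int,
    @PageSize       int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT     bu.BusinessUnitId,
               bu.ProfileId,
               bu.Amount,
               p.Price
    FROM       BusinessUnits bu
    INNER JOIN Profiles p
          ON   bu.ProfileId = p.ProfileId
    WHERE      bu.BusinessUnitId = @BusinessUnitId
    ORDER BY   bu.ProfileId
    OFFSET (@PageNumber - 1) * @PageSize ROWS
    FETCH NEXT @PageSize ROWS ONLY;  -- return X rows per page, not millions
END;
```

From EF Core the procedure can be called with `FromSqlRaw` / `FromSqlInterpolated`, so the application only ever materializes one page of rows.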

In extreme cases, reporting tables can be used that store the data in a de-normalized format for faster reads. These can be maintained by triggers on the underlying tables, or by scheduled jobs that pull in changes. This is generally done to avoid locking the underlying tables for reporting purposes, or because the data relationships are complex and take significant time to calculate. In the case of two tables that simply join into a lot of rows, there isn't going to be much benefit.
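If you do go the pre-calculated route on SQL Server, note that for a simple two-table join an indexed view can serve as the reporting table, and the engine maintains it automatically on every write, with no triggers or jobs needed. A sketch, assuming the table names from the question and assuming (BusinessUnitId, ProfileId) uniquely identifies a row:

```sql
-- Sketch: an indexed view materializes the join and is kept up to date
-- automatically by SQL Server. The view name is an assumption.
CREATE VIEW dbo.vReportData
WITH SCHEMABINDING
AS
SELECT     bu.BusinessUnitId,
           bu.ProfileId,
           bu.Amount,
           p.Price
FROM       dbo.BusinessUnits bu
INNER JOIN dbo.Profiles p
      ON   bu.ProfileId = p.ProfileId;
GO

-- The unique clustered index is what actually persists the view's rows.
CREATE UNIQUE CLUSTERED INDEX IX_vReportData
    ON dbo.vReportData (BusinessUnitId, ProfileId);
```

The trade-off is write overhead: every insert/update on the base tables also maintains the view, which is why this fits read-heavy reporting workloads.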

In either case, the real key is not to send 3 million records over the wire. The transmission time alone will be the biggest bottleneck, and there isn't anything you can do about that through programming. Always do as much filtering as possible at the database level so you send the smallest possible amount of data to your app and to your user.

Additional tips

Something must be off with your query. Is ProfileId a foreign key, and are you using Include to load the related data?

If you can live with slightly stale data, the most effective option is to cache the data, either explicitly or even as the complete view model.

The other option is to periodically regenerate the data using a scheduler. If you are on SQL Server, it ships with a scheduler, SQL Server Agent, that directly accepts SQL scripts.
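For example, an Agent job could run a refresh script like the following on a schedule. The reporting table name (ReportData) is an assumption; the source tables follow the question:

```sql
-- Sketch of a refresh script a SQL Server Agent job could run periodically.
-- dbo.ReportData is an assumed name for the pre-calculated reporting table.
BEGIN TRANSACTION;

TRUNCATE TABLE dbo.ReportData;

INSERT INTO dbo.ReportData (BusinessUnitId, ProfileId, Amount, Price)
SELECT     bu.BusinessUnitId,
           bu.ProfileId,
           bu.Amount,
           p.Price
FROM       dbo.BusinessUnits bu
INNER JOIN dbo.Profiles p
      ON   bu.ProfileId = p.ProfileId;

COMMIT TRANSACTION;
```

A full truncate-and-reload is the simplest variant; if the 10-minute rebuild time is dominated by unchanged rows, an incremental MERGE keyed on a last-modified column would be the next step.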

When I need to generate the report for all Business Units, I have to read ~3 million records from one table, then match profiles with their prices, then generate the view model and send it.

This suggests that you are running one query to get a set of data and then looping through it to query another. This is painfully easy to do with tools like Entity Framework, and it performs appallingly badly (this "1+N query" pattern is also untuneable and unscalable).

From the simple data you've described, you should need only one query:

select 
  bu.BusinessUnitId 
, bu.Amount 
, p.ProfileId 
, p.Price 
, . . . 
from       BusinessUnits bu 
inner join Profiles p 
      on   bu.profileId = p.profileId
. . . 

Get the Execution Plan for this query - you may find that you need to add an Index or two to support these join conditions.
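As an illustration, a covering index on the join column of the larger table might look like the following. The index name and included columns are assumptions; the execution plan will tell you what is actually needed:

```sql
-- Sketch of a supporting index for the join above; names are assumptions.
-- Profiles.ProfileId is presumably already covered by its primary key.
CREATE NONCLUSTERED INDEX IX_BusinessUnits_ProfileId
    ON BusinessUnits (ProfileId)
    INCLUDE (BusinessUnitId, Amount);  -- covering: avoids key lookups
```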

Licensed under: CC-BY-SA with attribution