Question

I'm working on a dashboard page that does a lot of analytics to display both graphical and tabular data to users.

When the dashboard is filtered by a given year, I have to display analytics for the selected year, another year chosen for comparison, and historical averages from all time.

For the selected and comparison years, I create start/end DateTime objects that are set to the beginning_of_year and end_of_year.
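Building those boundaries looks roughly like this (a sketch; pulling the years out of params is just my assumption about where they come from):

selected_year   = params[:year].to_i              # hypothetical param name
comparison_year = params[:comparison_year].to_i   # hypothetical param name

start_time = DateTime.new(selected_year).beginning_of_year
end_time   = DateTime.new(selected_year).end_of_year
comp_start = DateTime.new(comparison_year).beginning_of_year
comp_end   = DateTime.new(comparison_year).end_of_year

The two per-year queries then look like this: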

year = Model.where("closed_at >= ?", start_time).where("closed_at <= ?", end_time).all
comp = Model.where("closed_at >= ?", comp_start).where("closed_at <= ?", comp_end).all

These queries are essentially the same, just with different date filters. I don't really see any way to optimize this besides using select(...) to fetch only the fields I need, which will probably be all of them anyway.
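(For reference, that would look something like the line below; the column list is just a guess at what the dashboard actually needs.)

year = Model.select(:id, :closed_at, :amount).where("closed_at >= ?", start_time).where("closed_at <= ?", end_time)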

Since there will be an average of 250-1000 records in a given year, they aren't "horrible" (in my not-very-skilled opinion).

However, the historical averages are causing me a lot of pain. In order to adequately show the averages, I have to query ALL the records for all time and perform calculations on them. This is a bad idea, but I don't know how to get around it.

all_for_average = Model.all

Surely people have run into these kinds of problems before and have some means of optimizing them? Returning somewhere in the ballpark of 2,000 - 50,000 records for historical average analysis can't be very efficient. However, I don't see another way to perform the analysis unless I first retrieve the records.

Option 1: Grab everything and filter using Ruby

Since I'm already grabbing everything via Model.all, I "could" drop the two per-year queries and simply pick the desired records out of that full set in Ruby instead. But this seems wrong... I'm literally "downloading" my DB (so to speak) and then querying it with Ruby code instead of SQL, which seems very inefficient. Has anyone tried this before and seen any performance gains?
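Roughly what I mean, as a sketch (closed_at and amount are the only columns I'm showing here):

all_records = Model.all.to_a    # materialize everything in memory once

# per-year slices, filtered in Ruby instead of SQL
year_records = all_records.select { |r| r.closed_at >= start_time && r.closed_at <= end_time }
comp_records = all_records.select { |r| r.closed_at >= comp_start && r.closed_at <= comp_end }

# all-time historical average, also computed in Ruby (assumes at least one record)
historical_avg = all_records.map(&:amount).reduce(0, :+).to_f / all_records.size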

Option 2: Use multiple SQL queries to get specific information

This would mean that instead of grabbing all records for a given time period, I would make several DB queries that return the "answers" directly, rather than analyzing the data in Ruby.

Instead of running something like this,

year = Model.where("closed_at >= ?", start_time).where("closed_at <= ?", end_time).all

I would perform multiple queries:

range = start_time..end_time

year_total_count     = Model.where(closed_at: range).count
year_amount_sum      = Model.where(closed_at: range).sum("amount")
year_count_per_month = Model.where(closed_at: range).group("EXTRACT(MONTH FROM closed_at)").count
# ...other queries to extract selected info...
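If the extra round trips ever became a problem, I assume several of these could be folded into a single grouped query; just a sketch, reusing the same columns:

monthly_stats = Model.where(closed_at: start_time..end_time)
                     .group("EXTRACT(MONTH FROM closed_at)")
                     .pluck("EXTRACT(MONTH FROM closed_at)", "COUNT(*)", "SUM(amount)")
# returns one [month, count, sum] row per month, all in a single query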

Again, this seems inefficient, but I'm not knowledgeable enough about SQL and Ruby performance to know which approach has obvious downsides.

I "can" code both routes and then compare them with each other, but it will take a few days to code/run them since there's a lot of information on the dashboard page I'm leaving out. Certainly these situations have been run into multiple times for dashboard/analytics pages; is there a general principle for these types of situations?

I'm using PostgreSQL on Rails 4. I've been looking into DB-specific solutions as well, as being "database agnostic" really is irrelevant for most applications.

Solution 2

After discussing the issue with other more experienced DBAs and developers, I decided I was trying to optimize a problem that didn't need any optimization yet.

For my particular use case, I would have a few hundred users a day running these queries anywhere from 5-20 times each, so I wasn't really having major performance issues (i.e., I'm not Google or Amazon servicing billions of requests a day).

I am actually just having the PostgreSQL DB execute the queries each time and I haven't noticed any major performance issues for my users; the page loads very quickly and the queries/graphs have no noticeable delay.

For others trying to solve similar issues, I recommend running it for a while in a staging environment to see if you really have a problem that needs solving in the first place.

If I hit performance hiccups, my first step will be adding indexes on the columns I query against, and my second step will be creating DB views that "pre-load" the query results more efficiently than computing them over the live data each time.
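(For anyone following the same path, the indexing step is just a small migration; "models"/"closed_at" below are simply the table and column from my question.)

class AddIndexOnClosedAt < ActiveRecord::Migration
  def change
    # speeds up the closed_at range filters used by the dashboard queries
    add_index :models, :closed_at
  end
end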

Thanks to the incredible advances in DB speed and technology, however, I don't have to worry about this problem.

I'm answering my own question so others can spend time resolving more profitable questions.

OTHER TIPS

Dan, I would look into using a materialized view (MV) for the all-time historical average. This would definitely fall under the "DB-specific" solutions category, as MVs are implemented differently in different databases (or sometimes not at all). Here is the basic PG documentation.

A materialized view is essentially a physical table, except its data is based on a query of other tables. In this case, you could create an MV based on a query that averages the historical data. That query only runs when the view is created or refreshed, not every time it is read. The dashboard could then do a simple read query against the MV instead of running the costly aggregation over the underlying table each time.
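A rough sketch of what that could look like in a Rails migration (the view, table, and column names here are only illustrative; adjust them to the real schema):

class CreateHistoricalAveragesView < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE MATERIALIZED VIEW historical_averages AS
      SELECT EXTRACT(MONTH FROM closed_at) AS month,
             COUNT(*)                      AS record_count,
             AVG(amount)                   AS average_amount
      FROM models
      GROUP BY EXTRACT(MONTH FROM closed_at);
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW historical_averages;"
  end
end

Note that in PostgreSQL the view does not update itself; after new records arrive you refresh it explicitly, e.g. ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW historical_averages;"), and the dashboard just reads from historical_averages.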

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow