Question

I have a table like

FieldA, FieldB, FieldC, FieldD, TheDate, Count

I have a web app which provides a dashboard of the top few Count by each of the fields. Not wanting to prematurely optimize, the original, brute force queries for these charts are like:

SELECT TOP (10)
    FieldA,
    SUM([Count]) AS Counts
FROM TheTable
WHERE @StartDate <= TheDate AND TheDate <= @EndDate
GROUP BY FieldA
ORDER BY Counts DESC

And the same for the other fields. But the server ends up selecting by date range independently for each chart and when there is a lot of data, the system is bogging down.

It seems wrong to get all the data in the app (once) and then do the summary by the columns locally. And maybe the RDBMS caches a lot of the records, so the second through fourth charts are more efficient than the first.

I'm using Azure SQL and neither SQL Server Management Studio nor DataGrip suggest any missing indices which could help.

Any thoughts on techniques to do multiple, similar operations on the same data? In general or for Azure SQL. Thanks.


Solution

suggest any missing indices which could help

You need to show what indexes you currently have defined, and the size & shape of your data (approximately how many rows are there historically? how many new rows are added each day/hour/other? how much do the fields vary? how often are they NULL?), for us to give properly relevant advice.
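As a rough sketch of how that information could be gathered, the statements below list the existing indexes and summarise the data for one field. The `dbo` schema is an assumption, and the column names are taken from the question; repeat the last query for the other fields as needed.

```sql
-- dbo.TheTable: table name from the question; the dbo schema is assumed.
EXEC sp_helpindex N'dbo.TheTable';   -- existing indexes and their key columns
EXEC sp_spaceused N'dbo.TheTable';   -- row count and space used

-- Rough shape of the data: date span, distinct values and NULLs for FieldA.
SELECT
    COUNT_BIG(*)           AS TotalRows,
    MIN(TheDate)           AS FirstDate,
    MAX(TheDate)           AS LastDate,
    COUNT(DISTINCT FieldA) AS DistinctFieldA,
    SUM(CASE WHEN FieldA IS NULL THEN 1 ELSE 0 END) AS NullFieldA
FROM dbo.TheTable;
```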

Any thoughts on

As most of your queries follow the stated pattern, you are always querying on a range of dates, so at the very least you need an index on [date]. Furthermore, it would probably be a good idea for this to be your clustered index[†], to reduce the page accesses needed for such range queries.
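A minimal sketch of that clustered index, using the question's column name TheDate. The index name, the constraint name PK_TheTable and the surrogate key column Id are hypothetical; the commented-out steps only apply if the primary key currently holds the clustered index.

```sql
-- Sketch only: all index/constraint names and the Id column are assumptions.
-- If the primary key is currently clustered, rebuild it as nonclustered first:
-- ALTER TABLE dbo.TheTable DROP CONSTRAINT PK_TheTable;
-- ALTER TABLE dbo.TheTable ADD CONSTRAINT PK_TheTable PRIMARY KEY NONCLUSTERED (Id);

-- Cluster the table on the date so a date-range query reads contiguous pages.
CREATE CLUSTERED INDEX CIX_TheTable_TheDate
    ON dbo.TheTable (TheDate);
```

Rebuilding the clustered index rewrites the whole table, so on a large table this is something to schedule rather than run ad hoc.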

If you can afford the extra space (and the extra RAM needed for your common working set to stay in memory to avoid IO thrashing) then you may get a noticeable boost from having indexes on ([date], FieldA), ([date], FieldB), ... so that the grouping has less sorting to do (once the data is found by date, the rows for each date are already in field order in the index used). If there are particular fields that are queried much more often than the others, then perhaps just do this to help the queries on those fields instead of spending the resources doing it for all of them.
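One such index might look like the sketch below. The index name is hypothetical, the column names come from the question, and the INCLUDE clause is an addition not mentioned above: it lets the chart query be answered from the index alone, at the cost of a slightly wider index.

```sql
-- Sketch only: index name is an assumption; repeat per field that needs it.
CREATE NONCLUSTERED INDEX IX_TheTable_TheDate_FieldA
    ON dbo.TheTable (TheDate, FieldA)
    INCLUDE ([Count]);   -- assumption: carry the summed column to avoid lookups
```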

[†] even if you have a unique integer as your primary key[‡] (or something else like a UUID)
[‡] and you should have a surrogate key in this example otherwise you could have rows that are otherwise identical which doesn't fit the relational model and could cause issues

As a side-note: date is a keyword, as it is a type, so I would avoid using that as a column name even in examples.

OTHER TIPS

As David mentioned, it is hard to do more than guess without more information about the data, and that includes guessing why no missing indexes are being suggested.

One possible reason for the absence of a missing-index suggestion is that the query gets a trivial plan. That is only a guess, because you have not uploaded the execution plan; please share it if you can.

You can index on quite a few columns, depending on how selective the data in those columns is.

An index whose leading key is the column in the ORDER BY may be more beneficial than one that leads with the date, since the date is used in an inequality predicate.

The columns in the SELECT and GROUP BY can go into a separate index or be combined into the first one, depending on what the metrics show.

So there are a lot of possibilities. We just don't know what works best from the information provided to us.

This may be a good use case for an indexed view. You can create one view per column and use the date as the leading column for the unique index.

There is a performance overhead to updating each indexed view, but it's almost certainly less than however many indexes would be required on the base table.
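A minimal sketch of one such indexed view, assuming the base table is dbo.TheTable with the question's columns FieldA, TheDate and [Count]. The view, alias and index names are illustrative. SQL Server requires SCHEMABINDING, two-part table names and a COUNT_BIG(*) column for an indexed view with GROUP BY, and the summed expression must not be nullable, hence the ISNULL.

```sql
-- Sketch only: object names are assumptions. GO is the SSMS/sqlcmd batch separator.
CREATE VIEW dbo.ColumnAView
WITH SCHEMABINDING
AS
SELECT
  TheDate AS DateColumn
 ,FieldA  AS ColumnA
 ,SUM(ISNULL([Count], 0)) AS ColumnCount
 ,COUNT_BIG(*) AS RowCnt   -- required in an indexed view that uses GROUP BY
FROM
  dbo.TheTable
GROUP BY
  TheDate
 ,FieldA;
GO

-- The unique clustered index materialises the view; the date leads for range scans.
CREATE UNIQUE CLUSTERED INDEX CIX_ColumnAView
    ON dbo.ColumnAView (DateColumn, ColumnA);
```

If the optimizer does not pick the view's index on its own, the WITH (NOEXPAND) table hint on the view reference is the usual way to force it.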

There's some work to be done so your front end queries the data correctly, but if the datatypes for the columns are similar enough (or you'll be displaying them all as character anyway) you can always change the structure to something like (ColumnName,ColumnValue,DateColumn,ColumnCount) by adding a fixed value for ColumnName to each indexed view. Then you can stack those views like so:

CREATE VIEW StackedSummary_V AS
SELECT
  ColumnName
 ,ColumnValue = CAST(ColumnA AS VARCHAR(MAX)) --Can use maximum value width instead of MAX
 ,DateColumn
 ,ColumnCount
FROM
  ColumnAView
WHERE
  ColumnName = 'ColumnA' --This is needed so SQL Server can do view elimination

  UNION ALL

SELECT
  ColumnName
 ,ColumnValue = CAST(ColumnB AS VARCHAR(MAX)) --Can use maximum value width instead of MAX
 ,DateColumn
 ,ColumnCount
FROM
  ColumnBView
WHERE
  ColumnName = 'ColumnB'

  UNION ALL
...<etc>
 

Then your dashboard query can be something like:

SELECT
  ColumnName
 ,ColumnValue
 ,DateColumn
 ,ColumnCount
FROM
  StackedSummary_V
WHERE
  ColumnName = 'ColumnA'
    AND DateColumn <= <whatever>
    AND DateColumn >= <whatever>
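
The query above returns one row per value per date. To get the "top few" figures the dashboard in the question needs, the per-date counts still have to be summed. A possible sketch, reusing the question's @StartDate and @EndDate parameters:

```sql
-- Sketch only: assumes StackedSummary_V exists as defined above.
SELECT TOP (10)
  ColumnValue
 ,SUM(ColumnCount) AS Counts
FROM
  StackedSummary_V
WHERE
  ColumnName = 'ColumnA'
    AND DateColumn >= @StartDate
    AND DateColumn <= @EndDate
GROUP BY
  ColumnValue
ORDER BY
  Counts DESC;
```
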
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange