Question

I've got a client portal project nearing its first release (it's the first one I've developed, so basic best practice is what I'm looking for here, nothing fancy).

A simplification of the main record types used in reporting is the following:

CREATE TABLE [dbo].[conversions](
    [conversion_id] [nvarchar](128) primary key NOT NULL,
    [click_id] [int] NULL,
    [conversion_date] [datetime] NOT NULL,
    [last_updated] [datetime] NULL,
    [click_date] [datetime] NULL,
    [affiliate_affiliate_id] [int] NOT NULL,
    [advertiser_advertiser_id] [int] NOT NULL,
    [offer_offer_id] [int] NOT NULL,
    [creative_creative_id] [int] NOT NULL,
    [conversion_type] [nvarchar](max) NULL)

CREATE TABLE [dbo].[clicks](
    [click_id] [int] primary key NOT NULL,
    [click_date] [datetime] NOT NULL,
    [affiliate_affiliate_id] [int] NOT NULL,
    [advertiser_advertiser_id] [int] NOT NULL,
    [offer_offer_id] [int] NOT NULL,
    [campaign_id] [int] NOT NULL,
    [creative_creative_id] [int] NOT NULL,
    [ip_address] [nvarchar](max) NULL,
    [user_agent] [nvarchar](max) NULL,
    [referrer_url] [nvarchar](max) NULL,
    [region_region_code] [nvarchar](max) NULL,
    [total_clicks] [int] NOT NULL)

My specific question is: given millions of rows in each table, what mechanism is used to serve up summary reports quickly on demand, given that you know in advance all the possible reports that can be requested?

Performance-wise, the starting point is that raw queries against 18 months' worth of data for the busiest client yield 3 to 5 seconds of latency on my dashboard, and the worst case is upwards of 10 seconds for a summary report with a custom date range spanning all the rows.

I know I can cache them after the first hit, but I want snappy performance on the first hit.

My feeling is that this is a fundamental aspect of an application of this nature, and that there are tons of applications like this out there, so is there an already well-thought-out method of pre-calculating tables that already do the grouping and aggregation? And how do you keep them up to date? Do you use SQL Agent and custom console apps that brute-force the calculations beforehand?

Any general pointers would be much appreciated.
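
For what it's worth, here is the sort of thing I'm imagining: a hypothetical daily rollup table plus an incremental MERGE refresh that a SQL Agent job could run on a schedule. The rollup's name, grain and grouping columns below are made up purely for illustration; the idea is just that the dashboard reads from the small rollup instead of scanning millions of raw rows.

CREATE TABLE [dbo].[daily_conversion_summary](
    [summary_date] [date] NOT NULL,
    [affiliate_affiliate_id] [int] NOT NULL,
    [offer_offer_id] [int] NOT NULL,
    [conversion_count] [int] NOT NULL,
    constraint [pk_daily_conversion_summary] primary key
        ([summary_date], [affiliate_affiliate_id], [offer_offer_id]))
go

-- Re-aggregate only a recent window so the job stays cheap; schedule via SQL Agent.
DECLARE @from date = DATEADD(DAY, -2, CAST(GETDATE() AS date));

MERGE [dbo].[daily_conversion_summary] AS tgt
USING (
    SELECT CAST([conversion_date] AS date) AS [summary_date],
           [affiliate_affiliate_id],
           [offer_offer_id],
           COUNT(*) AS [conversion_count]
    FROM [dbo].[conversions]
    WHERE [conversion_date] >= @from
    GROUP BY CAST([conversion_date] AS date), [affiliate_affiliate_id], [offer_offer_id]
) AS src
ON  tgt.[summary_date] = src.[summary_date]
AND tgt.[affiliate_affiliate_id] = src.[affiliate_affiliate_id]
AND tgt.[offer_offer_id] = src.[offer_offer_id]
WHEN MATCHED THEN
    UPDATE SET tgt.[conversion_count] = src.[conversion_count]
WHEN NOT MATCHED THEN
    INSERT ([summary_date], [affiliate_affiliate_id], [offer_offer_id], [conversion_count])
    VALUES (src.[summary_date], src.[affiliate_affiliate_id], src.[offer_offer_id], src.[conversion_count]);
go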


Solution

Both tables are time series, yet they are clustered by an ID column, which has little value for how time series are queried. Time series are almost always queried by date range, so your clustered organization should serve that type of query first and foremost: cluster by date, and move the ID primary key constraint into a non-clustered index.

CREATE TABLE [dbo].[conversions](
    [conversion_id] [nvarchar](128) NOT NULL,
    [conversion_date] [datetime] NOT NULL,
    ...
    constraint pk_conversions primary key nonclustered ([conversion_id]))
go

create clustered index [cdx_conversions] on [dbo].[conversions]([conversion_date]);
go

CREATE TABLE [dbo].[clicks](
    [click_id] [int] NOT NULL,
    [click_date] [datetime] NOT NULL,
    ...
    constraint [pk_clicks] primary key nonclustered ([click_id]));
go

create clustered index [cdx_clicks] on [dbo].[clicks]([click_date]);

This model will serve the typical queries that filter by a range on [click_date] or on [conversion_date]. For any other query pattern, the answer will be very specific to that query.
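
For example, a summary over a date window like the one below turns into a range scan on the clustered date index instead of a full table scan (the dates and grouping column are illustrative only):

SELECT [offer_offer_id],
       COUNT(*) AS [conversions]
FROM [dbo].[conversions]
WHERE [conversion_date] >= '20230101'
  AND [conversion_date] < '20230201'
GROUP BY [offer_offer_id];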

There are limits to how useful a relational, row-organized model can be for an OLAP/DW workload like yours. Specialized tools do a better job at it. Columnstore indexes can deliver amazingly fast responses, but they are difficult to update. Creating a MOLAP cube can also deliver blazing results, but that is a serious project undertaking. There are even specialized time-series databases out there.
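
If you want to experiment with the columnstore route, a minimal sketch follows (the index name and column list are illustrative, and only the non-MAX columns are included since columnstore indexes do not support nvarchar(max)). Be aware that on SQL Server 2012 a nonclustered columnstore index makes the table effectively read-only until it is disabled or dropped, which is the update difficulty mentioned above; later versions relax this.

CREATE NONCLUSTERED COLUMNSTORE INDEX [ncs_conversions_reporting]
ON [dbo].[conversions] (
    [conversion_date],
    [affiliate_affiliate_id],
    [advertiser_advertiser_id],
    [offer_offer_id],
    [creative_creative_id]);
go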

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow