Question

I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).

I have come across some references to building similar warehouses which implement the concept of 'calendar tables'. This makes sense and is easy enough to do. Most examples I see, however, are calendar tables that limit their scope to 'day'. My data will need to be analyzed down to the hour level. Possibly minutes.

My question: would an implementation of calendar tables for hour/minute-level granularity be of value in terms of space-efficiency and query/sorting speed? If so, can you recommend a table structure and population method/example?

My primary data table will contain 20+ million rows of data at any given time and typical subsets for analysis are in the 1 to 5 million range. So, as you can see, that is a lot of timestamp fields.

Solution

Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.

If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.

2011-01-01 00:00:00
2011-01-01 00:01:00
2011-01-01 00:02:00
2011-01-01 00:03:00
2011-01-01 00:04:00

If you're doing "bucket" analysis, you might be better off with something like this.

bucket_start         bucket_end
--
2011-01-01 00:00:00  2011-01-01 00:01:00
2011-01-01 00:01:00  2011-01-01 00:02:00
2011-01-01 00:02:00  2011-01-01 00:03:00
2011-01-01 00:03:00  2011-01-01 00:04:00
2011-01-01 00:04:00  2011-01-01 00:05:00

Since SQL's BETWEEN operator includes both endpoints, you usually need to avoid it here: it would force you to express bucket_end as "bucket_start plus one minute, minus the smallest bit of time this server can recognize", which is hard to do reliably. (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value for bucket_start.)
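Instead, compare against a half-open range. Here's a rough sketch of a bucket query, assuming the bucket table is the "calendar" table defined below, plus a fact table called "events" with a timestamp column "event_ts" (those last two names are invented for illustration):

select c.bucket_start, count(*) as events_in_bucket
from calendar c
join events e
  on e.event_ts >= c.bucket_start   -- lower endpoint included
 and e.event_ts <  c.bucket_end     -- upper endpoint excluded
group by c.bucket_start
order by c.bucket_start;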

If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)

create table calendar (
  bucket_start timestamp primary key,
  bucket_end timestamp unique not null,
  CHECK (bucket_end = bucket_start + interval '1' minute)
  -- You also want a "no gaps" constraint, but I don't think you 
  -- can do that in a CHECK constraint in PostgreSQL. You might
  -- be able to use a trigger that counts the rows, and compares
  -- that count to the number of minutes between min(bucket_start)
  -- and max(bucket_start). Worst case, you can always run a report
  -- that counts the rows and sends you an email.
);

The UNIQUE constraint creates an implicit index in PostgreSQL.

This statement inserts one day's worth of rows (24 hours * 60 minutes) at a time; each run picks up where the previous run left off.

insert into calendar
select coalesce(
                (select max(bucket_end) from calendar),
                 cast('2011-01-01 00:00:00' as timestamp)
               )
             + cast((n || ' minute') as interval) as bucket_start,
       coalesce(
                (select max(bucket_end) from calendar),
                 cast('2011-01-01 00:00:00' as timestamp)
               )
             + cast(((n + 1) || ' minute') as interval) as bucket_end
from generate_series(0, (24*60) - 1) as n;

You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.
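Here is a rough sketch of such a wrapper, assuming PostgreSQL 11 or later (earlier versions can't COMMIT inside a routine); the procedure name and parameter are invented for illustration:

create or replace procedure fill_calendar(p_days integer)
language plpgsql
as $$
declare
  base timestamp;
begin
  for d in 1 .. p_days loop
    -- pick up where the previous day (or previous call) left off
    select coalesce(max(bucket_end), timestamp '2011-01-01 00:00:00')
      into base
      from calendar;

    insert into calendar
    select base + (n || ' minute')::interval,
           base + ((n + 1) || ' minute')::interval
    from generate_series(0, (24 * 60) - 1) as n;

    commit;  -- keep each transaction to one day's worth of rows
  end loop;
end;
$$;

call fill_calendar(365);  -- one year, committed one day at a time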

It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.

OTHER TIPS

In PostgreSQL, you can generate calendar tables of arbitrary length and granularity on the fly:

SELECT  CAST('2011-01-01' AS DATE) + (n || ' hour')::INTERVAL
FROM    generate_series(0, 23) n

This does not require recursion (as it would in some other database systems) and is the preferred way to generate this kind of volatile result set.
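generate_series also accepts timestamp bounds and an interval step, so minute-level buckets can be generated on the fly the same way; for example, one day of minute buckets:

SELECT  ts AS bucket_start,
        ts + INTERVAL '1 minute' AS bucket_end
FROM    generate_series(TIMESTAMP '2011-01-01 00:00:00',
                        TIMESTAMP '2011-01-01 23:59:00',
                        INTERVAL '1 minute') AS ts;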

In the data warehouses I have built, I used separate CALENDAR and TIME_OF_DAY dimensions. The first has a one-day granularity and the second a one-minute granularity.
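A minimal sketch of that layout, in PostgreSQL syntax since that's where the question is headed (all table and column names here are invented for illustration, not taken from an actual warehouse):

CREATE TABLE calendar_dim (
  date_key     DATE PRIMARY KEY,    -- one row per day
  year         SMALLINT NOT NULL,
  month        SMALLINT NOT NULL,
  day_of_week  SMALLINT NOT NULL
);

CREATE TABLE time_of_day_dim (
  time_key        TIME PRIMARY KEY, -- one row per minute: 1440 rows
  hour_of_day     SMALLINT NOT NULL,
  minute_of_hour  SMALLINT NOT NULL
);

CREATE TABLE sales_fact (
  date_key  DATE NOT NULL REFERENCES calendar_dim,
  time_key  TIME NOT NULL REFERENCES time_of_day_dim,
  amount    NUMERIC NOT NULL
);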

In two other cases I knew beforehand that no reporting would be required at a granularity finer than 15 minutes. In those cases, for simplicity, I used a single CALENDAR dimension with 96 records per day.
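In PostgreSQL, those 96 fifteen-minute buckets for a day could be generated with the same generate_series approach shown above (just an illustration):

SELECT  TIMESTAMP '2011-01-01' + n * INTERVAL '15 minutes' AS bucket_start,
        TIMESTAMP '2011-01-01' + (n + 1) * INTERVAL '15 minutes' AS bucket_end
FROM    generate_series(0, 95) AS n;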

I have used this approach in Oracle warehouses so far, but I might be involved in a PostgreSQL warehouse project this summer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow