Question

In the time dimension of our data warehouse we have many columns with boolean flags, for example:

  • is_ytd (is year to date)
  • is_mtd (is month to date)
  • is_current_date
  • is_current_month
  • is_current_year

Would it be a good indexing strategy to create a partial index on each such column? Something like:

CREATE INDEX tdim_is_current_month
  ON calendar (is_current_month)
  WHERE is_current_month;

Our time dimension has 136 columns and 7,000 rows; 53 of the columns are boolean indicators.

Why do we use flags instead of deriving the desired date range from current_date?

  1. Make life easier
  2. Enforce consistency
  3. Speed up queries
  4. Provide not-so-easy-to-derive indicators
  5. Make other tools easier to use

Ad 1) Once you join the time dimension (which you do almost every time you analyze a fact table in a data warehouse), it is much easier to write where is_current_year instead of where extract(year from time_date) = extract(year from current_date).
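For illustration, a hypothetical side-by-side comparison (the amount column on the fact table is made up for this example):

-- with the flag (amount is a hypothetical measure column):
select sum(f.amount)
from fact_table f
join calendar c using (time_key)
where c.is_current_year;

-- without the flag, the same filter spelled out by hand:
select sum(f.amount)
from fact_table f
join calendar c using (time_key)
where extract(year from c.time_date) = extract(year from current_date);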

Ad 2) Example: it sounds simple to figure out what year to date (YTD) means. We can start with time_date between date_trunc('year', current_date) and current_date. But some people would exclude current_date (which makes sense, because today is not finished yet). In that case we would use time_date between date_trunc('year', current_date) and (current_date - 1). Going further: what happens if for some reason the DW is not updated for a couple of days? Maybe then you would want YTD linked to the last day for which you have complete data from all source systems. When you have a common definition of what YTD means, you reduce the risk of different interpretations.
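A minimal sketch of how such a flag could be refreshed centrally once a day, under the convention that excludes the current (unfinished) day; the statement is illustrative, not our actual ETL code:

-- hypothetical nightly refresh encoding one agreed YTD definition
update calendar
set is_ytd = (time_date >= date_trunc('year', current_date)
          and time_date < current_date);

Whatever definition the team agrees on lives in this single statement, so every report that filters on is_ytd inherits it automatically.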

Ad 3) I think it should be faster to filter data on an indexed boolean column than on an expression calculated on the fly.

Ad 4) Some flags are not so easy to create. For example, we have the flags is_first_workday_in_month and is_last_workday_in_month.
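To give a flavor of why such flags are hard to derive on the fly, here is a rough sketch that marks the first Monday-to-Friday day of each month. It deliberately ignores public holidays, which a real workday flag must also handle (and which is exactly what makes these flags hard):

-- sketch only: treats 'workday' as Monday-Friday, ignoring holidays
update calendar c
set is_first_workday_in_month = (c.time_date = sub.first_workday)
from (select date_trunc('month', time_date) as month_start,
             min(time_date) as first_workday
      from calendar
      where extract(isodow from time_date) < 6  -- 1=Mon .. 5=Fri
      group by 1) sub
where date_trunc('month', c.time_date) = sub.month_start;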

Ad 5) In some tools it is easier to use existing columns than SQL expressions. For example, when creating dimensions for an OLAP cube it is much easier to add a table column as a level of a hierarchy than to construct that level with a SQL expression.

Testing indexes for boolean flags

I tested all indexed flags and ran explain analyze for a simple query with one fact table and the time dimension (named calendar), filtering on one flag at a time (shown here with is_qtd):

select count(*) from fact_table join calendar using (time_key) where is_qtd

For most flags I get an Index Scan:

"Aggregate  (cost=4022.80..4022.81 rows=1 width=0) (actual time=38.642..38.642 rows=1 loops=1)"
"  ->  Hash Join  (cost=13.12..4019.73 rows=1230 width=0) (actual time=38.640..38.640 rows=0 loops=1)"
"        Hash Cond: (fact_table.time_key = calendar.time_key)"
"        ->  Seq Scan on fact_table  (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.006..17.769 rows=198495 loops=1)"
"        ->  Hash  (cost=12.58..12.58 rows=43 width=2) (actual time=0.054..0.054 rows=43 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 2kB"
"              ->  Index Scan using cal_is_qtd on calendar  (cost=0.00..12.58 rows=43 width=2) (actual time=0.014..0.049 rows=43 loops=1)"
"                    Index Cond: (is_qtd = true)"
"Total runtime: 38.679 ms"

For some flags I get a Bitmap Heap Scan combined with a Bitmap Index Scan:

"Aggregate  (cost=13341.07..13341.08 rows=1 width=0) (actual time=100.972..100.973 rows=1 loops=1)"
"  ->  Hash Join  (cost=6656.54..13001.52 rows=135820 width=0) (actual time=5.729..86.972 rows=198495 loops=1)"
"        Hash Cond: (fact_table.time_key = calendar.time_key)"
"        ->  Seq Scan on fact_table  (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.012..22.667 rows=198495 loops=1)"
"        ->  Hash  (cost=6597.19..6597.19 rows=4748 width=2) (actual time=5.706..5.706 rows=4748 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 158kB"
"              ->  Bitmap Heap Scan on calendar  (cost=97.05..6597.19 rows=4748 width=2) (actual time=0.440..4.971 rows=4748 loops=1)"
"                    Filter: is_past_quarter"
"                    ->  Bitmap Index Scan on cal_is_past_quarter  (cost=0.00..95.86 rows=3249 width=0) (actual time=0.395..0.395 rows=4748 loops=1)"
"                          Index Cond: (is_past_quarter = true)"
"Total runtime: 101.013 ms"

Only for two flags do I get a Seq Scan:

"Aggregate  (cost=17195.33..17195.34 rows=1 width=0) (actual time=122.108..122.108 rows=1 loops=1)"
"  ->  Hash Join  (cost=9231.13..16699.10 rows=198495 width=0) (actual time=23.960..108.018 rows=198495 loops=1)"
"        Hash Cond: (fact_table.time_key = calendar.time_key)"
"        ->  Seq Scan on fact_table  (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.012..22.153 rows=198495 loops=1)"
"        ->  Hash  (cost=9144.39..9144.39 rows=6939 width=2) (actual time=23.935..23.935 rows=6939 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 231kB"
"              ->  Seq Scan on calendar  (cost=0.00..9144.39 rows=6939 width=2) (actual time=17.427..22.908 rows=6939 loops=1)"
"                    Filter: is_eoq"
"Total runtime: 122.138 ms"

Solution

If is_current_month = true represents more than a few percent of the rows, then the index will not be used. And 7,000 rows is too few to even bother.
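One quick way to check a flag's selectivity before creating an index:

-- fraction of rows where the flag is true; a partial index only
-- pays off when this fraction is small
select avg(is_current_month::int) as fraction_true
from calendar;

With only 7,000 rows the whole dimension fits in a handful of pages anyway, so even a highly selective flag saves very little.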

OTHER TIPS

Maybe this is more of a comment than an answer...

Given that the query planner/optimizer gets the cardinalities and join type correct, the execution time of any query involving a join between your fact table and your time dimension will be determined by the size of the fact table.

Your time dimension will either be in cache all the time or be fully read in a few milliseconds; you will see bigger variations than that just from the current load. The rest of the execution time has nothing to do with the time dimension.

Having said that, I'm all for using every trick in the bag to help the query planner/optimizer come up with good enough estimates. Sometimes this means creating or disabling constraints and creating unnecessary indexes.
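For example, one way to inspect what the planner believes about a flag column is PostgreSQL's standard pg_stats view:

-- most_common_vals/most_common_freqs show the estimated
-- true/false distribution the planner works with
select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'calendar'
  and attname = 'is_current_month';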

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow