Our data warehouse builds a snowflake schema from accumulated data delivered by a data source (and there is no way to invert the accumulation). One requirement we have to meet is that the schema must support reports based on date ranges.

Our schema looks like this (simplified):

+------------------------------------------+
| fact                                     |
+-------+-----------------+----------------+
|    id | statisticsDimId | dateRangeDimId |
+-------+-----------------+----------------+
|     1 |               1 |             10 |
|     2 |               2 |             11 |
|     3 |               3 |             12 |
|     4 |               4 |             13 |
|     5 |               5 |             14 |
|     6 |               5 |             15 |
|     7 |               5 |             16 |
|   ... |             ... |            ... |
| 10001 |            9908 |             11 |
| 10002 |            9909 |             11 |
+-------+-----------------+----------------+

+-------------------------------------------------+
| date_range_dimension                            |
+-------+--------------------+--------------------+
|    id | startDateTime      | endDateTime        |
+-------+--------------------+--------------------+
|    10 | '2012-01-01 00:00' | '2012-01-01 23:59' |
|    11 | '2012-01-01 00:00' | '2012-01-02 23:59' |
|    12 | '2012-01-01 00:00' | '2012-01-03 23:59' |
|    13 | '2012-01-01 00:00' | '2012-01-04 23:59' |
|    14 | '2012-01-01 00:00' | '2012-01-05 23:59' |
|    15 | '2012-01-01 00:00' | '2012-01-06 23:59' |
|    16 | '2012-01-01 00:00' | '2012-01-07 23:59' |
|    17 | '2012-01-01 00:00' | '2012-01-08 23:59' |
|    18 | '2012-01-01 00:00' | '2012-01-09 23:59' |
|   ... |                ... |                ... |
+-------+--------------------+--------------------+

+-----------------------------------------------------+
| statistics_dimension                                |
+-------+-------------------+-------------------+-----+
|    id | accumulatedValue1 | accumulatedValue2 | ... |
+-------+-------------------+-------------------+-----+
|     1 |    [not relevant] |    [not relevant] | ... |
|     2 |    [not relevant] |    [not relevant] | ... |
|     3 |    [not relevant] |    [not relevant] | ... |
|     4 |    [not relevant] |    [not relevant] | ... |
|     5 |    [not relevant] |    [not relevant] | ... |
|     6 |    [not relevant] |    [not relevant] | ... |
|     7 |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
| 10001 |    [not relevant] |    [not relevant] | ... |
| 10002 |    [not relevant] |    [not relevant] | ... |
+-------+-------------------+-------------------+-----+

We want to create our report data set with something like this:

SELECT *
    FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = [start]
AND
    date_range_dimension.endDateTime = [end]

The problem is that the data in our statistics dimension is already accumulated and we cannot invert the accumulation. We calculated the approximate number of rows in our fact table and got 5,250,137,022,180. There are about 2.5 million date range permutations for our data, and because of the accumulation we would have to precompute all of them into our date dimension and fact table. SQL's SUM function does not work for us because of the accumulation: you cannot add two values that belong to non-distinct sets. For example, the value for the range 2012-01-01 to 2012-01-05 already contains everything in the value for 2012-01-01 to 2012-01-03, so summing the two would count the shorter range twice.

Is there a best practice we could follow to make it computationally feasible? Is there something wrong with our schema design?

We need to report data about online trainings. The data source is a legacy data provider with parts that are more than 10 years old, so nobody can reconstruct the internal logic. The statistics dimension contains, for example, the progress (in %) a user has achieved in a web-based training (WBT), the number of calls per WBT page, the status of a WBT for a user (e.g. "completed"), and so on. The important thing about the data provider is that it only gives us a snapshot of the current state; we don't have access to historical data.


Solution

I'm assuming you are on some pretty strong hardware for this. Your design has one major drawback - the join between the fact table and the "statistics" dimension.

Generally, a fact table contains dimension keys and measures. It looks to me like there's probably a 1:1 relationship between your "statistics" dimension and your fact table. Since a fact table is essentially a many-to-many relationship table, it doesn't make sense to keep your stats in a separate table. In addition, you say the stats table has information "by user".

Any time you say "By X" in warehousing, you can almost always be sure that X should be a dimension.

I would see about building your fact table with the measures directly on it. I'm not sure what you mean by "inverting" the accumulation on the stats table - do you mean it is accumulated across date ranges? Across users? If the data's not atomic, the best you can do is give what you have...
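
To make that concrete, here is a minimal sketch of a fact table that carries the measures directly and treats the user as a dimension. This assumes PostgreSQL, and every name below except date_range_dimension is hypothetical, since the question doesn't show the real measures:

-- Hypothetical sketch: measures sit on the fact table itself, and the
-- "by user" information becomes a user dimension instead of a 1:1
-- statistics dimension.
CREATE TABLE user_dimension (
    user_key  bigint PRIMARY KEY,
    user_name text
);

CREATE TABLE wbt_dimension (
    wbt_key   bigint PRIMARY KEY,
    wbt_title text
);

CREATE TABLE fact_wbt_snapshot (
    user_key          bigint NOT NULL REFERENCES user_dimension (user_key),
    wbt_key           bigint NOT NULL REFERENCES wbt_dimension (wbt_key),
    date_range_dim_id bigint NOT NULL REFERENCES date_range_dimension (id),
    progress_percent  numeric,  -- accumulated measure, stored as delivered
    page_calls        bigint,   -- accumulated measure, stored as delivered
    wbt_status        text      -- e.g. 'completed'
);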

Other tips

You can reduce the number of dimensions needed for this task by:

  • adding a time dimension with daily granularity instead of your current date-range dimension
  • merging the statistics dimension into the fact table

In our current data warehouse we are using the following approach:

time_dimension
 time_key (bigint)
 time_date (date)
 (other time related columns)

fact_table
 (keys to other dimensions)
 time_key_start (bigint) /* reference to time_dimension, time_key */
 time_key_end (bigint)   /* reference to time_dimension, time_key */
 value_1
 value_2
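
For reference, a report for a given date range against this design joins the time dimension twice, once per key. A sketch, assuming PostgreSQL; the date literals are only examples:

-- Hypothetical query against the sketch above: resolve both smart keys
-- through the time dimension.
SELECT f.*
FROM fact_table f
JOIN time_dimension ts ON ts.time_key = f.time_key_start
JOIN time_dimension te ON te.time_key = f.time_key_end
WHERE ts.time_date = date '2012-01-01'
  AND te.time_date = date '2012-01-02';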

Furthermore, the keys in time_dimension are "smart": the key value is simply the Julian day number of the date. I know a lot of people won't agree with such a design, but when performance has to be improved, we can reduce the number of dimensions used in a query by filtering on time_key directly, with a condition like:

time_key_start = to_char(date '2012-01-01', 'J')::bigint
and
time_key_end = to_char(date '2012-01-02', 'J')::bigint

With such a design you can avoid all joins in your query. Then you have to focus on table partitioning and indexes to improve performance.
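
A minimal sketch of that end-to-end approach, assuming PostgreSQL; the generate_series bounds and the index name are only illustrative:

-- Populate the time dimension with one row per day; the surrogate key is
-- the Julian day number, so it can be computed from a date literal alone.
INSERT INTO time_dimension (time_key, time_date)
SELECT to_char(d, 'J')::bigint, d::date
FROM generate_series(date '2012-01-01', date '2022-12-31', interval '1 day') AS g(d);

-- Report query filtering on the smart keys directly, with no join needed.
SELECT *
FROM fact_table
WHERE time_key_start = to_char(date '2012-01-01', 'J')::bigint
  AND time_key_end   = to_char(date '2012-01-02', 'J')::bigint;

-- A composite index (or range partitioning on time_key_end) backs the filter.
CREATE INDEX fact_table_time_range_idx
    ON fact_table (time_key_start, time_key_end);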

It may also not be necessary to analyze the whole history of data; you could move some of it to an archive.
