Fact table - select distinct?

https://stackoverflow.com/questions/19292245

30-06-2022

Question

In my storage data model I have the following relations:

root_tbl -- 1:n -- entry_tbl -- n:1 -- action_tbl

There are a few more tables, but this covers the basics. Essentially, one ID from the root table has several rows in the entry table.

Example data:

root_tbl:

ID_root ; Country ; FK_User ; FK_Product
      1 ;      UK ;      23 ;      31
      2 ;      NL ;      42 ;      01


entry_tbl:

ID_entry ; FK_root ; FK_Action ; Duration
       1 ;       1 ;        42 ; 200ms
       2 ;       1 ;        10 ; 94ms
       3 ;       1 ;         9 ; 300ms
       4 ;       2 ;        10 ; 322ms
       5 ;       2 ;        30 ; 100ms

So far so good. With this data model it is easy to answer questions such as how many records have "UK" as country together with action "10". Now I would like to put this data into a fact table, but my problem is the relationship between these three tables. For example, if I used the records of entry_tbl as facts, I would have to do a select distinct on ID every time I count by country, user or product.

The fact table would look more or less like this (just imagine the strings as foreign keys):

fact_tbl:

ID ; FK_Action ; Duration ; Country ; User ; Product
1  ;        42 ;    200ms ;      UK ;   23 ;      31
1  ;        10 ;     94ms ;      UK ;   23 ;      31
1  ;         9 ;    300ms ;      UK ;   23 ;      31
2  ;        10 ;    322ms ;      NL ;   42 ;      01
2  ;        30 ;    100ms ;      NL ;   42 ;      01
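
As a minimal sketch, a table like this is simply the join of entry_tbl with root_tbl. The example below uses Python's built-in sqlite3 module purely for illustration (the thread itself is about a different database); the column name Duration_ms and storing durations as plain integers are assumptions made for the sketch.

```python
import sqlite3

# In-memory database holding the example rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE root_tbl (
    ID_root INTEGER PRIMARY KEY, Country TEXT,
    FK_User INTEGER, FK_Product INTEGER
);
CREATE TABLE entry_tbl (
    ID_entry INTEGER PRIMARY KEY, FK_root INTEGER,
    FK_Action INTEGER, Duration_ms INTEGER  -- assumed: duration as integer ms
);
INSERT INTO root_tbl VALUES (1, 'UK', 23, 31), (2, 'NL', 42, 1);
INSERT INTO entry_tbl VALUES
    (1, 1, 42, 200), (2, 1, 10, 94), (3, 1, 9, 300),
    (4, 2, 10, 322), (5, 2, 30, 100);
""")

# One fact row per entry; the root attributes repeat on every entry row.
fact_rows = conn.execute("""
    SELECT r.ID_root AS ID, e.FK_Action, e.Duration_ms,
           r.Country, r.FK_User, r.FK_Product
    FROM entry_tbl e
    JOIN root_tbl r ON r.ID_root = e.FK_root
    ORDER BY e.ID_entry
""").fetchall()

for row in fact_rows:
    print(row)
```

Country, user and product repeat once per entry row of the same root ID.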

This means I would have a lot of redundant data.

Is there any way around this? The fact table would contain roughly 300 to 500 million rows.

I hope you got my point. If anything is unclear, feel free to ask.

regards Thomas

Solution

Well, it's usual to perform an aggregation on a fact table, in which case a distinct would be moot.
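
To illustrate, here is a minimal sketch of an aggregation at the grain of the fact table, where no distinct is needed. It uses Python's built-in sqlite3 module only as a stand-in engine; the column names FK_User, FK_Product and Duration_ms are assumptions based on the example data in the question.

```python
import sqlite3

# Flattened fact table with the five example rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_tbl (
    ID INTEGER, FK_Action INTEGER, Duration_ms INTEGER,
    Country TEXT, FK_User INTEGER, FK_Product INTEGER
);
INSERT INTO fact_tbl VALUES
    (1, 42, 200, 'UK', 23, 31),
    (1, 10,  94, 'UK', 23, 31),
    (1,  9, 300, 'UK', 23, 31),
    (2, 10, 322, 'NL', 42,  1),
    (2, 30, 100, 'NL', 42,  1);
""")

# Each row is exactly one entry, so COUNT(*) and SUM() per action are
# already correct without any DISTINCT.
per_action = conn.execute("""
    SELECT FK_Action, COUNT(*) AS entries, SUM(Duration_ms) AS total_ms
    FROM fact_tbl
    GROUP BY FK_Action
    ORDER BY FK_Action
""").fetchall()

print(per_action)
```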

Here you need to use count(distinct) to count the number of IDs, but that is what a data warehouse is for. Similarly, you might have to run a sum(duration), a count(distinct user), or a count(distinct product).
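
A minimal sketch of such a query, again using Python's built-in sqlite3 module as a stand-in engine and assumed column names (FK_User, Duration_ms), shows the difference between counting rows and counting distinct IDs:

```python
import sqlite3

# Flattened fact table with the five example rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_tbl (
    ID INTEGER, FK_Action INTEGER, Duration_ms INTEGER,
    Country TEXT, FK_User INTEGER, FK_Product INTEGER
);
INSERT INTO fact_tbl VALUES
    (1, 42, 200, 'UK', 23, 31),
    (1, 10,  94, 'UK', 23, 31),
    (1,  9, 300, 'UK', 23, 31),
    (2, 10, 322, 'NL', 42,  1),
    (2, 30, 100, 'NL', 42,  1);
""")

# COUNT(*) counts entry rows; COUNT(DISTINCT ID) counts root records.
per_country = conn.execute("""
    SELECT Country,
           COUNT(*)                AS entry_rows,
           COUNT(DISTINCT ID)      AS root_records,
           COUNT(DISTINCT FK_User) AS users,
           SUM(Duration_ms)        AS total_ms
    FROM fact_tbl
    GROUP BY Country
    ORDER BY Country
""").fetchall()

print(per_country)
```

For the example data, each country has several entry rows but only one root record, which is exactly the number that count(distinct ID) recovers.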

I don't think you have a design problem; you just have to ensure that enough memory is available for your group by operations to run, as far as possible, without disk-based sorting. Monitor large queries through V$SQL_WORKAREA_ACTIVE, monitor the SGA and PGA cache advisors, and adjust the memory allocation if required.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow