Question

I'm having some trouble deciding what should go in a particular dimension and what should go in a fact table for a star schema I'm developing.

For the sake of example, let's say that the project is keeping track of houses for a property management company. Dimensions like the various dates, renter, contract, etc. are all fairly straightforward. For the house, no matter where the data lives, we want to keep track of the current owner, the current renter, the current rental contract, as well as things like neighborhood, address, current rental price, current market value, and so forth. Note that owner, renter and contract are themselves dimensions (and neighborhood and address may also be dimensions, but I don't care about those so much).

A lot of the data kept about houses will be used in filtering queries, or for the row and column headers of a cube. Some of it is only needed as ancillary information, looked at on a house by house basis, but not in aggregate.

Given the data, and what I need to do with it, I have (at least) three options:

  • DimHouse: house table is a dimension, with a lot of attributes that might look better in a fact table, but since they are used for browsing and filtering, they need to be here. Snowflaking/outriggers will be required for attributes like current renter.
  • FactHouse: have an accumulating snapshot of house information that is joined to other fact tables, perhaps using a trimmed down DimHouse as a bridge. This seems weird to me, but it puts what appear to be facts in a fact table.
    • Put current owner, current renter, etc. in the relevant fact tables and then keep those facts up to date as the owner/renter/etc. change (also weird, but would keep us in star schema land).

So I've been going down the dimension route. It gives me some heart burn, but it achieves the goal. I just want to know if there is a better way to organize the data. I don't mind redundancy (such as having a fact table and a dimension table with similar data) or snowflaking, if they make sense and are the best way to do things (for values of "best").

Was it helpful?

Solution

The thing about star schema is that it is purpose-built for making certain kinds of queries easy and efficient.

If you're finding that some kinds of queries aren't being helped along by your star because of what is a dimension and what is a fact, then you build additional stars around the alternative views of dimensions and facts that will more easily support the queries you want to perform.

You keep your transactional database normalized. When it comes to your BI datawarehouse you need to let your redundancy anxiety go to avoid the heart burn.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top