Handling nulls in Datawarehouse

https://stackoverflow.com/questions/977924

13-09-2019

Question

I'd like to ask your input on what the best practice is for handling null or empty data values when it pertains to data warehousing and SSIS/SSAS.

I have several fact and dimension tables that contain null values in different rows.

Specifics:

1) What is the best way to handle null date/time values? Should I make a 'default' row in my time or date dimensions and point SSIS to the default row when a null is found?

2) What is the best way to handle null/empty values inside dimension data? Ex: I have some rows in an 'Accounts' dimension that have empty (not NULL) values in the Account Name column. Should I convert these empty or null values inside the column to a specific default value?

3) Similar to point 1 above: what should I do if I end up with a fact table row that has no record in one of the dimension columns? Do I need default dimension records for each dimension in case this happens?

4) Any suggestions or tips on how to handle these operations in SQL Server Integration Services (SSIS)? Best data flow configurations or best transformation objects to use would be helpful.

Thanks :-)

Solution

As the previous answer states, a Null dimension value can carry many different meanings: unknown, not applicable, and so on. If your application needs to distinguish between them, adding "pseudo" dimension entries can help.

In any case I would avoid Nulls in both fact foreign keys and dimension fields. Even a single 'unknown' dimension value helps your users write queries that include a catch-all grouping for cases where the data quality isn't 100% (and it never is).

One very simple trick I've been using for this, and which hasn't bitten me yet, is to define my dimensions' surrogate keys as int IDENTITY(1,1) in T-SQL (start at 1 and increment by 1 per row). Pseudo keys ("Unavailable", "Unassigned", "Not applicable") are defined as negative ints and populated by a stored procedure run at the beginning of the ETL process.

For example a table created as


    CREATE TABLE [dbo].[Location]
    (
        [LocationSK] [int] IDENTITY(1,1) NOT NULL,  -- surrogate key; real rows start at 1
        [Name] [varchar](50) NOT NULL,
        [Abbreviation] [varchar](4) NOT NULL,
        [LocationBK] [int] NOT NULL,                -- business key from the source system
        [EffectiveFromDate] [datetime] NOT NULL,
        [EffectiveToDate] [datetime] NULL,
        [Type1Checksum] [int] NOT NULL,
        [Type2Checksum] [int] NOT NULL
    ) ON [PRIMARY]

And a stored procedure populating the table with


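    -- LocationSK is an IDENTITY column, so explicit key values need IDENTITY_INSERT
    SET IDENTITY_INSERT dbo.Location ON;
    INSERT INTO dbo.Location (LocationSK, Name, Abbreviation, LocationBK,
                              EffectiveFromDate, Type1Checksum, Type2Checksum)
    VALUES (-1, 'Unknown location', 'Unk', -1, '1900-01-01', 0, 0);
    SET IDENTITY_INSERT dbo.Location OFF;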
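
Because the procedure runs at the start of every ETL load, it probably helps to make it safe to re-run. The following is only a sketch of one way to do that: the procedure name dbo.SeedLocationPseudoRows is made up for illustration, and it assumes the dbo.Location table defined above. SQL Server rejects explicit values for an IDENTITY column unless IDENTITY_INSERT is switched on for that table, so the sketch toggles it around the insert.

```sql
-- Hypothetical procedure name; dbo.Location is the table defined above.
CREATE PROCEDURE dbo.SeedLocationPseudoRows
AS
BEGIN
    SET NOCOUNT ON;

    -- Skip the insert if an earlier ETL run already created the pseudo row.
    IF NOT EXISTS (SELECT 1 FROM dbo.Location WHERE LocationSK = -1)
    BEGIN
        -- Explicit values for an IDENTITY column require IDENTITY_INSERT.
        SET IDENTITY_INSERT dbo.Location ON;

        INSERT INTO dbo.Location
            (LocationSK, Name, Abbreviation, LocationBK,
             EffectiveFromDate, Type1Checksum, Type2Checksum)
        VALUES
            (-1, 'Unknown location', 'Unk', -1, '1900-01-01', 0, 0);

        SET IDENTITY_INSERT dbo.Location OFF;
    END
END;
```

Further pseudo rows (for example -2 for "Not applicable") could follow the same pattern, each guarded by its own existence check.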

I have made it a rule to have at least one such pseudo row per dimension, which is used when the dimension lookup fails, and to build exception reports that track the number of facts assigned to such rows.
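
An exception report of this kind could be as simple as counting facts per pseudo row. This is a sketch under assumptions: dbo.FactShipment is a hypothetical fact table with a LocationSK foreign key, and all pseudo rows use negative surrogate keys as described above.

```sql
-- dbo.FactShipment is a hypothetical fact table with a LocationSK foreign key.
SELECT  d.LocationSK,
        d.Name   AS PseudoRow,
        COUNT(*) AS FactCount
FROM    dbo.FactShipment AS f
JOIN    dbo.Location     AS d ON d.LocationSK = f.LocationSK
WHERE   f.LocationSK < 0        -- pseudo rows use negative surrogate keys
GROUP BY d.LocationSK, d.Name
ORDER BY FactCount DESC;
```

Tracking this count per load gives a rough indicator of how often dimension lookups fail.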

OTHER TIPS

1) Either NULL or a reserved id from your date dimension with an appropriate meaning. Remember that NULL can have many different meanings: unknown, inapplicable, invalid, etc.

2) I would prefer an empty string (in a non-NULLable column), but the project I'm working on now converts empty strings to NULL and allows them in the database. A potential problem to discuss is that a blank middle initial (no middle name, so the middle initial is known to be empty) is different from an unknown middle initial, and similar semantics apply elsewhere. For money, our model allows NULLs. I have a big problem with this in the facts: typically they really should be 0, they are always used as 0, and they always have to be wrapped with ISNULL(). They were set to NULL only because of the ETL policy of converting empty strings to NULL, and that was just an artifact of the fixed-width transport file format, which had spaces instead of 0 from some source systems.

3) Our fact tables usually have a PK based on all the dimensions, so this wouldn't be allowed; the row would be linked to a dummy or unknown dimension record.

4) In SSIS I made a trim component which trims spaces from the ends of all strings. We typically had to do a lot of date validation and conversion in SSIS, which would have been best done in a component.
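
If the trimming and defaulting is done in T-SQL instead of an SSIS component, it might look like the sketch below. The sample rows and column names are invented for illustration. Whether a blank name should become 'Unknown', or a blank amount 0, is a per-source decision, since "known to be empty" and "unknown" are not the same thing.

```sql
-- Sample rows stand in for a hypothetical staging table loaded from a fixed-width file.
SELECT  src.AccountBK,
        -- Trim, turn blank strings into NULL, then substitute a placeholder.
        COALESCE(NULLIF(LTRIM(RTRIM(src.AccountName)), ''), 'Unknown') AS AccountName,
        -- Blank amounts from the fixed-width format are treated as 0 here (an assumption).
        COALESCE(CAST(NULLIF(LTRIM(RTRIM(src.AmountText)), '') AS decimal(18,2)), 0) AS Amount
FROM    (VALUES (1, 'Contoso', '  12.50'),
                (2, '       ', '       '),
                (3, NULL,      NULL)) AS src (AccountBK, AccountName, AmountText);
```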

Thanks for the input,

Two things I have done on my latest project are:

1) Used Steve's suggestion about negative ID keys for Unknown/special dimension values. This has worked perfectly and no issues arose during the SSAS cube building process.

2) Created transformations to check if a value is null, and if so, convert to either -1 (Unknown record in dimension) OR if it's a measure value, convert to 0. The expressions are shown below as examples (I used these in Derived column transformations):

    ISNULL(netWeight) ? 0 : netWeight                 // measure column
    ISNULL(completeddateid) ? -1 : completeddateid    // dimension key column
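
For comparison, the same defaults can be applied in T-SQL, for example in a staging query. This is a sketch only: the table variable and its sample rows are hypothetical, and the column names are taken from the expressions above.

```sql
-- Hypothetical staging rows; column names follow the SSIS expressions above.
DECLARE @Staging TABLE (netWeight decimal(18,2) NULL, completeddateid int NULL);

INSERT INTO @Staging (netWeight, completeddateid)
VALUES (12.5, 20190913), (NULL, 20190914), (7.0, NULL);

SELECT  COALESCE(netWeight, 0)        AS netWeight,        -- measure: NULL becomes 0
        COALESCE(completeddateid, -1) AS completeddateid   -- dimension key: NULL becomes -1 (Unknown)
FROM    @Staging;
```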

Hopefully this helps someone else in the future ;-)

Another solution I can suggest is to define a transfer table during the ETL step, in which imported records are temporarily stored AFTER all the necessary transformations. I would add a few extra attributes to that transfer table, next to the original value attributes that can be NULL or hold some other undesired value. These extra attributes hold a "coded" value identifying the problem, and the name of the attribute in which the erroneous value occurred.

Having done that, I can still decide in a later step how to use the denormalized, transferred data: possibly filtering out the erroneous values, or recording them in a separate error dimension so that reports can state which values were deviant and how they could affect the aggregated values.

e.g.

error-code attribute:

    -1 = NULL date
    -2 = NULL numerical value
    -3 = NULL PK
    -4 = NULL text value

and the other attribute = IdOrder, BirthDate, OrderAmount, etc.
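
A minimal sketch of such a transfer table follows. Every table and column name here is hypothetical; only the error codes and the example attribute names come from the description above. It records just the first problem found per row.

```sql
-- All table and column names are hypothetical.
CREATE TABLE dbo.OrderTransfer
(
    IdOrder        int           NULL,
    BirthDate      datetime      NULL,
    OrderAmount    decimal(18,2) NULL,
    ErrorCode      int           NULL,  -- -1 to -4 as listed above; NULL means no problem found
    ErrorAttribute varchar(50)   NULL   -- name of the attribute holding the bad value
);

-- Record the first problem found per row; CASE checks its branches in order.
UPDATE dbo.OrderTransfer
SET    ErrorCode = CASE
                       WHEN IdOrder     IS NULL THEN -3  -- NULL PK
                       WHEN BirthDate   IS NULL THEN -1  -- NULL date
                       WHEN OrderAmount IS NULL THEN -2  -- NULL numerical value
                   END,
       ErrorAttribute = CASE
                       WHEN IdOrder     IS NULL THEN 'IdOrder'
                       WHEN BirthDate   IS NULL THEN 'BirthDate'
                       WHEN OrderAmount IS NULL THEN 'OrderAmount'
                   END;
```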

Of course, you are in a lot more trouble if records can have MORE than one erroneous (NULL) value. In that case one could either expand the number of "tracing" attributes or "return to source" and find out, together with the development department, where and why the problem occurred.

It is a somewhat involved step, but for the sake of completeness and correctness I suppose it's necessary, because otherwise one might be confronted with badly aggregated information.

Maybe this too will help someone ;)

Licensed under: CC-BY-SA with attribution
