Question

I am currently working on a project that collects a customer's demographics weekly and stores the delta (from the previous week) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). The variables will be collected from 9 different tables in our Teradata warehouse.

I am planning to split this into 2 tables.

  1. Table with often-used demographics (~60 variables sourced from 3 tables)
    • Normalized (1 customer id and add date for each demographic variable)
  2. Table with rarely used or unused demographics (~100 variables sourced from 6 tables)
    • Normalized (1 customer id and add date for each demographic variable)

Multi-value compression (MVC) is used to save as much space as possible, since the database this will live on is limited in size due to backup limitations. (Note that the customer id currently consumes 30% (3.5 GB) of table 1's size, so each additional table would add that storage cost.)
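
For illustration, here is a rough sketch of how table 1 might be defined with MVC; the column names and compress lists are hypothetical placeholders, and the real compress values would come from profiling the data:

CREATE TABLE db1.demo_test (
    cus_id      INTEGER NOT NULL,
    add_dt      DATE    NOT NULL,
    gender_cd   CHAR(1)     COMPRESS ('M', 'F'),           -- MVC on the most frequent values
    state_cd    CHAR(2)     COMPRESS ('CA', 'TX', 'NY'),
    income_band VARCHAR(10) COMPRESS ('LOW', 'MED', 'HIGH')
)
PRIMARY INDEX (cus_id);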

The table(s) will be accessed by finding the most recent record relative to the date the analyst has selected:

SELECT cus_id, demo
FROM db1.demo_test
WHERE (cus_id, add_dt) IN (
    SELECT cus_id, MAX(add_dt)
    FROM db1.demo_test
    WHERE add_dt <= DATE '2013-03-01'  -- Analyst-selected point-in-time date
    GROUP BY 1)
GROUP BY 1,2

This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.

  1. Does this approach seem sound for storage and querying?
    • Is any individual table too large?
  2. Is there a better suggested approach?
    • My concerns with splitting further are
      • Space, due to incompressible fields such as dates and customer ids
      • Speed when joining 2-3 tables (I suspect an inner join may use very few resources.)

Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a data analyst by profession), or the tables I build for long-term data collection only contain a handful of columns.


Solution

In addition to Rob's remarks:

What is your current PI/partitioning?

Is the current performance unsatisfactory?

How do the analysts access the data? Besides the point-in-time date, are there any other common conditions?

Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as date ... BETWEEN prev_dt AND add_dt.
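
As a rough sketch of that pattern (assuming a hypothetical prev_dt column populated at load time; the exact inclusive/exclusive boundaries would need to be pinned down):

-- Point-in-time lookup against an interval-versioned row (hypothetical prev_dt column)
SELECT cus_id, demo
FROM db1.demo_test
WHERE DATE '2013-03-01' BETWEEN prev_dt AND add_dt;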

A Join Index on (cus_id), (add_dt) might be helpful, too.
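
For example, a single-table join index along those lines might look like the following sketch; the projected columns are hypothetical and would be limited to the most frequently used demographics:

-- Single-table join index on (cus_id), value-ordered by add_dt (hypothetical column list)
CREATE JOIN INDEX db1.demo_ji AS
SELECT cus_id, add_dt, gender_cd, state_cd
FROM db1.demo_test
PRIMARY INDEX (cus_id)
ORDER BY VALUES (add_dt);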

You might replace the MAX() subquery with a RANK (the MAX approach is usually slower; only when cus_id is the PI might RANK be worse):

SELECT *
FROM db1.demo_test
WHERE add_dt <= DATE '2013-03-01'  -- Analyst-selected point-in-time date
QUALIFY
  RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1

In TD14 you might split your single table into two row containers of a column-partitioned table.
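
A rough sketch of what that could look like, with a handful of hypothetical columns standing in for the real demographics (in TD14 a column-partitioned table is created as a NoPI table):

-- Column-partitioned table with two row containers (hypothetical columns)
CREATE TABLE db1.demo_cp (
    cus_id    INTEGER NOT NULL,
    add_dt    DATE    NOT NULL,
    gender_cd CHAR(1),    -- stands in for the ~60 often-used variables
    state_cd  CHAR(2),
    hobby_cd  CHAR(2),    -- stands in for the ~100 rarely used variables
    pet_cnt   BYTEINT
)
NO PRIMARY INDEX
PARTITION BY COLUMN (
    ROW (cus_id, add_dt, gender_cd, state_cd),  -- container 1: keys + frequent demographics
    ROW (hobby_cd, pet_cnt)                     -- container 2: rarely used demographics
);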

...

OTHER TIPS

The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical design, it would be in your best interest for each table to share the same primary index. This allows joins between these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
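
A sketch of that shared-PI layout, with hypothetical table and column names; because both tables hash on cus_id, the join below stays AMP-local:

-- Both tables share PRIMARY INDEX (cus_id), so the join is AMP-local (hypothetical names)
CREATE TABLE db1.demo_core (
    cus_id    INTEGER NOT NULL,
    add_dt    DATE    NOT NULL,
    gender_cd CHAR(1)
) PRIMARY INDEX (cus_id);

CREATE TABLE db1.demo_rare (
    cus_id   INTEGER NOT NULL,
    add_dt   DATE    NOT NULL,
    hobby_cd CHAR(2)
) PRIMARY INDEX (cus_id);

SELECT c.cus_id, c.gender_cd, r.hobby_cd
FROM db1.demo_core c
JOIN db1.demo_rare r
  ON c.cus_id = r.cus_id
 AND c.add_dt = r.add_dt;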

If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This allows the optimizer to eliminate the other partitions from being scanned when add_dt is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
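
The two options might look roughly like this, with a hypothetical date range and weekly interval for the partitioning expression:

-- Partitioned primary index on add_dt (hypothetical range/interval)
CREATE TABLE db1.demo_ppi (
    cus_id    INTEGER NOT NULL,
    add_dt    DATE    NOT NULL,
    gender_cd CHAR(1)
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2013-01-01' AND DATE '2020-12-31'
                     EACH INTERVAL '7' DAY, NO RANGE);

-- Alternative to test: a value-ordered secondary index on add_dt
CREATE INDEX (add_dt) ORDER BY VALUES (add_dt) ON db1.demo_test;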

Licensed under: CC-BY-SA with attribution