Question

I'm facing a dilemma. I'm working with a lot of legacy code, and I keep seeing redundant information in the table structures. It primarily takes two forms:

A. Redundant information to save on 'joins', e.g.:

event_id, event_name, event_creator_id
       3,      test1,               43

subevent_id, event_id, event_creator_id
         21,        3,               43

Note the duplication of event_creator_id. The rationale given by the erstwhile 'senior' developers is that when we need the event creator id, we just have to query one table instead of doing an 'expensive' join to retrieve the value.
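For reference, the join being avoided here would be a single equi-join on the parent table, something like this (assuming the tables are simply named event and subevent):

    -- The join the redundant column is meant to avoid:
    -- look up the creator through the parent event row.
    SELECT s.subevent_id,
           e.event_creator_id
    FROM   subevent AS s
    JOIN   event    AS e ON e.event_id = s.event_id
    WHERE  s.subevent_id = 21;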

B. Redundant information to save on calculations, e.g.:

event_id, event_default_price
       3,                 100

discount_id, discount_code, discount_percentage
          7,          ABCD,                  50

special_event_id, event_id, discount_id, discounted_price
              21,        3,           7,               50

Note that instead of calculating the final 'discounted_price' for this special event on the fly (the reference to the discount_id is already there), the code stores the 'calculated' value directly. Again, the justification is 'speed', normalization shot to hell.

I have two questions:

  1. I can tell new developers that these structures are not normalized, but they can say it's faster. How do I counter that? Do I counter that? Do others structure their databases like this?!
  2. Is there a rule of thumb, or a set of principles, I can use to say - 'oh, it will be slower, but only by 1%, so it's okay to do it this way', etc.?

Solution

About your two questions:

I can tell new developers that these structures are not normalized, but they can say it's faster. How do I counter that? Do I counter that? Do others structure their databases like this?!

It may be faster, but it isn't necessarily so: whenever you decide to add extra information to a table (the extra fields, in your case) you also add a performance penalty, because the table gets bigger. That can mean more data travelling from server to clients, and more data being paged in and out of memory. And if a field is there to speed up queries, it will probably have one or more indexes on it, which again costs something on every update and insert.

The main point, though, is the one I hinted at in my comment: "cached" and "precomputed" values make the system more fragile in terms of data integrity. Are you sure that event_creator_id always points to the real creator, even after someone has amended the original value? If yes, that guarantee has a cost too, both in computation (you have to update all the tables whenever the creator changes) and in actual development and testing effort (are you sure nobody forgot to propagate changes to the precomputed fields?).
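To make that maintenance cost concrete, here is a rough sketch of the machinery each duplicated column ends up needing. The trigger below is illustrative only, written in PostgreSQL-style syntax against the event and subevent table names taken from your question:

    -- Illustrative only: every denormalized copy needs something like
    -- this (or equivalent discipline in application code) to stay
    -- correct when the original value changes.
    CREATE OR REPLACE FUNCTION sync_subevent_creator() RETURNS trigger AS $$
    BEGIN
        UPDATE subevent
        SET    event_creator_id = NEW.event_creator_id
        WHERE  event_id = NEW.event_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER event_creator_sync
    AFTER UPDATE OF event_creator_id ON event
    FOR EACH ROW
    EXECUTE FUNCTION sync_subevent_creator();

Every denormalized copy needs a piece of code like this somewhere, and every one of them is a place where integrity can silently break.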

The same goes for aggregate values like "discounted price" or running totals... and changing the original data is probably much more frequent than changing the "event creator" information. Again, is there a proper "cache invalidation" mechanism in place to ensure that total sales are recomputed whenever someone completes a sale? What about returned items? Has anyone considered the cost of ensuring that integrity?

Running totals and other derived values should instead be implemented with views, so that caching, if any, is performed by the DBMS engine itself, which knows how to take care of it properly.
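For instance, the discounted price from example B could be exposed through a view instead of being stored; a minimal sketch, assuming tables named event, discount and special_event as in your question:

    -- The derived value is computed on read, so it can never drift
    -- out of sync with the base price or the discount percentage.
    CREATE VIEW special_event_pricing AS
    SELECT se.special_event_id,
           se.event_id,
           se.discount_id,
           e.event_default_price
             * (1 - d.discount_percentage / 100.0) AS discounted_price
    FROM   special_event AS se
    JOIN   event    AS e ON e.event_id    = se.event_id
    JOIN   discount AS d ON d.discount_id = se.discount_id;

If reading through the view ever proves too slow, many DBMSs offer materialized views, which give you the precomputation while keeping the refresh mechanism inside the engine rather than scattered through application code.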

Is there a rule of thumb, or a set of principles, I can use to say - 'oh, it will be slower, but only by 1%, so it's okay to do it this way', etc.?

A DB (or arguably any kind of computing system) should be "correct first", so that you can then work out how to make it "fast enough" second. Trading correctness for speed is a decision you should not take when designing the DB, unless you already know that timeliness is considered far more important than correctness - i.e. your requirements clearly state that possibly wrong or outdated information matters less than response time.

In other words: designing a table with redundant cached info is another example of premature optimization and should be avoided at all costs.

See also this - especially the answers

OTHER TIPS

Every DB book I have read on relational design has included a section on "planned" redundancy or "limited" denormalization. It depends on the environment. Wells Fargo, for example, pre-calculates bank statement totals and stores the pre-calculated values.

Imagine how long those calculations would take if they waited until the end of each cycle, when the statements are printed.

Planned redundancy is Normal!
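When it is done deliberately, planned redundancy usually takes the shape of a separate summary table rebuilt by a scheduled batch job rather than a value patched row by row. A purely hypothetical sketch (the statement_totals and transactions tables below are invented for illustration):

    -- Hypothetical pre-calculated statement totals, rebuilt in one batch
    -- at the end of each cycle instead of being computed at print time.
    CREATE TABLE statement_totals (
        account_id   BIGINT        NOT NULL,
        cycle_end    DATE          NOT NULL,
        total_amount NUMERIC(12,2) NOT NULL,
        PRIMARY KEY (account_id, cycle_end)
    );

    -- Run once per cycle by a scheduled job (dates are illustrative).
    INSERT INTO statement_totals (account_id, cycle_end, total_amount)
    SELECT account_id, DATE '2024-01-31', SUM(amount)
    FROM   transactions
    WHERE  posted_at >= DATE '2024-01-01'
      AND  posted_at <  DATE '2024-02-01'
    GROUP BY account_id;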

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow