Derive and save historical values in a separate table, or calculate them from the existing data only when they're needed?

StackOverflow https://stackoverflow.com/questions/12616653

Question

tl;dr general question about handling database data and design:

Is it ever acceptable to derive data from other data at some point in time and store it in a separate table, in order to keep a history of the values at that time, and are there any downsides to doing so? Or should you never store data that is derived from other data, and instead derive it from the existing data only when you need it?

My specific scenario:

We have a database where we record people's vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and so on.

One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".

I see two ways to do this:

  1. A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would hold the EmployeeID, Year, and DaysRemaining at that time.

  2. We don't keep a YearEndHistories table. Instead, when we want to find out how many days someone had left at a certain time, we loop through all vacation days added and subtracted UP TO that specific time.

I like the certainty that comes with #1: the recorded values would be reviewed by administration, and there would be no arguing about the number and no possibility of it changing. With #2, I like the efficiency: one less table to maintain, and no derived data in the actual tables. But I have a nagging fear that some unseen bug will slip by and people's historical values will start coming out wrong. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"

One thing we have decided is that it will not be possible to modify values in previous years. That means it will never be possible to go back to a previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether there was a mistake in the past. If a mistake is discovered, it will be balanced out by awarding or subtracting vacation time in the current year.

Solution

Yes, it is acceptable, especially if the calculation is complex or frequently called, or if the underlying data doesn't change very often (e.g., a high-score table in a game: it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).

As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.

In your situation, the calculation seems relatively simple (the sum of vacation days granted minus days taken), but that's up to you.

As an aside, I would encourage you to stop thinking in terms of "loops" where data is concerned; try to think about the data as a whole, as a set. Something like:

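-- Days remaining per person as of the end of 2010
SELECT StaffID, sum(Vacation) as DaysRemaining
from
(
    -- days granted; "< 2011-01-01" also covers any time of day on Dec 31st
    SELECT StaffID, sum(VacationAllocated) as Vacation
    from Allocations
    where AllocationDate < convert(datetime, '2011-01-01', 120)
    group by StaffID
    union all  -- plain "union" would silently drop duplicate rows
    -- days taken, counted as negative
    SELECT StaffID, -count(distinct HolidayDate)
    from HolidayTaken
    where HolidayDate < convert(datetime, '2011-01-01', 120)
    group by StaffID
) totals
group by StaffID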

Other tips

Derived data seems to me like a transitive dependency, which normalisation avoids. That's the general rule.
In your case I would go for #1, which gives you better auditability without a performance penalty.
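
For what it's worth, here is a minimal sketch of what #1 could look like. It is not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere, and the schema is an assumption. The Allocations and HolidayTaken tables and their columns are borrowed from the query in the accepted answer, YearEndHistories comes from the question (keyed here by StaffID rather than EmployeeID, purely for consistency with that query), and snapshot_year_end is a hypothetical helper name. On SQL Server the same INSERT ... SELECT could presumably be the body of the Agent job.

```python
import sqlite3

# Assumed schema: names follow the answer's query and the question's
# YearEndHistories idea; none of this is the asker's real schema.
SCHEMA = """
CREATE TABLE Allocations  (StaffID INTEGER, AllocationDate TEXT, VacationAllocated REAL);
CREATE TABLE HolidayTaken (StaffID INTEGER, HolidayDate TEXT);
CREATE TABLE YearEndHistories (
    StaffID       INTEGER NOT NULL,
    Year          INTEGER NOT NULL,
    DaysRemaining REAL    NOT NULL,
    PRIMARY KEY (StaffID, Year)   -- at most one frozen row per person per year
);
"""

# Set-based snapshot: granted days minus days taken, up to the cutoff.
SNAPSHOT = """
INSERT INTO YearEndHistories (StaffID, Year, DaysRemaining)
SELECT StaffID, :year, SUM(Vacation)
FROM (
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate < :cutoff
    GROUP BY StaffID
    UNION ALL
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate < :cutoff
    GROUP BY StaffID
) AS totals
GROUP BY StaffID
"""

def snapshot_year_end(conn, year):
    """Freeze each person's remaining days as of December 31st of `year`."""
    cutoff = f"{year + 1}-01-01"  # ISO-formatted dates compare correctly as text
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(SNAPSHOT, {"year": year, "cutoff": cutoff})

# Small demonstration with made-up data.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO Allocations VALUES (?, ?, ?)",
                 [(1, "2010-01-01", 20), (1, "2011-01-01", 20)])
conn.executemany("INSERT INTO HolidayTaken VALUES (?, ?)",
                 [(1, "2010-07-05"), (1, "2010-07-06"), (1, "2011-03-01")])
snapshot_year_end(conn, 2010)
rows = conn.execute(
    "SELECT StaffID, Year, DaysRemaining FROM YearEndHistories").fetchall()
print(rows)  # [(1, 2010, 18.0)]
```

The primary key on (StaffID, Year) is the design choice that matters here: running the snapshot twice for the same year fails with an integrity error instead of quietly overwriting the recorded value, which fits the rule that a year-end figure is never modified afterwards.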

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow