Question

I would like to do

SELECT * FROM t (using data from 1st of March 2012)

I already have a nice audit trail of all tables in the database. It basically makes a copy of every row that changes, storing all columns of the changed row in an hstore column. This means the data is available, but I am unsure whether this is the best way to store it for a database time machine.

To give you more context: we are creating accounting software. This means we need to be able to recreate all reports which we have offered to the customer (they can see them on our website, and they are updated continuously).

What sorts of problems do you anticipate that I will run into with this approach? Is there a better approach?

Some facts

  • Each row in the main table gets edited 5 times on average.
  • The main table has 32 columns (this could be reduced to only 7 that need an audit trail).
  • We will have at most 1 million users on our software, each with ~700 rows in the main table.

Solution

Here is my favorite approach:

  • Each table has a corresponding history table
  • Write stored procedures (or triggers) to make sure that all actions are logged to the history tables
  • On insert, add a row to the history table with start = now() and end = 31.12.2999
  • On update, first update the most recent history record to end = now(). Then insert a new row with start = now() and end = 31.12.2999
  • On delete, update the most recent history record to end = now().
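
The bullets above can be sketched as a PostgreSQL trigger. This is a minimal sketch, not the answerer's actual code; the table products, its history table products_hist, and the column names are assumptions for illustration (an analogous trigger would handle the insert case):

```sql
-- Hypothetical trigger function: closes the open history record and,
-- on update, opens a new one. "end" must be quoted (reserved word).
create or replace function products_hist_trg() returns trigger as $$
begin
    -- close the currently valid history record
    update products_hist
       set "end" = now()
     where id = old.id and "end" = '2999-12-31';
    if tg_op = 'UPDATE' then
        -- open a new record valid from now until "forever"
        insert into products_hist (id, productname, price, "start", "end")
        values (new.id, new.productname, new.price, now(), '2999-12-31');
    end if;
    return coalesce(new, old);
end;
$$ language plpgsql;

create trigger products_audit
after update or delete on products
for each row execute function products_hist_trg();
```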

Now you can write a point-in-time query, even with joins (substitute the timestamp you want to travel to for now() to see the past):

select g.groupname, p.productname, p.price
from products_hist p, product_groups_hist g
where p.group_id = g.id
and p.start <= now() and now() < p.end
and g.start <= now() and now() < g.end
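
To answer the original question directly (the state as of 1 March 2012), the same query can be pinned to that date. Table and column names here, including the group_id join column, are assumptions carried over from the example above:

```sql
select g.groupname, p.productname, p.price
from products_hist p, product_groups_hist g
where p.group_id = g.id
and p.start <= '2012-03-01' and '2012-03-01' < p.end
and g.start <= '2012-03-01' and '2012-03-01' < g.end;
```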

OTHER TIPS

I would like to offer a different approach than the accepted answer, based on event sourcing + CQRS.

Event sourcing is a different approach than normal relational design. You basically have only one table, let's call it events. An event is something that happened, named in the past tense, e.g. AccountCreated or MoneyDeposited. These events are classes that you serialize to text with your favorite serializer. The events have a defined ordering, which means we can build something on top of them.

CQRS means Command Query Responsibility Segregation. It means we have two models: one for writing and one for reading. Let's focus on the write model. Every time you want to change your object, you save an event to the stream. Let's say we save a MoneyDeposited event. We add an event handler inside our class. An ultra-simple example could look like this (some infrastructure missing):

class BankAccount {
    long Balance { get; set; }
    void Deposit(long amount) { SaveEvent(new MoneyDeposited(AccountId, amount)); }
    void Apply(MoneyDeposited m) { Balance += m.Amount; }
}

Since we have split the Apply method out, we can rebuild the object by reading an entire stream from the database and calling Apply() for each event.

This means we can easily check consistency rules. Let's say Withdraw() is called with insufficient funds: it throws before publishing the event. We can also use optimistic concurrency to detect two threads modifying the same object concurrently, which would lead to inconsistencies: we read the stream at version n from the database and write version n+1, so a simple unique index makes the write fail if some other thread wrote that version before us.
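
A minimal sketch of how such a stream could be stored (table and column names are assumptions, not a specific framework's schema). The primary key on (stream_id, version) is exactly what makes the optimistic concurrency check work:

```sql
-- Hypothetical event store table.
create table events (
    stream_id  uuid        not null,  -- e.g. the account id
    version    int         not null,  -- n, n+1, ... per stream
    event_type text        not null,  -- 'MoneyDeposited', ...
    payload    jsonb       not null,  -- the serialized event
    created_at timestamptz not null default now(),
    -- a second writer inserting the same version gets a unique violation
    primary key (stream_id, version)
);
```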

This is all simplified quite a bit, but your goal was a time machine, and with event streams you simply choose the version you want to read up to and ignore anything newer.
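
Under an assumed schema where events carry a per-stream version number, the time-machine read is just a filtered replay (the :account_id and :as_of_version placeholders are bind parameters):

```sql
-- Replay the stream up to a chosen version, ignoring anything newer.
select payload
from events
where stream_id = :account_id
  and version <= :as_of_version
order by version;
```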

I haven't even touched the read models yet, but long story short: in some other storage you build a number of representations based on the events, ready to be queried.

Some keywords to get you started: CQRS, event sourcing. Try to find presentations from Greg Young, Udi Dahan and Rinat Abdullin. Some useful frameworks: NEventStore (.NET) or Cirqus (.NET), even if the latter is quite young.

Licensed under: CC-BY-SA with attribution