Question

It is common to keep every version of a post when it is edited (as the Stack Exchange sites do), so that old versions can be restored. What is the best way to store all versions?

Method 1: Store all versions in the same table, adding a column for the revision order or a flag for the active version. This will make the table very large.

Method 2: Create an archive table to store older versions.

In both methods, how should I deal with the row ID, which is the main identifier of the article?


Solution

The "best" way to save revision history depends on what your specific goals/constraints are -- and you haven't mentioned these.

But here are some thoughts about your two suggested methods:

  • Create one table for posts, and one for post history, for example:

    -- One row per post; id is the stable identifier of the article.
    create table posts (
      id int primary key,
      userid int
    );

    -- One row per revision of a post.
    create table posthistory (
      postid int,
      revisionid int,
      content varchar(1000),
      foreign key (postid) references posts(id),
      primary key (postid, revisionid)
    );
    

(Obviously there would be more columns, foreign keys, etc.) This is straightforward to implement and easy to understand (and it is easy to let the RDBMS maintain referential integrity), but as you mentioned, it may result in posthistory having too many rows to be searched quickly enough.

Note that postid is a foreign key in posthistory (and the PK of posts).
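As a rough illustration of how revisions could be appended and read back with this schema, here is a minimal sketch using Python's built-in sqlite3 module. The helper name add_revision and the sample data are hypothetical, and computing the next revisionid with max(...) + 1 assumes a single writer; a real system would need locking or a retry on key conflicts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    create table posts (
      id int primary key,
      userid int
    );
    create table posthistory (
      postid int,
      revisionid int,
      content varchar(1000),
      foreign key (postid) references posts(id),
      primary key (postid, revisionid)
    );
""")

def add_revision(conn, postid, content):
    """Append a new revision for postid and return its revisionid (hypothetical helper)."""
    with conn:  # commit on success, roll back on error
        (next_rev,) = conn.execute(
            "select coalesce(max(revisionid), 0) + 1 from posthistory where postid = ?",
            (postid,),
        ).fetchone()
        conn.execute(
            "insert into posthistory (postid, revisionid, content) values (?, ?, ?)",
            (postid, next_rev, content),
        )
    return next_rev

# Sample data (made up for the example).
conn.execute("insert into posts (id, userid) values (1, 42)")
add_revision(conn, 1, "first draft")
add_revision(conn, 1, "second draft")

# Full history of post 1, oldest first; posts.id never changes.
history = conn.execute(
    "select revisionid, content from posthistory where postid = ? order by revisionid",
    (1,),
).fetchall()
```

The post keeps the same id for its whole life; only (postid, revisionid) pairs are added.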

  • Use a denormalized schema where the latest revision of each post is in one table, and previous revisions are in a separate table. This requires more logic in the application: when a new version is added, it replaces the row with the same id in the posts table, and a row is also added to the revision table.

(This may be what SE sites use, based on the data dump in the SE Data Explorer. Or maybe not, I can't tell.)

For this approach, postid is also a foreign key in the posthistory table, and the primary key in the posts table.
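One possible reading of this second approach, sketched with Python's sqlite3 module: the current row is copied into posthistory and then overwritten in posts, inside a single transaction. The content and revisionid columns on posts and the helper save_new_version are assumptions made for the example, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed columns: revisionid and content on posts are illustrative additions.
    create table posts (
      id int primary key,
      userid int,
      revisionid int not null,
      content varchar(1000)
    );
    create table posthistory (
      postid int,
      revisionid int,
      content varchar(1000),
      foreign key (postid) references posts(id),
      primary key (postid, revisionid)
    );
""")

def save_new_version(conn, postid, new_content):
    """Archive the current row of posts, then overwrite it (hypothetical helper)."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "insert into posthistory (postid, revisionid, content) "
            "select id, revisionid, content from posts where id = ?",
            (postid,),
        )
        conn.execute(
            "update posts set content = ?, revisionid = revisionid + 1 where id = ?",
            (new_content, postid),
        )

# Sample data (made up for the example).
conn.execute(
    "insert into posts (id, userid, revisionid, content) values (1, 42, 1, 'v1')"
)
save_new_version(conn, 1, "v2")
save_new_version(conn, 1, "v3")

current = conn.execute(
    "select revisionid, content from posts where id = 1"
).fetchone()
archived = conn.execute(
    "select revisionid, content from posthistory where postid = 1 order by revisionid"
).fetchall()
```

Reads of the current version touch only the small posts table; the transaction is what keeps the two tables from drifting apart.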

OTHER TIPS

In my opinion, an interesting approach is

  • to define another table, for example posts_archive (it will contain all columns of the posts table + an auto-incremented primary key + optionally a date...)
  • to feed this table through after-insert and after-update triggers defined on the posts table.
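A minimal sketch of the trigger idea, written against SQLite through Python's sqlite3 module (trigger syntax differs between database systems). The column names, the trigger names, and the archived_at column are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table posts (
      id integer primary key,
      userid int,
      content varchar(1000)
    );

    -- All columns of posts, plus an auto-incremented key and a date.
    create table posts_archive (
      archive_id integer primary key autoincrement,
      id int,
      userid int,
      content varchar(1000),
      archived_at text default current_timestamp
    );

    -- Copy every inserted row into the archive.
    create trigger posts_after_insert after insert on posts
    begin
      insert into posts_archive (id, userid, content)
      values (new.id, new.userid, new.content);
    end;

    -- Copy the new state of every updated row into the archive.
    create trigger posts_after_update after update on posts
    begin
      insert into posts_archive (id, userid, content)
      values (new.id, new.userid, new.content);
    end;
""")

# The application only ever touches posts; the triggers do the rest.
conn.execute("insert into posts (id, userid, content) values (1, 42, 'v1')")
conn.execute("update posts set content = 'v2' where id = 1")
conn.commit()

rows = conn.execute(
    "select archive_id, id, content from posts_archive order by archive_id"
).fetchall()
```

With these triggers the archive holds every state a post has ever had, including the current one, and the application code cannot forget to write the history.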

If the size of the table is an issue, then the second option would be the better choice. That way the active version can be returned quickly from a smaller table, while restoring an older version from the larger archive table can be allowed to take longer. That said, table size should not be an issue with a sensible database and indexing.

Either way, you need a primary key that consists of multiple columns instead of just the row ID. The simplest answer is to include in the key a timestamp recording when each revision was created, so that the ID continues to identify a specific article, and the ID and revision time together identify a specific revision of the article.

Dealing with temporal data is a known problem.

Method 1 simply changes your table's identifier: you end up with a table containing messageID, version, description, ... with the primary key (messageID, version). Modifying the data is done by simply adding a row with an incremented version. Querying is a little more complicated.
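To give an idea of the extra query complexity, here is a small sketch (Python's sqlite3 module, with a hypothetical table name messages and made-up rows) that fetches only the latest version of each message using a correlated subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table name and sample rows are assumptions for the example.
    create table messages (
      messageID int,
      version int,
      description varchar(1000),
      primary key (messageID, version)
    );
    insert into messages values (1, 1, 'hello');
    insert into messages values (1, 2, 'hello, world');
    insert into messages values (2, 1, 'another post');
""")

# For each message, keep only the row whose version is the highest.
latest = conn.execute("""
    select m.messageID, m.version, m.description
    from messages m
    where m.version = (
      select max(version) from messages where messageID = m.messageID
    )
    order by m.messageID
""").fetchall()
```

The composite primary key already indexes (messageID, version), so the subquery should stay cheap in most databases even when the table grows.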

Method 2 is more tedious: you end up with one table keyed by rowID and a second table that is exactly the same as in method 1. Then, on every update, you have to remember to copy the data into the "backup table".

Method 3: the answer given by Matt.

In my opinion, methods 1 and 3 are better. The schema is simpler in method 1, but method 3 lets you keep unversioned data for your posts.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow