What are best practices for storing many iterations of a product in an RDBMS?

https://dba.stackexchange.com/questions/278856

09-03-2021
|

سؤال

We are building a tool to track the prices of products over time, and using Postgres as our RDBMS. It is important that product attributes can be changed, and that the history of an product's attributes be preserved forever. Here is a schema we designed based on OpenStreetMap's internal schema:

We have a 'products' table on the left storing every version of every product, and a 'current_products' table on the right storing only the most recent version of each product. Every time we want to change a store, we:

create an entry in changesets
read the latest entry of the product in 'products', increment version by one, and create another entry with the changes
delete the corresponding entry in 'current_products' and create a new one with the changes and the latest version number from 'products'

We want to enforce as many business rules in the database engine as possible rather than relying on our software to keep things consistent, and this schema feels pretty "off", so we welcome any suggestions. Thanks in advance!

Edit: Revised the schema based a response from @bbaird . Also decided to include versioning of stores and users. Tied products, stores, and users together with price table.

المحلول

You're right to feel the schema is off, because it is - the way it is designed now will not guarantee the minimum criteria required for consistency: As of a point in time, only one value can exist for a given attribute.

There are two ways to handle this, depending on the use case:

Different versions of the attributes need to be accessed by the application
Changes must be tracked for audit reasons only

Solution: Case 1

You would have a Product table and a Product_Version to store the necessary information. You will need a view/function to return the proper value.

Since you are dealing with food (and a standard source), I'm going to make certain assumptions about keys/datatypes. Feel free to comment to clarify.

CREATE TABLE Product
(
  Barcode  VARCHAR(13)  NOT NULL
  /* Store all invariant attributes in this table */
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode) /* This uniquely defines a product and is compact enough - no other key is necessary */
)
;

CREATE TABLE Product_Version
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL /* Adjust as necessary */
 ,Currency_Cd    CHAR(3)        NOT NULL /* Should reference a Currency table with ISO codes (USD, EUR, GBP, etc) */
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,CONSTRAINT FK_Product_Version_Version_Of_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
 ,CONSTRAINT PK_Product_Version PRIMARY KEY (Barcode, Change_Dtm)
 ,CONSTRAINT CK_Product_Version_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Version_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

To get the values for a specific product as of a point in time, you would use the following query:

SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Barcode = '8076809513388'
    AND PV.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Version
        WHERE
          Barcode = PV.Barcode
            AND Change_Dtm <= '2020-10-29 12:30:00.000000'
      )

You can also make a view to mimic the function of a table with static values:

CREATE VIEW v_Product AS
SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Change_Dtm =
    (
      SELECT
        MAX(Change_Dtm)
      FROM
        Product_Version
      WHERE
        Barcode = PV.Barcode
    )

For one-to-many relationships (let's use Ingredient for this example) you would follow a pattern like so:

CREATE TABLE Product_Ingredient
(
  Barcode     VARCHAR(13)   NOT NULL
 ,Ingredient  VARCHAR(50)   NOT NULL  /* Should reference an Ingredient table */
 ,Rank        SMALLINT      NOT NULL  /* Uniqueness of this value needs to be handled through transaction logic */
 ,Change_Dtm  TIMESTAMP(6)  NOT NULL
 ,Delete_Ind  CHAR(1)       NOT NULL
 ,CONSTRAINT FK_Product_Ingredient_Used_In_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
 ,CONSTRAINT PK_Product_Ingredient PRIMARY KEY (Barcode, Change_Dtm)
 ,CONSTRAINT CK_Product_Ingredient_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

Then to get a list of Ingredients for a Product at a point in time, you would use the following query:

SELECT
  PI.Barcode
 ,PI.Ingredient
 ,PI.Rank
FROM
  Product_Ingredient PI
WHERE
  PI.Barcode = '8076809513388'
    AND PI.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Ingredient
        WHERE
          Barcode = PI.Barcode
            AND Ingredient = PI.Ingredient
            AND Change_Dtm <= '2020-10-29 12:30:00.000000' /* Or whatever */
      )
    AND PI.Delete_Ind = 'N'

Similar to the prior example, you can create a view to provide the current values for each of the one-to-many relations.

Solution: Case 2

If you merely need to store history, you simply make a small modification to the structure:

CREATE TABLE Product
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode)
 ,CONSTRAINT CK_Product_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

CREATE TABLE Product_Audit
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product_Audit PRIMARY KEY (Barcode, Change_Dtm)
)
;

In this case, whenever an update or delete is called for a Product, the following operations are followed:

Insert into the audit table the current row from Product
Update the Product table with the new values

Notes:

What's implicit in this discussion is that new data is written only if the data changes. You can enforce this either through transaction/ETL logic, or triggers to rollback attempts to insert data that is exactly the same as the prior values. This won't effect the data returned for a given query, but it goes a long way to making sure your table sizes don't explode unnecessarily.
If you have a lot of attributes, and some change frequently (such as Price), but others do not (Name, Description), you can always split things into more tables (Product_Price, Product_Name, etc.) and just create a view that incorporates all of those elements. This level of effort generally isn't necessary unless the entities have a lot of attributes or you will have a lot of ad-hoc queries that are asking time-specific questions that rely on knowing the prior value was actually different, such as "Which products increased price during this time frame?"
It's crucial you do not follow the pattern of just sticking an Id on every table and thinking that provides any sort of value. Time-variant data always requires composite keys and only returns consistent results if the data is properly normalized to at least 3NF. Do not use any sort of ORM that does not support composite keys.

نصائح أخرى

It should work. But looking on your diagram I think you could merge products with product_changeset as you could have for every product version info who and when changed that. You could also consider to avoid to have dedicated table current_product and instead use some kind of flag (active or timestamp on main products table). There is lot of possibilities, but best solution needs to think about:

how many products do you expect to have?
how often such changes can be made?
do you want to change any attribute or only to specific ones?
what is usage of this product history? Do you need to be able to revert to some point of history? Or do you want to show it somewhere to the user? Or is just it should be in db just for some audits, just as kind of logs?

added as a response to comment, it doesn't allow for such long ones :)

So I see two solutions here:

your original one with that change that current_products should provide unique product id which would be then referenced by products directly. I think you could change relation of product_changesets, and connect that to primary key of products. This seems like less used entity.
- + better queries for only current versions of product - which seems to be major of your needs
- - more complicated when new version is provided
- - more complicated when you need to revert to some previous version
- - risk of desynchronize tables - you can have different actual versions in both tables; which one should be a source of truth?
alternatively you could have one table with immutable values (like product id or similar) which would be main table in fact for products. Then there could be kind of product_history or product_versions which would have relation to main product table and would have all attributes plus some flag (like active) which would declaratively say that this only row should be considered as valid one.
- + easier than previously to add new version - you need only to have some trigger or similar to update flag of previous version, and some default on this flag, so modifying version would be only adding new rows in fact
- + no risk that you will have different values for current version in separate tables
- - slightly more complicated standard queries - you need to take product by, say, its id and this active column - you can have view for that
- - there is risk that after many years for many changes, this table will be quite big; so maybe you could store only previous version, and previous ones somehow remove?

In any case, you have just to connect this price_entries tables like you have already on diagram - using main unique id of product. Then, again, you have figure out how to say that specific entry is the current one - similarly like above you can try to reflect that by some current_price entity, but it seems like overcomplicating. There some active columns seems like doing a job. Do you need to connect specific price entry with specific version of product?

And please make sure that products attributes are kind of lookups (with dedicated tables), not free text fields, as it easily brings a mess to db.

In SQL Server my first thought would be to use temporal tables ("system versioned tables" from the SQL2011 standard) for everything, which we have used with some success in a couple of products. This way your main logic doesn't need to care about maintaining the history, just work with the current data as you otherwise would, only queries that specifically need to look back in time have to care at all.

While these are not supported by postgres that I know of (there are a couple of extensions listed in a quick search, but I don't know how complete/stable/supported they are) you could emulate the structure or something similar using triggers to populate the history tables as changes happen (something I did in our legacy products before temporal tables were available in SQL Server). You don't gain the syntactic sugar of statements like SELECT some_stuff FROM this_table AS OF SYSTEM_TIME a_specific_timestamp but you'll have the data there to query more manually.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى dba.stackexchange