What are best practices for storing many iterations of a product in an RDBMS?
-
09-03-2021 - |
سؤال
We are building a tool to track the prices of products over time, and using Postgres as our RDBMS. It is important that product attributes can be changed, and that the history of an product's attributes be preserved forever. Here is a schema we designed based on OpenStreetMap's internal schema:
We have a 'products' table on the left storing every version of every product, and a 'current_products' table on the right storing only the most recent version of each product. Every time we want to change a store, we:
- create an entry in changesets
- read the latest entry of the product in 'products', increment version by one, and create another entry with the changes
- delete the corresponding entry in 'current_products' and create a new one with the changes and the latest version number from 'products'
We want to enforce as many business rules in the database engine as possible rather than relying on our software to keep things consistent, and this schema feels pretty "off", so we welcome any suggestions. Thanks in advance!
Edit: Revised the schema based a response from @bbaird . Also decided to include versioning of stores and users. Tied products, stores, and users together with price table.
المحلول
You're right to feel the schema is off, because it is - the way it is designed now will not guarantee the minimum criteria required for consistency: As of a point in time, only one value can exist for a given attribute.
There are two ways to handle this, depending on the use case:
- Different versions of the attributes need to be accessed by the application
- Changes must be tracked for audit reasons only
Solution: Case 1
You would have a Product
table and a Product_Version
to store the necessary information. You will need a view/function to return the proper value.
Since you are dealing with food (and a standard source), I'm going to make certain assumptions about keys/datatypes. Feel free to comment to clarify.
CREATE TABLE Product
(
Barcode VARCHAR(13) NOT NULL
/* Store all invariant attributes in this table */
,CONSTRAINT PK_Product PRIMARY KEY (Barcode) /* This uniquely defines a product and is compact enough - no other key is necessary */
)
;
CREATE TABLE Product_Version
(
Barcode VARCHAR(13) NOT NULL
,Change_Dtm TIMESTAMP(6) NOT NULL
,Name VARCHAR(50) NOT NULL
,Price DECIMAL(8,2) NOT NULL /* Adjust as necessary */
,Currency_Cd CHAR(3) NOT NULL /* Should reference a Currency table with ISO codes (USD, EUR, GBP, etc) */
,Delete_Ind CHAR(1) NOT NULL
,Change_UserId VARCHAR(32) NOT NULL
,CONSTRAINT FK_Product_Version_Version_Of_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
,CONSTRAINT PK_Product_Version PRIMARY KEY (Barcode, Change_Dtm)
,CONSTRAINT CK_Product_Version_Price_GT_Zero CHECK (Price > 0)
,CONSTRAINT CK_Product_Version_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;
To get the values for a specific product as of a point in time, you would use the following query:
SELECT
PV.Barcode
,PV.Name
,PV.Price
,PV.Currency_Cd
FROM
Product_Version PV
WHERE
PV.Barcode = '8076809513388'
AND PV.Change_Dtm =
(
SELECT
MAX(Change_Dtm)
FROM
Product_Version
WHERE
Barcode = PV.Barcode
AND Change_Dtm <= '2020-10-29 12:30:00.000000'
)
You can also make a view to mimic the function of a table with static values:
CREATE VIEW v_Product AS
SELECT
PV.Barcode
,PV.Name
,PV.Price
,PV.Currency_Cd
FROM
Product_Version PV
WHERE
PV.Change_Dtm =
(
SELECT
MAX(Change_Dtm)
FROM
Product_Version
WHERE
Barcode = PV.Barcode
)
For one-to-many relationships (let's use Ingredient
for this example) you would follow a pattern like so:
CREATE TABLE Product_Ingredient
(
Barcode VARCHAR(13) NOT NULL
,Ingredient VARCHAR(50) NOT NULL /* Should reference an Ingredient table */
,Rank SMALLINT NOT NULL /* Uniqueness of this value needs to be handled through transaction logic */
,Change_Dtm TIMESTAMP(6) NOT NULL
,Delete_Ind CHAR(1) NOT NULL
,CONSTRAINT FK_Product_Ingredient_Used_In_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
,CONSTRAINT PK_Product_Ingredient PRIMARY KEY (Barcode, Change_Dtm)
,CONSTRAINT CK_Product_Ingredient_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;
Then to get a list of Ingredients
for a Product
at a point in time, you would use the following query:
SELECT
PI.Barcode
,PI.Ingredient
,PI.Rank
FROM
Product_Ingredient PI
WHERE
PI.Barcode = '8076809513388'
AND PI.Change_Dtm =
(
SELECT
MAX(Change_Dtm)
FROM
Product_Ingredient
WHERE
Barcode = PI.Barcode
AND Ingredient = PI.Ingredient
AND Change_Dtm <= '2020-10-29 12:30:00.000000' /* Or whatever */
)
AND PI.Delete_Ind = 'N'
Similar to the prior example, you can create a view to provide the current values for each of the one-to-many relations.
Solution: Case 2
If you merely need to store history, you simply make a small modification to the structure:
CREATE TABLE Product
(
Barcode VARCHAR(13) NOT NULL
,Name VARCHAR(50) NOT NULL
,Price DECIMAL(8,2) NOT NULL
,Currency_Cd CHAR(3) NOT NULL
,Change_UserId VARCHAR(32) NOT NULL
,Change_Dtm TIMESTAMP(6) NOT NULL
,Delete_Ind CHAR(1) NOT NULL
,CONSTRAINT PK_Product PRIMARY KEY (Barcode)
,CONSTRAINT CK_Product_Price_GT_Zero CHECK (Price > 0)
,CONSTRAINT CK_Product_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;
CREATE TABLE Product_Audit
(
Barcode VARCHAR(13) NOT NULL
,Name VARCHAR(50) NOT NULL
,Price DECIMAL(8,2) NOT NULL
,Currency_Cd CHAR(3) NOT NULL
,Change_Dtm TIMESTAMP(6) NOT NULL
,Change_UserId VARCHAR(32) NOT NULL
,Delete_Ind CHAR(1) NOT NULL
,CONSTRAINT PK_Product_Audit PRIMARY KEY (Barcode, Change_Dtm)
)
;
In this case, whenever an update or delete is called for a Product
, the following operations are followed:
- Insert into the audit table the current row from
Product
- Update the
Product
table with the new values
Notes:
- What's implicit in this discussion is that new data is written only if the data changes. You can enforce this either through transaction/ETL logic, or triggers to rollback attempts to insert data that is exactly the same as the prior values. This won't effect the data returned for a given query, but it goes a long way to making sure your table sizes don't explode unnecessarily.
- If you have a lot of attributes, and some change frequently (such as
Price
), but others do not (Name
,Description
), you can always split things into more tables (Product_Price
,Product_Name
, etc.) and just create a view that incorporates all of those elements. This level of effort generally isn't necessary unless the entities have a lot of attributes or you will have a lot of ad-hoc queries that are asking time-specific questions that rely on knowing the prior value was actually different, such as "Which products increased price during this time frame?" - It's crucial you do not follow the pattern of just sticking an
Id
on every table and thinking that provides any sort of value. Time-variant data always requires composite keys and only returns consistent results if the data is properly normalized to at least 3NF. Do not use any sort of ORM that does not support composite keys.
نصائح أخرى
It should work. But looking on your diagram I think you could merge products
with product_changeset
as you could have for every product version info who and when changed that. You could also consider to avoid to have dedicated table current_product
and instead use some kind of flag (active
or timestamp
on main products table). There is lot of possibilities, but best solution needs to think about:
- how many products do you expect to have?
- how often such changes can be made?
- do you want to change any attribute or only to specific ones?
- what is usage of this product history? Do you need to be able to revert to some point of history? Or do you want to show it somewhere to the user? Or is just it should be in db just for some audits, just as kind of logs?
added as a response to comment, it doesn't allow for such long ones :)
So I see two solutions here:
your original one with that change that
current_products
should provide unique productid
which would be then referenced byproducts
directly. I think you could change relation ofproduct_changesets
, and connect that to primary key ofproducts
. This seems like less used entity.- + better queries for only current versions of product - which seems to be major of your needs
- - more complicated when new version is provided
- - more complicated when you need to revert to some previous version
- - risk of desynchronize tables - you can have different actual versions in both tables; which one should be a source of truth?
alternatively you could have one table with immutable values (like product
id
or similar) which would be main table in fact for products. Then there could be kind ofproduct_history
orproduct_versions
which would have relation to main product table and would have all attributes plus some flag (likeactive
) which would declaratively say that this only row should be considered as valid one.- + easier than previously to add new version - you need only to have some trigger or similar to update flag of previous version, and some default on this flag, so modifying version would be only adding new rows in fact
- + no risk that you will have different values for current version in separate tables
- - slightly more complicated standard queries - you need to take product by, say, its id and this
active
column - you can have view for that - - there is risk that after many years for many changes, this table will be quite big; so maybe you could store only previous version, and previous ones somehow remove?
In any case, you have just to connect this price_entries
tables like you have already on diagram - using main unique id
of product. Then, again, you have figure out how to say that specific entry is the current one - similarly like above you can
try to reflect that by some current_price
entity, but it seems like overcomplicating. There some active
columns seems like doing a job. Do you need to connect specific price entry with specific version of product?
And please make sure that products attributes are kind of lookups (with dedicated tables), not free text fields, as it easily brings a mess to db.
In SQL Server my first thought would be to use temporal tables ("system versioned tables" from the SQL2011 standard) for everything, which we have used with some success in a couple of products. This way your main logic doesn't need to care about maintaining the history, just work with the current data as you otherwise would, only queries that specifically need to look back in time have to care at all.
While these are not supported by postgres that I know of (there are a couple of extensions listed in a quick search, but I don't know how complete/stable/supported they are) you could emulate the structure or something similar using triggers to populate the history tables as changes happen (something I did in our legacy products before temporal tables were available in SQL Server). You don't gain the syntactic sugar of statements like SELECT some_stuff FROM this_table AS OF SYSTEM_TIME a_specific_timestamp
but you'll have the data there to query more manually.