Associating on the ID of an association table
11-12-2020
Question
Sorry about the title, I couldn't find anything better. Suggestions are welcome.
Say we have two tables, Suppliers and Products.
The same product can come from many different suppliers, so we create a third table, let's call it SuppliersAndProducts, with this schema:
SuppliersAndProducts
- id (autoincrement)
- supplier_id
- product_id
- from_date
- to_date
from_date and to_date are there to account for the fact that a given supplier might well stop selling a given product, but could also begin selling it again at some point in the future.
Now, we want to store the price we pay when we buy a Product from a given Supplier. Prices, of course, change over time, and we want to keep track of that too.
So we introduce a table, let's call it SuppliersAndPrices, structured like so:
SuppliersAndPrices
- id (autoincrement)
- supplier_id (FK to the supplier id)
- product_id (FK to the product id)
- price
- from_date
- to_date
My question about this last table is more conceptual than anything else: should this table be like I described (meaning that the association is based on both the supplier and product IDs) or instead just reference the id from SuppliersAndProducts with a column that we could call suppliers_and_products_id?
The latter design is more normalized than the former; after all, the association between a Supplier and a Product is already stated in SuppliersAndProducts, so there would be little point in repeating that information. Still, at least to me, for some reason the former feels closer to the real world.
From the point of view of query complexity, to know the price I'm paying today for a given product, with the first design I would have to write (with some pseudo-SQL to remain db-agnostic) something like:
SELECT price
FROM SuppliersAndPrices
WHERE supplier_id = X
  AND product_id = Y
  AND CURRENT_DATE BETWEEN from_date AND to_date
whereas the second design would entail a join, thus making the query look like:
SELECT SuppliersAndPrices.price
FROM SuppliersAndPrices
INNER JOIN SuppliersAndProducts
    ON SuppliersAndPrices.suppliers_and_products_id = SuppliersAndProducts.id
WHERE SuppliersAndProducts.supplier_id = X
  AND SuppliersAndProducts.product_id = Y
  AND CURRENT_DATE BETWEEN SuppliersAndPrices.from_date AND SuppliersAndPrices.to_date
I'm writing this off the top of my head, sorry about my SQL.
Also, now that I think about it, with the second design I would have to add another condition to the WHERE clause to check the dates in SuppliersAndProducts too, to handle the case in which a Supplier will stop carrying a product and, at some point, begin selling it again. In that case the join condition would return more than one row, and that would be Not A Good Thing™.
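A minimal sketch of the second design's lookup with both date checks in place, using Python's built-in sqlite3 module purely so the SQL can be run. The sample rows, the ISO date strings, the convention that a NULL to_date means "still open", and the helper name price_on are all assumptions made for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SuppliersAndProducts (
    id          INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT            -- assumption: NULL means "still supplied"
);
CREATE TABLE SuppliersAndPrices (
    id                        INTEGER PRIMARY KEY,
    suppliers_and_products_id INTEGER NOT NULL
        REFERENCES SuppliersAndProducts (id),
    price                     NUMERIC NOT NULL,
    from_date                 TEXT NOT NULL,
    to_date                   TEXT  -- assumption: NULL means "current price"
);
-- Hypothetical data: supplier 1 carried product 7, stopped, then resumed.
INSERT INTO SuppliersAndProducts VALUES
    (1, 1, 7, '2019-01-01', '2019-06-30'),
    (2, 1, 7, '2020-01-01', NULL);
INSERT INTO SuppliersAndPrices VALUES
    (1, 1, 100, '2019-01-01', '2019-06-30'),
    (2, 2, 120, '2020-01-01', NULL);
""")

def price_on(conn, supplier_id, product_id, day):
    """Return every price valid on `day`, checking both tables' date ranges."""
    rows = conn.execute("""
        SELECT pr.price
        FROM SuppliersAndPrices AS pr
        INNER JOIN SuppliersAndProducts AS sp
            ON pr.suppliers_and_products_id = sp.id
        WHERE sp.supplier_id = :s
          AND sp.product_id = :p
          AND :d >= pr.from_date AND (pr.to_date IS NULL OR :d <= pr.to_date)
          AND :d >= sp.from_date AND (sp.to_date IS NULL OR :d <= sp.to_date)
    """, {"s": supplier_id, "p": product_id, "d": day}).fetchall()
    return [row[0] for row in rows]
```

ISO-formatted date strings compare correctly as text, which is why plain >= and <= are enough here; a database with a native DATE type would use its own date comparison instead.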
So, which one would you choose? Associating on the id of what already is an association in the name of greater normalization but at the price of increasing query complexity? Denormalizing a tiny bit to make querying easier and, arguably, the database structure clearer for future maintainers?
The query for the second design would probably be hidden behind a view, but that's still code that has to be maintained and understood. What I really want to know is which one of the two designs is "better", for some definition of "better".
Solution
Just to get this part out of the way, what is the intention of keeping the historical data? If it is for the app to allow users to interact with on some level, then ok, as that is still a transactional need. Else, if it is merely for reporting purposes, that could be tracked entirely in a separate server / database / schema as you are really talking about a Slowly Changing Dimension (SCD). If it only exists for reporting then no need to complicate your transactional model with anything but the current info.
That being said, I will assume the data is needed for transactional purposes, in which case there is no question (in my mind, at least :) that Option 2 (i.e. relating the Price info with the Supplier and Product relationship table) is the only way to go. If you relate the Price properties back to their respective sources then you allow for invalid combinations of Suppliers and Products that don't exist in the SuppliersAndProducts table. And while performance and maintainability are very important considerations when designing, they are secondary to data integrity as that is the primary responsibility of the database.
Some notes:
It is better to use the table name + "id" rather than the generic "id". This will make writing queries a lot easier as the field name will be the same between tables. Queries in general will be more readable.
While I also like naming tables as plural (it just sounds better), it does make things easier to use the singular. This way the "id" field of the table includes the table name without sounding odd, such as "SuppliersID" (as opposed to "SupplierID"). If you ever have to code automated processes against the tables, it is much easier to assume that the ID field is simply "{TableName}ID", else you might need to hit the database to do a look up. And for new people being trained it is just easier to know that any table will have a standard ID field name, which makes writing queries faster. This is why I no longer use plural table names :). So, for example, SuppliersAndProducts would be SupplierAndProduct or maybe even SupplierXProduct.
If it weren't for the historical values then SupplierAndProduct wouldn't need an auto-incrementing ID as the combination of SupplierID and ProductID would be the composite PK.
But since we do have that history we can do two things with that relationship table:
- PK is the SupplierAndProductID field
- Unique Index (representing an "Alternate Key") on the combination of the SupplierAndProductID, SupplierID, and ProductID fields
For the SupplierAndProductPrice (or SupplierXProductPrice) table we would have (just in terms of the Key fields here; no changes to the "price" and date, etc fields):
- supplier_and_product_price_id (autoincrement) -- I will stop PascalCasing now ;-)
- supplier_id (FK to supplier_and_product; could also FK to supplier)
- product_id (FK to supplier_and_product; could also FK to product)
- supplier_and_product_id (FK to supplier_and_product)
In this model, since the supplier_and_product_id field lacks any real meaning, we bring along the other two fields and the combination of all 3 will relate back to the Unique Index on supplier_and_product. I generally try to avoid FKs to Unique Indexes/Constraints but this is one situation where it makes sense to do it.
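As a rough sketch of that arrangement (not a definitive implementation), here is the three-column FK against the Unique Index, run through Python's built-in sqlite3 module. The sample IDs, the date columns' TEXT type, and the helper name try_insert_price are assumptions for illustration; other databases spell the auto-increment and date types differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY);

CREATE TABLE supplier_and_product (
    supplier_and_product_id INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id),
    product_id  INTEGER NOT NULL REFERENCES product (product_id),
    from_date   TEXT NOT NULL,
    to_date     TEXT,
    -- The "Alternate Key" that the price table's composite FK points at.
    UNIQUE (supplier_and_product_id, supplier_id, product_id)
);

CREATE TABLE supplier_and_product_price (
    supplier_and_product_price_id INTEGER PRIMARY KEY,
    supplier_and_product_id INTEGER NOT NULL,
    supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id),
    product_id  INTEGER NOT NULL REFERENCES product (product_id),
    price       NUMERIC NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT,
    FOREIGN KEY (supplier_and_product_id, supplier_id, product_id)
        REFERENCES supplier_and_product
            (supplier_and_product_id, supplier_id, product_id)
);

-- Hypothetical data: only supplier 1 sells product 10.
INSERT INTO supplier VALUES (1), (2);
INSERT INTO product  VALUES (10), (20);
INSERT INTO supplier_and_product VALUES (1, 1, 10, '2020-01-01', NULL);
""")

def try_insert_price(conn, sp_id, supplier_id, product_id, price):
    """Insert a price row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO supplier_and_product_price "
                "(supplier_and_product_id, supplier_id, product_id, price, from_date) "
                "VALUES (?, ?, ?, ?, '2020-01-01')",
                (sp_id, supplier_id, product_id, price),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this data, a price for (relationship 1, supplier 1, product 10) is accepted, while one that pairs relationship 1 with supplier 2 is rejected even though supplier 2 exists, because that triple is not present in supplier_and_product.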
Putting all 3 fields together in a Unique Index, so that the FK can reference it, allows for bringing over the two needed fields (supplier_id and product_id) while guaranteeing that the combination of those two is always a valid combination (as it is enforced by the FK).
I would recommend against combining these into a single table as it duplicates relationship info of Product and Supplier when the Price changes during the from and to dates of the Product-to-Supplier relationship. But, that is just for the transactional side. On the reporting side it is definitely a good idea to combine them as suggested by @JonofAllTrades.
Regarding the need for the extra WHERE condition to check the date against the from and to dates in supplier_and_product: this is not a bad idea, but is technically just a safeguard, assuming that the app doesn't allow a given Price entry to have dates that fall outside of the time-frame that the Product is being supplied by that Supplier. But if the dates are validated against such oddities, then you can get away with the simplified query of Option 1 (since the product_id and supplier_id fields are there) while having the integrity of Option 2 (since those fields also FK back to the supplier_and_product relationship table). Which means you really should FK the product_id and supplier_id fields in the Price table back to both the relationship table and their respective parent tables since it will help with JOINs between Price and the two parent tables.
Other tips
I'd combine the two. Add Price to your SuppliersAndProducts table (and maybe call it something else, like Catalog or Offerings), and when the price changes end the first record and start another.
CREATE TABLE Catalog
(
    CatalogID     INT NOT NULL IDENTITY,
    SupplierID    INT NOT NULL REFERENCES Suppliers (SupplierID),
    ProductID     INT NOT NULL REFERENCES Products (ProductID),
    Price         DECIMAL(19, 2), -- Nullable
    EffectiveDate DATE WITH OFFSET NOT NULL DEFAULT NOW(),
    EndDate       DATE WITH OFFSET -- Nullable
)
PK on CatalogID, biz key on { SupplierID, ProductID, EffectiveDate }.
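To make "end the first record and start another" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column types are adapted to SQLite, the FKs to Suppliers and Products are left out for brevity, and the helper name change_price and the sample dates are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Catalog (
    CatalogID     INTEGER PRIMARY KEY,
    SupplierID    INTEGER NOT NULL,
    ProductID     INTEGER NOT NULL,
    Price         NUMERIC,          -- nullable
    EffectiveDate TEXT NOT NULL,
    EndDate       TEXT,             -- NULL marks the open (current) record
    UNIQUE (SupplierID, ProductID, EffectiveDate)  -- the "biz key"
);
INSERT INTO Catalog (SupplierID, ProductID, Price, EffectiveDate)
VALUES (1, 1, 100, '2014-07-01');
""")

def change_price(conn, supplier_id, product_id, new_price, as_of):
    """Close the open record and start a new one in a single transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE Catalog SET EndDate = ? "
            "WHERE SupplierID = ? AND ProductID = ? AND EndDate IS NULL",
            (as_of, supplier_id, product_id),
        )
        conn.execute(
            "INSERT INTO Catalog (SupplierID, ProductID, Price, EffectiveDate) "
            "VALUES (?, ?, ?, ?)",
            (supplier_id, product_id, new_price, as_of),
        )

# A holiday discount and its reversal produce three records for one offering.
change_price(conn, 1, 1, 75, '2014-12-01')
change_price(conn, 1, 1, 100, '2015-01-01')
history = conn.execute(
    "SELECT Price, EffectiveDate, EndDate FROM Catalog ORDER BY EffectiveDate"
).fetchall()
```

Wrapping the update and the insert in one transaction keeps the table from ever showing two open records, or none, for the same supplier and product.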
For comparison, in most of the sales-like databases I've seen, the price is stored only in the OrderDetails table, which is simpler still but obviously does not let you get the historical price of a product if you did not order it from a particular supplier in the date range in question.
To illustrate srutsky's point about granularity: this model will create more supplier/product records than the OP's second model, where prices are tracked in a separate table from when the products are sold. If we start with a simple example, my model is simpler; if there are two suppliers for a single product, that's two records:
Catalog
Supplier #1, Product A, 2014/07 to NULL, $100
Supplier #2, Product A, 2014/11 to NULL, $200
...whereas the other model has four records:
Offerings
Supplier #1, Product A, 2014/07 to NULL
Supplier #2, Product A, 2014/11 to NULL
Pricing
Reference { Supplier #1, Product A }, 2014/07 to NULL, $100
Reference { Supplier #2, Product A }, 2014/11 to NULL, $200
However, my model leads to more records in its (only) table once prices start to change. Let's say both suppliers give a 25% discount for the holidays and then revert it:
Catalog
Supplier #1, Product A, 2014/07 to 2014/12, $100
Supplier #1, Product A, 2014/12 to 2015/01, $75
Supplier #1, Product A, 2015/01 to NULL, $100
Supplier #2, Product A, 2014/11 to 2014/12, $200
Supplier #2, Product A, 2014/12 to 2015/01, $150
Supplier #2, Product A, 2015/01 to NULL, $200
...vs:
Offerings
Supplier #1, Product A, 2014/07 to NULL
Supplier #2, Product A, 2014/11 to NULL
Pricing
Reference { Supplier #1, Product A }, 2014/07 to 2014/12, $100
Reference { Supplier #1, Product A }, 2014/12 to 2015/01, $75
Reference { Supplier #1, Product A }, 2015/01 to NULL, $100
Reference { Supplier #2, Product A }, 2014/11 to 2014/12, $200
Reference { Supplier #2, Product A }, 2014/12 to 2015/01, $150
Reference { Supplier #2, Product A }, 2015/01 to NULL, $200
In the Offerings + Pricing model, only the Pricing table grows when prices change. The Offerings table does not, so queries which don't need pricing are not burdened with extra rows. Querying Offerings is very clean, at the cost that querying Pricing is more complicated.
For this example, the difference is small. However, if there are many more attributes of an offering, like manufacturer's SKU or credit terms, the single Catalog table will get more rows as well as more columns. If the supplier changes their SKU, that's a new record; if they change their terms a month later, that's another new record. In extremis, it might get to the point that some attribute changes every day, and your Catalog table would degenerate into "what was true for this supplier, for this product, on this specific day."
Now, this is only a problem to the extent that the attributes in Catalog are volatile. I suspect that, in practice, a supplier's price, SKU, and terms will rarely change more than once or twice a year, so we're talking about tens of thousands of rows, not millions. One would need to know more about your industry, the size of your product and supplier bases, and the kind of additional attributes you might want, to judge for sure. Srutsky's model will scale better, so if you're thinking big that's the right path. If your need is small- or even medium-scale, I suspect you're better off keeping it simple.