Associating on the ID of an association table

https://dba.stackexchange.com/questions/83293

11-12-2020

Question

Sorry about the title, I couldn't find anything better. Suggestions are welcome.

Say we have two tables, Suppliers and Products. The same product can come from many different suppliers, so we create a third table, let's call it SuppliersAndProducts, with this schema

SuppliersAndProducts

  - id (autoincrement)
  - supplier_id
  - product_id
  - from_date
  - to_date

from_ and to_date are there to account for the fact that a given supplier might well stop selling a given product, but could also begin selling it again at some point in the future.

Now, we want to store the price we pay when we buy a Product from a given Supplier. Prices, of course, change over time, and we want to keep track of that too.

So we introduce a table, let's call it SuppliersAndPrices, structured like so:

SuppliersAndPrices

  - id (autoincrement)
  - supplier_id (FK to the supplier id)
  - product_id (FK to the product id)
  - price
  - from_date
  - to_date

My question about this last table is more conceptual than anything else: should this table be like I described (meaning that the association is based on both the supplier and product IDs) or instead just reference the id from SuppliersAndProducts with a column that we could call suppliers_and_products_id?

The latter design is more normalized than the former; after all, the association between a Supplier and a Product is already stated in SuppliersAndProducts, so there would be little point in repeating that information. Still, at least to me, for some reason the former feels closer to the real world.

From the point of view of query complexity, to know the price I'm paying today for a given product, with the first design I would have to write (with some pseudo-SQL to remain db-agnostic) something like:

SELECT price 
FROM   SuppliersAndPrices 
WHERE  supplier_id = X 
       AND product_id = Y 
       AND today is between from_date and to_date

whereas the second design would entail a join, thus making the query look like:

SELECT price 
FROM   SuppliersAndPrices 
       INNER JOIN SuppliersAndProducts 
               ON SuppliersAndPrices.suppliers_and_products_id = SuppliersAndProducts.id 
WHERE  SuppliersAndProducts.supplier_id = X 
       AND SuppliersAndProducts.product_id = Y 
       AND today is between SuppliersAndPrices.from_date and SuppliersAndPrices.to_date

I'm writing this off the top of my head, sorry about my SQL.
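
For reference, here is a minimal runnable sketch of both queries, using Python's built-in sqlite3 module. Several things here are assumptions made only for illustration: SQLite as the engine, the table names PricesDesign1 and PricesDesign2, dates stored as ISO-8601 text, a NULL to_date meaning "still valid", and half-open date ranges (from_date inclusive, to_date exclusive).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SuppliersAndProducts (
    id          INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT                 -- NULL = still offered (assumption)
);
-- First design: price rows carry supplier_id and product_id directly.
CREATE TABLE PricesDesign1 (
    id          INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    price       NUMERIC NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT
);
-- Second design: price rows point at the association row instead.
CREATE TABLE PricesDesign2 (
    id                        INTEGER PRIMARY KEY,
    suppliers_and_products_id INTEGER NOT NULL
        REFERENCES SuppliersAndProducts (id),
    price                     NUMERIC NOT NULL,
    from_date                 TEXT NOT NULL,
    to_date                   TEXT
);
INSERT INTO SuppliersAndProducts VALUES (1, 10, 20, '2014-07-01', NULL);
INSERT INTO PricesDesign1 VALUES (1, 10, 20, 100, '2014-07-01', '2014-12-01');
INSERT INTO PricesDesign1 VALUES (2, 10, 20, 75,  '2014-12-01', NULL);
INSERT INTO PricesDesign2 VALUES (1, 1, 100, '2014-07-01', '2014-12-01');
INSERT INTO PricesDesign2 VALUES (2, 1, 75,  '2014-12-01', NULL);
""")

# First design: a single-table lookup.
QUERY_DESIGN1 = """
SELECT price
FROM   PricesDesign1
WHERE  supplier_id = :supplier
       AND product_id = :product
       AND from_date <= :day
       AND (to_date IS NULL OR :day < to_date)
"""

# Second design: the same lookup needs a join to reach supplier and product.
QUERY_DESIGN2 = """
SELECT pr.price
FROM   PricesDesign2 AS pr
       INNER JOIN SuppliersAndProducts AS sp
               ON pr.suppliers_and_products_id = sp.id
WHERE  sp.supplier_id = :supplier
       AND sp.product_id = :product
       AND pr.from_date <= :day
       AND (pr.to_date IS NULL OR :day < pr.to_date)
"""

params = {"supplier": 10, "product": 20, "day": "2014-12-15"}
rows_design1 = conn.execute(QUERY_DESIGN1, params).fetchall()
rows_design2 = conn.execute(QUERY_DESIGN2, params).fetchall()
print(rows_design1, rows_design2)
```

Both queries pick the same single price row for the chosen day; the only difference is the join.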

Also, now that I think about it, with the second design I would have to add another condition to the WHERE clause to check the dates in SuppliersAndProducts too, to handle the case in which a Supplier will stop carrying a product and, at some point, begin selling it again. In that case the join condition would return more than one row, and that would be Not A Good Thing™.

So, which one would you choose? Associating on the id of what already is an association in the name of greater normalization but at the price of increasing query complexity? Denormalizing a tiny bit to make querying easier and, arguably, the database structure clearer for future maintainers?

The query for the second design would probably be hidden behind a view, but that's still code that has to be maintained and understood. What I really want to know is which one of the two designs is "better", for some definition of "better".

Solution

Just to get this part out of the way, what is the intention of keeping the historical data? If it is for the app to allow users to interact with on some level, then ok, as that is still a transactional need. Else, if it is merely for reporting purposes, that could be tracked entirely in a separate server / database / schema as you are really talking about a Slowly Changing Dimension (SCD). If it only exists for reporting then no need to complicate your transactional model with anything but the current info.

That being said, I will assume the data is needed for transactional purposes, in which case there is no question (in my mind, at least :) that Option 2 (i.e. relating the Price info with the Supplier and Product relationship table) is the only way to go. If you relate the Price properties back to their respective sources then you allow for invalid combinations of Suppliers and Products that don't exist in the SuppliersAndProducts table. And while performance and maintainability are very important considerations when designing, they are secondary to data integrity as that is the primary responsibility of the database.
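
The integrity argument can be seen in a small sketch. It uses Python's sqlite3 module purely for illustration; the table names price_option1 and price_option2 are hypothetical, and the date columns are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY);
CREATE TABLE supplier_and_product (
    supplier_and_product_id INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id),
    product_id  INTEGER NOT NULL REFERENCES product (product_id)
);
-- Option 1: the price references supplier and product independently.
CREATE TABLE price_option1 (
    supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id),
    product_id  INTEGER NOT NULL REFERENCES product (product_id),
    price       NUMERIC NOT NULL
);
-- Option 2: the price references the relationship row.
CREATE TABLE price_option2 (
    supplier_and_product_id INTEGER NOT NULL
        REFERENCES supplier_and_product (supplier_and_product_id),
    price NUMERIC NOT NULL
);
INSERT INTO supplier VALUES (1), (2);
INSERT INTO product  VALUES (100);
-- Only supplier 1 sells product 100.
INSERT INTO supplier_and_product VALUES (1, 1, 100);
""")

# Option 1 accepts a price for supplier 2 / product 100, although
# supplier 2 never offered that product: both individual FKs are satisfied.
conn.execute("INSERT INTO price_option1 VALUES (2, 100, 9.99)")
orphan_prices = conn.execute("""
    SELECT COUNT(*)
    FROM   price_option1 AS p
    WHERE  NOT EXISTS (SELECT 1
                       FROM   supplier_and_product AS sp
                       WHERE  sp.supplier_id = p.supplier_id
                              AND sp.product_id = p.product_id)
""").fetchone()[0]

# Option 2 can only name a relationship row that actually exists.
try:
    conn.execute("INSERT INTO price_option2 VALUES (42, 9.99)")
    option2_rejected = False
except sqlite3.IntegrityError:
    option2_rejected = True

print(orphan_prices, option2_rejected)
```

With Option 1 the database stores a price for a pairing that was never declared; with Option 2 the same mistake is refused by the foreign key.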

Some notes:

It is better to use the table name + "id" rather than the generic "id". This will make writing queries a lot easier as the field name will be the same between tables. Queries in general will be more readable.
While I also like naming tables as plural (it just sounds better), using the singular does make things easier. This way the "id" field of the table includes the table name without sounding odd, such as "SuppliersID" (as opposed to "SupplierID"). If you ever have to code automated processes against the tables, it is much easier to assume that the ID field is simply "{TableName}ID"; otherwise you might need to hit the database to do a lookup. And for new people being trained it is just easier to know that any table will have a standard ID field name, which makes writing queries faster. This is why I no longer use plural table names :). So, for example, SuppliersAndProducts would be SupplierAndProduct or maybe even SupplierXProduct.
If it weren't for the historical values then SupplierAndProduct wouldn't need an auto-incrementing ID as the combination of SupplierID and ProductID would be the composite PK.
But since we do have that history we can do two things with that relationship table:
1. PK is the SupplierAndProductID field
2. Unique Index (representing an "Alternate Key") on the combination of the SupplierAndProductID, SupplierID, and ProductID fields
For the SupplierAndProductPrice (or SupplierXProductPrice) table we would have (just in terms of the Key fields here; no changes to the "price" and date, etc fields):
```
- supplier_and_product_price_id (autoincrement) -- I will stop PascalCasing now ;-)
- supplier_id (FK to supplier_and_product; could also FK to supplier)
- product_id (FK to supplier_and_product; could also FK to product)
- supplier_and_product_id (FK to supplier_and_product)
```
In this model, since the supplier_and_product_id field lacks any real meaning, we bring along the other two fields and the combination of all 3 will relate back to the Unique Index on supplier_and_product. I generally try to avoid FKs to Unique Indexes/Constraints but this is one situation where it makes sense to do it.

Putting all 3 fields together in a Unique Index in order to reference the FK to it allows for bringing over the two needed fields (supplier_id and product_id) while guaranteeing that the combination of those two is always a valid combination (as it is enforced by the FK).
I would recommend against combining these into a single table as it duplicates relationship info of Product and Supplier when the Price changes during the from and to dates of the Product-to-Supplier relationship. But, that is just for the transactional side. On the reporting side it is definitely a good idea to combine them as suggested by @JonofAllTrades.
Regarding the need for the extra WHERE condition to check the date against the from and to dates in supplier_and_product: this is not a bad idea, but it is technically just a safeguard, assuming that the app doesn't allow a given Price entry to have dates that fall outside of the time frame in which the Product is being supplied by that Supplier. If the dates are validated against such oddities, then you can get away with the simplified query of Option 1 (since the product_id and supplier_id fields are there) while having the integrity of Option 2 (since those fields also FK back to the supplier_and_product relationship table). This means you really should FK the product_id and supplier_id fields in the Price table back to both the relationship table and their respective parent tables, since that will help with JOINs between Price and the two parent tables.
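
A rough sketch of the three-column key described in these notes, again using Python's sqlite3 module as an assumed stand-in for whatever engine is in use. The optional FKs to the supplier and product parent tables are omitted for brevity, and the NULL to_date and half-open date range conventions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE supplier_and_product (
    supplier_and_product_id INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT,
    -- The "Alternate Key" that the price table will reference.
    UNIQUE (supplier_and_product_id, supplier_id, product_id)
);
CREATE TABLE supplier_and_product_price (
    supplier_and_product_price_id INTEGER PRIMARY KEY,
    supplier_and_product_id INTEGER NOT NULL,
    supplier_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    price       NUMERIC NOT NULL,
    from_date   TEXT NOT NULL,
    to_date     TEXT,
    -- All three columns together must match one relationship row.
    FOREIGN KEY (supplier_and_product_id, supplier_id, product_id)
        REFERENCES supplier_and_product
                   (supplier_and_product_id, supplier_id, product_id)
);
INSERT INTO supplier_and_product VALUES (1, 10, 20, '2014-07-01', NULL);
INSERT INTO supplier_and_product_price
VALUES (1, 1, 10, 20, 100, '2014-07-01', NULL);
""")

# A price row whose copied product_id disagrees with the relationship row
# is rejected, so the denormalized columns cannot drift.
try:
    conn.execute(
        "INSERT INTO supplier_and_product_price "
        "VALUES (2, 1, 10, 99, 50, '2014-07-01', NULL)"
    )
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True

# The Option 1 style query now works without a join.
current_price = conn.execute(
    """
    SELECT price
    FROM   supplier_and_product_price
    WHERE  supplier_id = :supplier
           AND product_id = :product
           AND from_date <= :day
           AND (to_date IS NULL OR :day < to_date)
    """,
    {"supplier": 10, "product": 20, "day": "2015-01-15"},
).fetchone()[0]

print(mismatch_rejected, current_price)
```

The composite FK is what makes it safe to carry supplier_id and product_id in the price table: they are copies, but the database guarantees they agree with the relationship row.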

Other tips

I'd combine the two. Add Price to your SuppliersAndProducts table (and maybe call it something else, like Catalog or Offerings), and when the price changes end the first record and start another.

-- Assuming SQL Server (T-SQL), as IDENTITY suggests
CREATE TABLE Catalog
    (
    CatalogID     INT NOT NULL IDENTITY,
    SupplierID    INT NOT NULL REFERENCES Suppliers (SupplierID),
    ProductID     INT NOT NULL REFERENCES Products (ProductID),
    Price         DECIMAL(19, 2),  -- Nullable
    EffectiveDate DATETIMEOFFSET NOT NULL DEFAULT SYSDATETIMEOFFSET(),
    EndDate       DATETIMEOFFSET  -- Nullable
    );

PK on CatalogID, biz key on { SupplierID, ProductID, EffectiveDate }.
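
The "end the first record and start another" step could look roughly like the sketch below. It is written against SQLite through Python's sqlite3 module, which is an assumption made so the example can run anywhere; the change_price helper is hypothetical, dates are ISO-8601 text, and the FKs to Suppliers and Products are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Catalog (
    CatalogID     INTEGER PRIMARY KEY,
    SupplierID    INTEGER NOT NULL,
    ProductID     INTEGER NOT NULL,
    Price         NUMERIC,            -- Nullable
    EffectiveDate TEXT NOT NULL,
    EndDate       TEXT,               -- Nullable: NULL = current record
    UNIQUE (SupplierID, ProductID, EffectiveDate)   -- the business key
);
INSERT INTO Catalog (SupplierID, ProductID, Price, EffectiveDate)
VALUES (1, 1, 100, '2014-07-01');
""")


def change_price(conn, supplier_id, product_id, new_price, effective_date):
    """End the open record, if any, and start a new one on the same date."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "UPDATE Catalog SET EndDate = ? "
            "WHERE SupplierID = ? AND ProductID = ? AND EndDate IS NULL",
            (effective_date, supplier_id, product_id),
        )
        conn.execute(
            "INSERT INTO Catalog (SupplierID, ProductID, Price, EffectiveDate) "
            "VALUES (?, ?, ?, ?)",
            (supplier_id, product_id, new_price, effective_date),
        )


# A holiday discount followed by a return to the regular price.
change_price(conn, 1, 1, 75, "2014-12-01")
change_price(conn, 1, 1, 100, "2015-01-01")

history = conn.execute(
    "SELECT Price, EffectiveDate, EndDate FROM Catalog ORDER BY EffectiveDate"
).fetchall()
for row in history:
    print(row)
```

Each price change costs one UPDATE and one INSERT, and the open record is always the one with a NULL EndDate.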

For comparison. In most of the sales-like databases I've seen, the price is stored only in the OrderDetails table, which is simpler still but obviously does not let you get the historical price of a product if you did not order it from a particular supplier in the date range in question.

To illustrate srutsky's point about granularity: this model will create more supplier/product records than the OP's second model, where prices are tracked in a table separate from the one recording when the products are sold. If we start with a simple example, my model is simpler; if there are two suppliers for a single product, that's two records:

Catalog
Supplier #1, Product A, 2014/07 to NULL, $100
Supplier #2, Product A, 2014/11 to NULL, $200

...whereas the other model has four records:

Offerings
Supplier #1, Product A, 2014/07 to NULL
Supplier #2, Product A, 2014/11 to NULL

Pricing
Reference { Supplier #1, Product A }, 2014/07 to NULL, $100
Reference { Supplier #2, Product A }, 2014/11 to NULL, $200

However, my model leads to more records in its (only) table once prices start to change. Let's say both suppliers give a 25% discount for the holidays and then revert it:

Catalog
Supplier #1, Product A, 2014/07 to 2014/12, $100
Supplier #1, Product A, 2014/12 to 2015/01, $75
Supplier #1, Product A, 2015/01 to NULL, $100
Supplier #2, Product A, 2014/11 to 2014/12, $200
Supplier #2, Product A, 2014/12 to 2015/01, $150
Supplier #2, Product A, 2015/01 to NULL, $200

...vs:

Offerings
Supplier #1, Product A, 2014/07 to NULL
Supplier #2, Product A, 2014/11 to NULL

Pricing
Reference { Supplier #1, Product A }, 2014/07 to 2014/12, $100
Reference { Supplier #1, Product A }, 2014/12 to 2015/01, $75
Reference { Supplier #1, Product A }, 2015/01 to NULL, $100
Reference { Supplier #2, Product A }, 2014/11 to 2014/12, $200
Reference { Supplier #2, Product A }, 2014/12 to 2015/01, $150
Reference { Supplier #2, Product A }, 2015/01 to NULL, $200

In the Offerings + Pricing model, only the Pricing table grows when prices change. The Offerings table does not, so queries which don't need pricing are not burdened with extra rows. Querying Offerings is very clean, at the cost that querying Pricing is more complicated.

For this example, the difference is small. However, if there are many more attributes of an offering, like manufacturer's SKU or credit terms, the single Catalog table will get more rows as well as more columns. If the supplier changes their SKU, that's a new record; if they change their terms a month later, that's another new record. In extremis, it might get to the point that some attribute changes every day, and your Catalog table would degenerate into "what was true for this supplier, for this product, on this specific day."

Now, this is only a problem to the extent that the attributes in Catalog are volatile. I suspect that, in practice, a supplier's price, SKU, and terms will rarely change more than once or twice a year, so we're talking about tens of thousands of rows, not millions. One would need to know more about your industry, the size of your product and supplier bases, and the kind of additional attributes you might want, to judge for sure. Srutsky's model will scale better, so if you're thinking big that's the right path. If your need is small- or even medium-scale, I suspect you're better off keeping it simple.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange