How would you model data variables variance on common scheme? SQL

https://stackoverflow.com/questions/662514

20-08-2019

Question

I was thinking about some stuff lately and I was wondering what would be the RIGHT way to do something like the following scenario (I'm sure it is a quite common thing for DB guys to do something like it).

Let's say you have a products table, something like this (MySQL):

CREATE TABLE `products` (
  `id` int(11) NOT NULL auto_increment,
  `product_name` varchar(255) default NULL,
  `product_description` text,
  KEY `id` (`id`),
  KEY `product_name` (`product_name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Nothing out of the ordinary here. Now let's say that there is a hierarchy of categories in a different table, and a separate table that holds the many-to-many relationships with the products table - so that each product belongs to some kind of category (I'll omit those, because that's not the issue here).

Now comes the interesting part - what IF each of the categories mandates an additional set of variables for its product items? For example, products in the computer monitors category must have an LCD/CRT enum field, a screen size enum etc. - and some other category, let's say ice creams, has other variables like flavor varchar, shelf storage time int etc.

The problem is that all products have a common set of variables (id, name, description and the like), but there are additional variables which are not consistent from category to category. All products should still share the common set, because in the end they all belong to the products group, so one can query for example SELECT * FROM products ORDER BY company_id (trivial example, maybe not representative, but you get the picture).

Now, I see several potential resolutions:
- generate a separate table for each product category and store products there with the appropriate additional variables - stupid and not query friendly

- the products table stays the same with the common variables, and for each category a separate table holds the additional variables, bound to the products table with a JOIN - normalized, but with query performance and clarity issues - how would one filter products by category (1st table - products) and then filter again on an extra variable (e.g. 17" LCD monitors)? It would require SQL JOIN trickery

- the products table stays the same, plus another column of type text that holds, for example, JSON data with the additional variables - compact and neat, but can't filter through the variables with SQL

I know I'm missing something quite obvious and simple here - I'm a bit rusty on the normalization techniques :)

edit: I searched Stack Overflow before asking this question, without success. However, after posting the question I clicked on one of my tags, 'normalization', and found several similar questions, which led me to look up 'generalization specialization relational design'. The point of the story is that this must be the first time in my internet life that tags have actually been useful in search. However, I would still like to hear from you guys and get your opinions.

edit2: The problem with approach no. 2 is that I expect somewhere around ~1000 specializations. There is a hierarchy of categories (1-4 levels deep) and the end nodes add specialized variables - they add up to around ~1000, so it would be a bit impractical to add specialized tables to join with.

edit3: Due to the high attribute volatility in my case, the "entity attribute value" model that was suggested looks like the way to go. Here come the query nightmares! Thanks guys.

Solution

How many product types do you expect? Do they each have their own application logic?

You can do a generalized model called the "entity attribute value" model, but it has a LOT of pitfalls when you're trying to deal with specific properties of a product. Simple search queries turn into real nightmares at times. The basic idea is that you have a table that holds the product ID, property name (or ID into a properties table), and the value. You can also add in tables to hold templates for each product type. So one set of tables would tell you for any given product what properties it can have (possibly along with valid value ranges) and another set of tables would tell you for any individual product what the values are.
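As a rough illustration of the layout described above, here is a minimal sketch using Python's built-in sqlite3 module (SQLite stands in for MySQL here). The table names category_properties and product_values, the column names, and the sample rows are all hypothetical. It also shows why searches get awkward: filtering on two properties takes one join against the value table per property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    product_name TEXT
);
-- Template table (hypothetical): which properties a category may have.
CREATE TABLE category_properties (
    category TEXT NOT NULL,
    property TEXT NOT NULL,
    PRIMARY KEY (category, property)
);
-- Value table (hypothetical): one row per product per property.
CREATE TABLE product_values (
    product_id INTEGER NOT NULL REFERENCES products(id),
    property   TEXT NOT NULL,
    value      TEXT,
    PRIMARY KEY (product_id, property)
);
INSERT INTO products VALUES
    (1, 'Monitor A'), (2, 'Monitor B'), (3, 'Vanilla tub');
INSERT INTO category_properties VALUES
    ('monitors', 'panel'), ('monitors', 'screen_size'),
    ('ice creams', 'flavor');
INSERT INTO product_values VALUES
    (1, 'panel', 'LCD'), (1, 'screen_size', '17'),
    (2, 'panel', 'CRT'), (2, 'screen_size', '17'),
    (3, 'flavor', 'vanilla');
""")

# "17-inch LCD monitors": each filtered property needs its own join.
rows = conn.execute("""
    SELECT p.id, p.product_name
    FROM products p
    JOIN product_values panel
      ON panel.product_id = p.id AND panel.property = 'panel'
    JOIN product_values sz
      ON sz.product_id = p.id AND sz.property = 'screen_size'
    WHERE panel.value = 'LCD' AND sz.value = '17'
    ORDER BY p.id
""").fetchall()
print(rows)
```

Note that every value is stored as text in this sketch, so numeric comparisons such as "screen size greater than 17" would need casts, which is another of the pitfalls mentioned above.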

I would caution strongly against using this model though, since it seems like a really slick idea until you have to actually implement it.

If your number of product types is reasonably limited, I'd go with your second solution - one main product table with base attributes and then additional tables for each specific type of product.
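A minimal sketch of that second solution, again using Python's sqlite3 module in place of MySQL. The specialization tables product_monitors and product_ice_creams, their columns, and the sample rows are made up for illustration. Queries over the common attributes touch only products, and a category-specific filter is a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base table with the attributes every product shares.
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    product_name TEXT,
    product_description TEXT
);
-- One table per product type (hypothetical names and columns),
-- keyed by the base table's id.
CREATE TABLE product_monitors (
    product_id  INTEGER PRIMARY KEY REFERENCES products(id),
    panel       TEXT CHECK (panel IN ('LCD', 'CRT')),
    screen_size INTEGER
);
CREATE TABLE product_ice_creams (
    product_id INTEGER PRIMARY KEY REFERENCES products(id),
    flavor     TEXT,
    shelf_days INTEGER
);
INSERT INTO products VALUES
    (1, 'Monitor A', NULL), (2, 'Monitor B', NULL), (3, 'Vanilla tub', NULL);
INSERT INTO product_monitors VALUES (1, 'LCD', 17), (2, 'CRT', 17);
INSERT INTO product_ice_creams VALUES (3, 'vanilla', 180);
""")

# Common attributes: no join needed.
all_names = [r[0] for r in conn.execute(
    "SELECT product_name FROM products ORDER BY product_name")]

# Category-specific filter: one join to the specialization table.
lcd17 = conn.execute("""
    SELECT p.id, p.product_name
    FROM products p
    JOIN product_monitors m ON m.product_id = p.id
    WHERE m.panel = 'LCD' AND m.screen_size = 17
""").fetchall()
print(all_names, lcd17)
```

The trade-off is that each new product type means a new table, which is why this only works well when the number of types stays small.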

OTHER TIPS

I've been doing this in Oracle.

I had the following tables:

t_class (id RAW(16), parent RAW(16)) -- holds the class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- holds GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties

t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.

RAW(16) is the type Oracle uses to hold GUIDs.

If you want to select all properties for an object, you issue:

SELECT  i.*
FROM    (
        SELECT  id 
        FROM    t_class
        START WITH
                id = (SELECT class FROM t_declaration WHERE id = :object_id)
        CONNECT BY
                id = PRIOR parent  -- walk up to the ancestors so inherited properties are included
        ) c
JOIN    t_property p
ON      p.class = c.id
LEFT JOIN
        t_instance i
ON      i.id = :object_id
        AND i.class = p.class
        AND i.property = p.property

t_property holds the stuff you normally don't search on (text descriptions and the like).

Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.

You don't have to use fast tables; you can keep all your data in these four tables.

For your task it will look like this (I'll use strings in square brackets instead of GUIDs for the sake of brevity):

t_class

id             parent

[ClassItem]    [ClassUnknown]
[ClassMonitor] [ClassItem]
[ClassLCD]     [ClassMonitor]

t_property

class          property

[ClassItem]    price
[ClassItem]    vendor
[ClassItem]    model
[ClassMonitor] size
[ClassLCD]     matrixType

t_declaration

id             class
[1]            [ClassLCD] -- Iiyama ProLite E1700

t_instance  -- I'll put all values into one column, disregarding type (INT, VARCHAR etc)

id             class           property         value

[1]            [ClassItem]     price            $300
[1]            [ClassItem]     vendor           Iiyama
[1]            [ClassItem]     model            ProLite E1700s
[1]            [ClassMonitor]  size             17
[1]            [ClassLCD]      matrixType       TFT
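The sample rows above can be loaded into an in-memory SQLite database to try the "all properties for an object" lookup. This is only a sketch: SQLite has no CONNECT BY, so a recursive common table expression is swapped in to walk from the object's class up through its ancestors, plain text strings replace the GUIDs, and the single value column follows the simplification used in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_class (id TEXT PRIMARY KEY, parent TEXT);
CREATE TABLE t_property (class TEXT, property TEXT);
CREATE TABLE t_declaration (id TEXT PRIMARY KEY, class TEXT);
CREATE TABLE t_instance (id TEXT, class TEXT, property TEXT, value TEXT);

INSERT INTO t_class VALUES
    ('ClassItem', 'ClassUnknown'),
    ('ClassMonitor', 'ClassItem'),
    ('ClassLCD', 'ClassMonitor');
INSERT INTO t_property VALUES
    ('ClassItem', 'price'), ('ClassItem', 'vendor'), ('ClassItem', 'model'),
    ('ClassMonitor', 'size'), ('ClassLCD', 'matrixType');
INSERT INTO t_declaration VALUES ('1', 'ClassLCD');
INSERT INTO t_instance VALUES
    ('1', 'ClassItem', 'price', '$300'),
    ('1', 'ClassItem', 'vendor', 'Iiyama'),
    ('1', 'ClassItem', 'model', 'ProLite E1700s'),
    ('1', 'ClassMonitor', 'size', '17'),
    ('1', 'ClassLCD', 'matrixType', 'TFT');
""")

rows = conn.execute("""
    WITH RECURSIVE c(id, parent) AS (
        -- Start at the class the object was declared with...
        SELECT id, parent FROM t_class
        WHERE id = (SELECT class FROM t_declaration WHERE id = :object_id)
        UNION ALL
        -- ...then climb to each parent class in turn.
        SELECT t.id, t.parent FROM t_class t JOIN c ON t.id = c.parent
    )
    SELECT p.class, p.property, i.value
    FROM c
    JOIN t_property p ON p.class = c.id
    LEFT JOIN t_instance i
      ON i.id = :object_id AND i.class = p.class AND i.property = p.property
    ORDER BY p.class, p.property
""", {"object_id": "1"}).fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN means a property declared on a class still shows up, with a NULL value, when the object has no value stored for it.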

If you need some complex query that searches on, say, size AND matrixType, you may remove them from t_property and t_instance and create another table:

t_lcd (id RAW(16), size INT, matrixType VARCHAR2(200))

id             size            matrixType

[1]            17              TFT

and use it to join with other properties instead of t_declaration in the query above.
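A small sketch of how such a fast table might be used, once more with SQLite standing in for Oracle and text strings for GUIDs. The second object ('2') and its values are invented so that the filter has something to exclude. The search runs against ordinary typed columns in t_lcd, and the remaining properties are then picked up from t_instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_instance (id TEXT, class TEXT, property TEXT, value TEXT);
-- Fast table: searchable properties as real, typed columns.
CREATE TABLE t_lcd (id TEXT PRIMARY KEY, size INTEGER, matrixType TEXT);

INSERT INTO t_instance VALUES
    ('1', 'ClassItem', 'price', '$300'),
    ('1', 'ClassItem', 'vendor', 'Iiyama'),
    ('1', 'ClassItem', 'model', 'ProLite E1700s'),
    ('2', 'ClassItem', 'vendor', 'Hypothetical Vendor');  -- invented row
INSERT INTO t_lcd VALUES
    ('1', 17, 'TFT'),
    ('2', 19, 'TFT');  -- invented row
""")

# Filter on the fast table, then join to the generic property rows.
rows = conn.execute("""
    SELECT l.id, i.property, i.value
    FROM t_lcd l
    JOIN t_instance i ON i.id = l.id
    WHERE l.size = 17 AND l.matrixType = 'TFT'
    ORDER BY i.property
""").fetchall()
print(rows)
```

Because size is a real integer column here, it can be indexed and compared numerically, which the generic value column does not allow directly.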

But this model is viable even without the fast tables.

There is a name for this pattern. It's called "generalization specialization".

If you search on "generalization specialization modeling" you'll get some articles on how to do this. Some of these articles lean towards relational modeling and SQL, while others lean towards object modeling.

Licensed under: CC-BY-SA with attribution
