RDBMS: Proper way to store data - comma separated variables or in different fields or tables?

https://dba.stackexchange.com/questions/268830

03-03-2021
|

Question

Fundamental question that I cannot seem to find an answer on. I have a database that stores line items from receipts along with the username and receipt number.

Right now the lineItem column is just a long string of data separated by commas (original file appears to have been just an Excel file). This information is parsed in a PHP script for viewing on the front end.

The table looks like this:

|----------|----------|----------|
|lineItem  |receiptID |customerID|
|----------|----------|----------|
|CD, DVD,  |001       |User01    |
|----------|----------|----------|
|CD, CD,   |002       |User02    |
|DVD, usb, |          |          |
|----------|----------|----------|

Ultimately, is this bad practice? Should the lineItem values be linked to related values in another table instead maybe?

Solution

Why storing data as a string a problem :

Storing multiple datums as (in this case comma separated) strings is bad practice because:

- First reason:

It breaches Codd's second rule (called the "Guaranteed Access Rule") which states that Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

So, if you wish to refer to user02's USB lineItem, you have to do further processing beyond simply knowing the table name, the PRIMARY KEY and the column name.

From here: Among the conventional database models, the Relational Model of data has a simple, sound mathematical foundation based on the notions of set theory. And from the wiki on the Relational Model (RM) we have:

The relational model was the first database model to be described in formal mathematical terms. Hierarchical and network databases existed before relational databases, but their specifications were relatively informal. After the relational model was defined, there were many attempts to compare and contrast the different models, and this led to the emergence of more rigorous descriptions of the earlier models; though the procedural nature of the data manipulation interfaces for hierarchical and network databases limited the scope for formalization.

So, basically, the only data model with a sound mathematical foundation is the relational one. Most relational databases use the SQL data definition and query language; these systems implement what can be regarded as an engineering approximation to the relational model. [ibid].

Codd derived his rules as a guide to a practical implementation of his relational calculus - given that it is the only model with a sound mathematical foundation, it seems that it would be a bad idea to breach any of them.

Caveat: now, if for example, you will NEVER EVER want to break out the line items into their individual components, then storing it as one "unit" would be acceptable, but I can see many instances of where you would want to split it up into its component parts (see the fifth reason below).

An example of where you might want to store data in .csv form might be storing somebody's name and title for an academic journal - it might be stored thus:

Citizen, Seán B., Prof.

and that's the only way you're ever going to print/process/transmit/store this information, then it is a datum, not comma separated variables - datum or data is very much a contextual concept.

- Second reason:

As mentioned in comments, your lineItem table isn't even in first normal form (see the chart here - Atomic columns (cells have single value). This is obviously related to the point above. Database normalisation is

the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.

These "forms" flowed from the RM/Relational Calculus and Codd's rules as a way of ensuring that data remains consistent, which is obviously of paramount importance in any database system - in simple terms, it's how we ensure that the definitive archetype of a given datum is stored in one place and one place only.

- Third reason:

You have no way of controlling what data is entered into that field - i.e. you have no way of controlling Declarative Referential Integrity (DRI). This means, for example, that there's nothing to stop you referring to products which don't exist (say, DVDx).

DRI is one of the most important benefits of using the RM - it means that internal data consistency can be maintained, the benefits of which you will greatly appreciate if you've ever you've had the misfortune to have worked with a system where this has broken down.

In point two, we said that Normal Forms were to ensure that the definitive archetype of a given datum is stored in one place and one place only - DRI ensures that all other references to that datum point to that one place and nowhere else.

- Fourth reason:

SQL is not designed for parsing strings - it can be done, it's just messy, time-consuming and error-prone. Many proprietary extensions have been developed by the various RDBMS providers to try and overcome this lacuna, but it's still far easier to deal with properly normalised tables (see the SQL below).

- Fifth reason:

Apart from the "theoretical" (more or less) reasons for not doing this, is the massive PRACTICAL problem of not being able to assign individual quantities and prices to items under your schema - suppose I'm doing my Christmas shopping and I wanted the new "U2 CD" for 3 of my friends who are U2 fanatics? No way of telling the system that there are 3 U2 CD's other than having a field value like this:

'"U2 CD", "U2 CD", "U2 CD" "UB40 CD", "U2 DVD", "Kingston USB 32GB"' -- note repetition of "U2 CD".

Suppose you want to know the number of USB's sold? The number of USB's per client? The number per client district/area/country - depending on the scale of your operation (see SQL below)? Suppose I want to know how much was spent on USB drives last week - absolutely no way of getting any of that information! The list goes on...

How to deal with the problem:

So, having dealt with the first part of your question, we can now progress to the second part - Should the lineItem values be linked to relational values in another table instead maybe?.

- First solution (extra fields):

Here is a further example of the problems associated with storing strings. In this case, adding fields to a given record is the solution - i.e. splitting the string into its component parts and making each one a field! Very helpful for enforcing DRI and controlling data correctness if one has reference tables for (in this case) ZIP codes, street names &c...

- Second solution (extra records - 1-to-many relationships):

In this particular case of your question, what we have here is a classic 1-many relationship - also known as parent-child where receipt is the parent and line_item is the child.

Your table structure is this:

CREATE TABLE line_item
(
  lineItem VARCHAR(2000),  -- could have a many items - need a very long string - parsing a nightmare! 
  receiptID INTEGER,       -- "001" could be a string - MySQL has a zero-fill function
  customeID VARCHAR(20)     -- redundant - don't need to store it for every line_item - it corresponds to a receipt (1 customer/receipt), not a line_item!
);

What you should have is something like this (see the fiddle here - all data and tables are also given at the bottom of this answer):

CREATE TABLE line_item
(
  receipt_id INTEGER NOT NULL,
  item_id INTEGER NOT NULL,
  item_qty INTEGER NOT NULL,
  CONSTRAINT line_item_pk PRIMARY KEY (receipt_id, item_id),
  CONSTRAINT li_item_fk FOREIGN KEY (item_id) REFERENCES item (item_id),
  CONSTRAINT li_receipt_fk FOREIGN KEY (receipt_id) REFERENCES receipt (receipt_id)
);

and your data will (rather cryptically) look like this:

INSERT INTO line_item VALUES
(1, 1, 1), (1, 4, 1), (2, 2, 1), (2, 3, 1), (2, 5, 1);

The receipt_id fields and the item_id fields point to the PRIMARY KEYs of their respective tables - and there is no redundant, extraneous information in the table - no customer_id stored multiple times for example! This way of modelling allows one to write queries of the form:

SELECT 
  c.customer_id, c.customer_name, c.customer_address_1,
  i.item_desc, i.item_price, 
  r.receipt_id, 
  li.item_id, li.item_qty
FROM 
  customer c
JOIN receipt r 
  ON c.customer_id = r.customer_id
JOIN line_item li 
  ON r.receipt_id = li.receipt_id
JOIN item i 
  ON li.item_id = i.item_id;

Result:

customer_id  customer_name  customer_address_1  item_desc          item_price   receipt_id  item_id     item_qty
1            Bill Gates     Redmond             Michael Jackson CD      1.50              1     1          1
1            Bill Gates     Redmond             U2 DVD                   5.00             1     4          1
2            Larry Ellison  Redwood Shores      U2 CD                    2.00             2     2          1
2            Larry Ellison  Redwood Shores      UB40 CD                 4.00              2     3          1
2            Larry Ellison  Redwood Shores      Kingston USB 32GB       25.00             2     5          1

See the fiddle (or below) for the all the DDL and DML! I challenge you to do this trivially with a .csv string containing your line_item products - especially in MySQL! It probably would be feasible in PostgreSQL using something like array_to_table after feeding the string into an array but I leave that as an exercise for you!

So, for a 1-many relationship, you add items to your line_item table - one item for each element in your .csv string - 1 receipt parent record can have 1 to many (possibly a very large number) of line_item children.

Now, the item table is also a parent of line_item and in its case, there can be 0 to many children if, for example, if an item hasn't sold at all there will be no references to it in the line_item table.

- Third solution (extra tables - many-to-many relationships):

There is an appropriate circumstance when "values SHOULD be linked to relational values in another table" (as you hint at in your question) and this is when there is an m-to-n relationship - otherwise know as a many-to-many relationship.

Consider the old favourite Databases-101 example of students and courses and the many courses taken by many students! See the fiddle here - I haven't populated the tables this time. I've used PostgreSQL for the fiddle (my favourite server) but a bit of tweaking will get it to work on any reasonable RDBMS.

Create tables course and student:

CREATE TABLE course
(
  course_id SERIAL,  -- INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY in MySQL dialect
  course_year SMALLINT NOT NULL,
  course_name VARCHAR (100) NOT NULL,
  CONSTRAINT course_pk PRIMARY KEY (course_id)
);


CREATE TABLE student
(
  student_id SERIAL,
  student_name VARCHAR (50),
  CONSTRAINT student_pk PRIMARY KEY (student_id)
);

This is where a JOINing table (aka linking table(more formally known as an [Associative Entity`]13 - as an aside, there are 17 different names for this type of table on that page) comes in.

a given student can take many courses
a given course can have many students

So, you deal with this by creating the Associative Entity - your JOINing table:

CREATE TABLE registration 
(
  reg_course_id INTEGER NOT NULL,
  reg_student_id INTEGER NOT NULL,

  CONSTRAINT reg_course_fk FOREIGN KEY (reg_course_id) REFERENCES course (course_id),
  CONSTRAINT reg_student_fk FOREIGN KEY (reg_student_id) REFERENCES student (student_id)
);

I then add a PRIMARY KEY - I kept it out of the table definition to illustrate the point, but it could (and normally would) be part of the table creation DDL.

ALTER TABLE registration
ADD CONSTRAINT registration_pk 
PRIMARY KEY (reg_course_id, reg_student_id);

So now,

a given student can only be enroll in a given course once and
a given course can only have the same student enrolled once

There are many other situations where this construct is useful - basically, it's the only way to meaningfully model many real-life situations.

An example from my own career:

Think of a flight table containing a flight_id field, a list of departure and arrival airports and the relevant times and then also a crew table with crew members and a crew_id field (and other details obviously).

Having the flight_id and crew_id fields in a JOINing table proved to be very useful for the system - it really helped with scheduling and rostering which was a mess with the other system - constant conflicts in both. It takes time and experience to recognise when which schema design is suitable for which scenario, but 1-many (extra records in existing table) and many-many (extra JOINing table) is a good rule of thumb!

p.s. welcome to the forum!

_ Full DDL and DML ___

`Customer table:`

CREATE TABLE customer  -- storing the customer_id on every line item is redundant - check out 3rd normal form
(
  customer_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_name VARCHAR (100) NOT NULL,
  customer_address_1 VARCHAR (100) NOT NULL -- can have address_1..n
  --
  -- other fields of particular interest to you
  --
);

`Customer data:`

INSERT INTO customer (customer_name, customer_address_1) VALUES 
('Bill Gates', 'Redmond'), ('Larry Ellison', 'Redwood Shores');

`item table:`

CREATE TABLE item ( item_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, item_code VARCHAR (25) NOT NULL UNIQUE, item_desc VARCHAR (200) NOT NULL, item_price DECIMAL(10, 2), item_supplier INTEGER NOT NULL -- refers to supplier table - not shown! -- -- other fields of interest to you -- );

`item data:`

INSERT INTO item (item_code, item_desc, item_price, item_supplier) VALUES
('code_1', 'Michael Jackson CD', 1.5, 56), ('code_2', 'U2 CD', 2, 78), ('code_3', 'UB40 CD', 4, 67),
('code_4', 'U2 DVD', 5, 78), ('code_5', 'Kingston USB 32GB', 25, 23);

`receipt table:`

CREATE TABLE receipt -- often called "orders" but receipt is OK ( receipt_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, customer_id INTEGER NOT NULL, -- refer to customer table - see below

receipt_total DECIMAL(10, 2), -- kept updated by a trigger (not shown) -- can be calculated on the fly or -- possibly a generated field receipt_dt TIMESTAMP NOT NULL, -- date and time of sale receipt_asst INTEGER, -- refers to the sales assistant table - not shown CONSTRAINT rec_cust_fk FOREIGN KEY (customer_id) REFERENCES customer (customer_id) );

`receipt data:`

INSERT INTO receipt (customer_id, receipt_total, receipt_dt, receipt_asst)
VALUES
(1, 6.5, '2020-06-03 15:23:45.123', 34),
(2, 31 , '2020-06-05 10:54:23.123', 17);

`line_item table:`

CREATE TABLE line_item
(
  receipt_id INTEGER NOT NULL,
  item_id INTEGER NOT NULL,
  item_qty INTEGER NOT NULL,
  CONSTRAINT line_item_pk PRIMARY KEY (receipt_id, item_id),
  CONSTRAINT li_item_fk FOREIGN KEY (item_id) REFERENCES item (item_id),
  CONSTRAINT li_receipt_fk FOREIGN KEY (receipt_id) REFERENCES receipt (receipt_id)
);

`line_item data:`

INSERT INTO line_item VALUES
(1, 1, 1), (1, 4, 1), (2, 2, 1), (2, 3, 1), (2, 5, 1);

Query:

SELECT 
  c.customer_id, c.customer_name, c.customer_address_1,
  i.item_desc, i.item_price, 
  r.receipt_id, 
  li.item_id, li.item_qty
FROM 
  customer c
JOIN receipt r 
  ON c.customer_id = r.customer_id
JOIN line_item li 
  ON r.receipt_id = li.receipt_id
JOIN item i 
  ON li.item_id = i.item_id;

Result:

customer_id  customer_name  customer_address_1  item_desc          item_price   receipt_id  item_id     item_qty
1            Bill Gates     Redmond             Michael Jackson CD      1.50              1     1          1
1            Bill Gates     Redmond             U2 DVD                   5.00             1     4          1
2            Larry Ellison  Redwood Shores      U2 CD                    2.00             2     2          1
2            Larry Ellison  Redwood Shores      UB40 CD                 4.00              2     3          1
2            Larry Ellison  Redwood Shores      Kingston USB 32GB       25.00             2     5          1

OTHER TIPS

In general, having a commalist in a column is a no-no.

If you ever need to ask MySQL to search for an item in that column or to sort on some part of that column, you will find that SQL is clumsy and inefficient. It is likely to be so clumsy that you will start over.

[Now I'll play devil's advocate.]

On the other hand, if you never need for MySQL to look inside that column, then it can be treated as an opaque blob of stuff. An example of such is a JPEG image.

On the other, other, hand, if the column is formatted in JSON, then there are some tools with which newer versions of MySQL/MariaDB can extract or search pieces. Still, this should not be the core of your schema design.

On the other, other, other, hand, if the columns is a string of "words", a FULLTEXT index can be a very efficient way for searching. But not much else.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange

RDBMS: Proper way to store data - comma separated variables or in different fields or tables?

Why storing data as a string a problem :

- First reason:

- Second reason:

- Third reason:

- Fourth reason:

- Fifth reason:

How to deal with the problem:

- First solution (extra fields):

- Second solution (extra records - 1-to-many relationships):

- Third solution (extra tables - many-to-many relationships):

_____________ Full DDL and DML _______________

Customer table:

Customer data:

item table:

item data:

receipt table:

receipt data:

line_item table:

line_item data:

Query:

Result:

_ Full DDL and DML ___

`Customer table:`

`Customer data:`

`item table:`

`item data:`

`receipt table:`

`receipt data:`

`line_item table:`

`line_item data:`