EAV - COUNT()/SUM() on column and row values with and without NULLs

https://dba.stackexchange.com/questions/283130

13-03-2021

Question

Assume that there is a table

demo

with these data:

id	userID	title	data
1	1	DATE	01-01-2021
2	1	COLOR	RED
3	2	DATE
4	2	COLOR	YELLOW
5	3	DATE	04-01-2021
6	3	COLOR	RED
5	4	DATE	04-01-2021
6	4	COLOR	PINK
7	5	DATE	02-01-2021
8	5	COLOR	PINK
7	6	DATE	08-01-2021
8	6	COLOR	PINK

If the data value of a DATE row is empty, that color has not been picked up yet.

The overview I need to show on a PHP web page is:

Picked-up of RED: 2
Total RED: 2

Picked-up of Yellow: 0
Total Yellow: 1

Picked-up of Pink: 3
Total Pink: 3

Solution

Hi and welcome to the forum!

Your schema requires a radical overhaul.

You have used the EAV (Entity-Attribute-Value) database design anti-pattern - search for the writings of Joe Celko, Bill Karwin and others (and links therein) on this topic.

Also, and very important, you should store VARCHAR()s as VARCHAR()s and DATEs as DATEs - you will see this discussed when you read around EAV - data types go out the window and this means that the query optimiser can't generate an optimal plan. This applies to other types as well.

You can see this effect here:

CREATE TABLE date_test (x VARCHAR(10));

INSERT INTO date_test VALUES ('15-05-2021'), ('25-10-2000');

SELECT * FROM date_test ORDER BY x;

Result:

         x
15-05-2021     -- <<==== WRONG order!
25-10-2000

One of the reasons that correct data types are important! Even if you are obliged to store dates as strings (I realise that programmers are subject to constraints), you should use the ISO date format (i.e. YYYY-MM-DD).
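To see why the ISO format helps even when dates must stay as strings, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption made only so the example runs without a MySQL server), and the two-column layout of date_test is hypothetical, holding the same two dates in both formats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: the same two dates as DD-MM-YYYY text and as ISO text.
conn.execute("CREATE TABLE date_test (dmy TEXT, iso TEXT)")
conn.executemany(
    "INSERT INTO date_test VALUES (?, ?)",
    [("15-05-2021", "2021-05-15"), ("25-10-2000", "2000-10-25")],
)

# DD-MM-YYYY text is compared day digits first, so the order is not chronological.
dmy_order = [r[0] for r in conn.execute("SELECT dmy FROM date_test ORDER BY dmy")]
# ISO text is compared year, then month, then day, so the order is chronological.
iso_order = [r[0] for r in conn.execute("SELECT iso FROM date_test ORDER BY iso")]

print(dmy_order)  # ['15-05-2021', '25-10-2000']
print(iso_order)  # ['2000-10-25', '2021-05-15']
```

String comparison and date comparison only agree when the most significant part (the year) comes first, which is exactly what YYYY-MM-DD gives you.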

TL;DR - the answer!

If you just want the answer, then it's:

SELECT 
  dt.color AS "Color", 
  COUNT(dt.c_date) AS "Collected",
  COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
ORDER BY dt.color;

Result:

  Color Collected   Total
   PINK         3       3
    RED         2       2
 YELLOW         1       2

It also has the advantage of being the fastest query of those proposed - see the PROFILING section below!

Tackling the schema:

What you should do is restructure the data (see fiddle here - adapted from @Akina's - I added a collected garment of yellow colour). I don't know what's going on with the id column, so I've ignored it hereafter.

If you can't change the schema due to other constraints, create a VIEW - I appreciate that DBAs/programmers don't always have (complete) control over their environments.

View:

CREATE VIEW demo_view AS
(
  SELECT 
    x.user_id AS "uid", 
    x.data AS "color", 
    CAST(
          CONCAT
          (
            SUBSTRING(y.data, 7, 4),  -- 'YYYY'
            SUBSTRING(y.data, 3, 4),  -- '-MM-' (both separators included)
            SUBSTRING(y.data, 1, 2)   -- 'DD'
          ) AS DATE
        ) AS c_date
  FROM demo x
  JOIN demo y
    ON x.user_id = y.user_id
    AND x.title = 'COLOR' 
    AND y.title = 'DATE'
);
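If you'd like to try the pivot idea without a MySQL server, the following is a rough sketch using Python's built-in sqlite3 module (an assumption on my part, not the fiddle's engine). SQLite uses SUBSTR and the || operator where the view above uses SUBSTRING and CONCAT, it keeps c_date as ISO text rather than a real DATE, and only the first three users from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (user_id INTEGER, title TEXT, data TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", [
    (1, "DATE", "01-01-2021"), (1, "COLOR", "RED"),
    (2, "DATE", None),         (2, "COLOR", "YELLOW"),  # not picked up yet
    (3, "DATE", "04-01-2021"), (3, "COLOR", "RED"),
])

# One row per user: the COLOR row (x) joined to the DATE row (y).
conn.execute("""
    CREATE VIEW demo_view AS
    SELECT x.user_id AS uid,
           x.data    AS color,
           -- DD-MM-YYYY -> YYYY-MM-DD; a NULL input stays NULL
           SUBSTR(y.data, 7, 4) || '-' ||
           SUBSTR(y.data, 4, 2) || '-' ||
           SUBSTR(y.data, 1, 2) AS c_date
    FROM demo x
    JOIN demo y ON x.user_id = y.user_id
    WHERE x.title = 'COLOR'
      AND y.title = 'DATE'
""")

rows = conn.execute(
    "SELECT uid, color, c_date FROM demo_view ORDER BY uid"
).fetchall()
for row in rows:
    print(row)
```

The point is that each user ends up as a single row with properly named columns, which is what makes the later aggregate queries so short.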

Table:

CREATE TABLE demo_tab AS
(
  SELECT * FROM demo_view
  
  --
  -- Could also construct it as above!
  -- 
);

Then your query on the table (could be on the view also - see fiddle) will be as simple as:

SELECT 
  dt.color AS "Color", 
  SUM(CASE
        WHEN dt.c_date IS NOT NULL THEN 1
        ELSE 0
      END
     ) AS "Collected",
  COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color;

Result (identical for all queries):

 Color   Collected   Total
   RED           2       2
YELLOW           1       2
  PINK           3       3

Now, why, you may ask, am I using the "complex" SUM(CASE... instead of the relatively "easy" SUM(t1.data IS NOT NULL)... of Akina's answer here? If you try to run his fiddle using PostgreSQL (see here), it fails. The SUM(... IS NOT NULL) construct is MySQL specific, whereas the CASE version works with PostgreSQL, SQL Server and MySQL! If in doubt - test on PostgreSQL - if it doesn't work there, there's usually something fishy going on!
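As a quick check that the portable forms agree, here is a small sketch comparing COUNT(c_date) with the SUM(CASE ...) version on the same rows. It runs on Python's built-in sqlite3 module (a stand-in engine, an assumption on my part), and the demo_tab rows are reconstructed from the results shown in this answer, including the extra collected yellow garment for user 7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tab (uid INTEGER, color TEXT, c_date TEXT)")
# Rows reconstructed from the answer's results (an assumption, not the fiddle itself).
conn.executemany("INSERT INTO demo_tab VALUES (?, ?, ?)", [
    (1, "RED", "2021-01-01"),
    (2, "YELLOW", None),          # not picked up yet
    (3, "RED", "2021-01-04"),
    (4, "PINK", "2021-01-04"),
    (5, "PINK", "2021-01-02"),
    (6, "PINK", "2021-01-08"),
    (7, "YELLOW", "2027-07-07"),  # the added collected yellow garment
])

rows = conn.execute("""
    SELECT color,
           COUNT(c_date) AS collected_count,   -- COUNT(col) skips NULLs
           SUM(CASE WHEN c_date IS NOT NULL THEN 1 ELSE 0 END) AS collected_sum,
           COUNT(*) AS total                   -- COUNT(*) counts every row
    FROM demo_tab
    GROUP BY color
    ORDER BY color
""").fetchall()

for row in rows:
    print(row)
# ('PINK', 3, 3, 3)
# ('RED', 2, 2, 2)
# ('YELLOW', 1, 1, 2)
```

Both columns give the same numbers because COUNT(column) ignores NULLs by definition, so either form is safe to use across engines.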

If you want to (or are obliged to) persist with the schema as it currently is, then run this:

SELECT 
  t2.data AS "Color",
  SUM(CASE
        WHEN t1.data IS NOT NULL THEN 1
        ELSE 0
      END
     ) AS "Collected",
  COUNT(*) AS "Total",
  CAST(SUM(CASE
             WHEN t1.data IS NOT NULL THEN 1
             ELSE 0
           END
          ) AS FLOAT) * 100 / COUNT(*) AS "% collected"
FROM demo t1
JOIN demo t2 ON t1.user_id = t2.user_id
WHERE t1.title = 'DATE'
  AND t2.title = 'COLOR'
GROUP BY t2.data
ORDER BY 1;

Result (with a percentage-collected column added):

  Color Collected   Total   % collected
   PINK         3       3           100
    RED         2       2           100
 YELLOW         1       2            50
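One thing to watch with the schema as it stands: the question shows the DATE value for user 2 as blank, and it isn't clear whether that is a NULL or an empty string. If it is an empty string (an assumption for this sketch), IS NOT NULL and COUNT(data) will count it as collected. The sketch below, run on Python's built-in sqlite3 module as a stand-in engine, shows the difference and how NULLIF(data, '') normalises it. The data is the question's original six users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (user_id INTEGER, title TEXT, data TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", [
    (1, "DATE", "01-01-2021"), (1, "COLOR", "RED"),
    (2, "DATE", ""),           (2, "COLOR", "YELLOW"),  # '' rather than NULL
    (3, "DATE", "04-01-2021"), (3, "COLOR", "RED"),
    (4, "DATE", "04-01-2021"), (4, "COLOR", "PINK"),
    (5, "DATE", "02-01-2021"), (5, "COLOR", "PINK"),
    (6, "DATE", "08-01-2021"), (6, "COLOR", "PINK"),
])

# Same self-join as above; only the counted expression changes.
QUERY = """
    SELECT t2.data AS color,
           COUNT({expr}) AS collected,
           COUNT(*) AS total
    FROM demo t1
    JOIN demo t2 ON t1.user_id = t2.user_id
    WHERE t1.title = 'DATE'
      AND t2.title = 'COLOR'
    GROUP BY t2.data
    ORDER BY t2.data
"""

naive = conn.execute(QUERY.format(expr="t1.data")).fetchall()
fixed = conn.execute(QUERY.format(expr="NULLIF(t1.data, '')")).fetchall()

print(naive)  # [('PINK', 3, 3), ('RED', 2, 2), ('YELLOW', 1, 1)]
print(fixed)  # [('PINK', 3, 3), ('RED', 2, 2), ('YELLOW', 0, 1)]
```

NULLIF is standard SQL and is available in MySQL, PostgreSQL and SQL Server, so the same expression can be dropped into the SUM(CASE ...) query above if your empty dates turn out to be empty strings.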

I did a performance analysis - see fiddle - using profiling - check the manual:

SET PROFILING = 1;

-- Run your queries here

SHOW PROFILES;

Result (typical one shown - ran several times - may vary! Queries may be truncated):

Query_ID        Duration    Query
       1        0.00056325  SELECT 
  dt.color AS "Color", 
  COUNT(dt.c_date) AS "Collected",
  COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
ORDER BY dt.color
       2        0.00064050  SELECT 
  dt.color AS "Color", 
  SUM(CASE
      WHEN dt.c_date IS NOT NULL THEN 1
      ELSE 0
    END 
   ) AS "Collected",
COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
       3        0.00074950  SELECT 
  t2.data AS "Color",
SUM(CASE
      WHEN t1.data IS NOT NULL THEN 1
      ELSE 0
    END 
   ) AS "Collected",
COUNT(*) Total,
CAST(SUM(CASE
      WHEN t1.data IS NOT NULL THEN 1
      ELSE 0
    END 
   ) AS FLOAT) * 100 / COUNT(*) AS "% collected"
FROM demo t1
JOIN demo t2 ON t1.user_id =
       4        0.00075000  SELECT 
  v.color AS "Color", 
  SUM(CASE
      WHEN v.c_date IS NOT NULL THEN 1
      ELSE 0
    END 
   ) AS "Collected",
COUNT(*) AS "Total"
FROM demo_view v
GROUP BY v.color

SET PROFILING = 0;

The table-based query has the smallest duration in roughly 90% of runs (we don't know what else is going on with the dbfiddle server the rest of the time). The query plan for the table-based query is also much shorter (EXPLAIN ANALYZE - see fiddle) than the others - as a rough rule of thumb, a shorter plan usually means a faster query, although it's not guaranteed!

If you want to use standard SQL (and you should!), you can issue the SET sql_mode='ANSI'; command (see bottom of fiddle) - from the aforementioned Bill Karwin's answer here.

user_id  title        data           w           x
      1   DATE  01-01-2021  2021-01-01  2021-01-01
      2   DATE        NULL        NULL        NULL
      3   DATE  04-01-2021  2021-01-04  2021-01-04
      4   DATE  04-01-2021  2021-01-04  2021-01-04
      5   DATE  02-01-2021  2021-01-02  2021-01-02
      6   DATE  08-01-2021  2021-01-08  2021-01-08
      7   DATE  07-07-2027  2027-07-07  2027-07-07

I leave it as an exercise for the reader to put what are now valid dates into the SQL above - it should work the exact same as the MySQL specific construct - you can see this in the PostgreSQL fiddle here.

I also leave it up to you to put indexes on the necessary fields for large datasets - if you're having problems, get back to us here with a new question containing a link to this one! EXPLAIN ANALYZE should give a few hints in this regard!

OTHER TIPS

SELECT t2.data Color, SUM(t1.data IS NOT NULL) PickedUp, COUNT(*) Total
FROM demo t1
JOIN demo t2 USING (userID)
WHERE t1.title = 'DATE'
  AND t2.title = 'COLOR'
GROUP BY t2.data;

fiddle

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange