EAV - COUNT()/SUM() on column and row values with and without NULLs
Pregunta
Assume that there is a table
demo
with these data:
id | userID | title | data |
---|---|---|---|
1 | 1 | DATE | 01-01-2021 |
2 | 1 | COLOR | RED |
3 | 2 | DATE | |
4 | 2 | COLOR | YELLOW |
5 | 3 | DATE | 04-01-2021 |
6 | 3 | COLOR | RED |
5 | 4 | DATE | 04-01-2021 |
6 | 4 | COLOR | PINK |
7 | 5 | DATE | 02-01-2021 |
8 | 5 | COLOR | PINK |
7 | 6 | DATE | 08-01-2021 |
8 | 6 | COLOR | PINK |
If the data of the row date is empty color is not picked-up yet.
The information overview i need to view on website php page is:
Picked-up of RED: 2
Total RED: 2
Picked-up of Yellow: 0
Total Yellow: 1
Picked-up of Pink: 3
Total Pink: 3
Solución
Hi and welcome to the forum!
Your schema requires a totally radical overhaul. This image gives some idea of what's wrong with it:
You have used the EAV (Entity-Attribute-Value) database design anti-pattern - search for the writings of Joe Celko, Bill Karwin and others (and links therein) on this topic.
Also, and very important, you should store VARCHAR()
s as VARCHAR()
s and DATE
s as DATE
s - you will see this discussed when you read around EAV - data types go out the window and this means that the query optimiser can't generate an optimal plan. This applies to other types as well.
You can see this effect here:
CREATE TABLE date_test (x VARCHAR(10));
INSERT INTO date_test VALUES ('15-05-2021'), ('25-10-2000');
SELECT * FROM date_test ORDER BY x;
Result:
x
15-05-2021 -- <<==== WRONG order!
25-10-2000
One of the reasons that correct data types are important! Even if you are obliged to store dates as strings (I realise that programmers are subject to constraints), you should use the ISO date format (i.e. YYYY-MM-DD).
TL;DR - the answer!
If you just want the answer, then it's:
SELECT
dt.color AS "Color",
COUNT(dt.c_date) AS "Collected",
COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
ORDER BY dt.color;
Result:
Color Collected Total
PINK 3 3
RED 2 2
YELLOW 1 2
It also has the advantage of being the fastest query of those proposed - see the PROFILING
section below!
Tackling the schema:
What you should do is (see fiddle here - adapted from @Akina's - I added a collected garment of yellow colour. I don't know what's going on with the id
column and I've ignored it hereafter).
So, you create a VIEW
- if you can't change the schema due to other constraints - I appreciate that DBAs/programmers don't always have (complete) control over their environments.
View:
CREATE VIEW demo_view AS
(
SELECT
x.user_id AS "uid",
x.data AS "color",
CAST(
CONCAT
(
SUBSTRING(y.data, 7, 10),
SUBSTRING(y.data, 3, 4),
SUBSTRING(y.data, 1, 2)
) AS DATE
) AS c_date
FROM demo x
JOIN demo y
ON x.user_id = y.user_id
AND x.title = 'COLOR'
AND y.title = 'DATE'
);
Table:
CREATE TABLE demo_tab AS
(
SELECT * FROM demo_view
--
-- Could also construct it as above!
--
);
Then your query on the table (could be on the view also - see fiddle) will be as simple as:
SELECT
dt.color AS "Color",
SUM(CASE
WHEN dt.c_date IS NOT NULL THEN 1
ELSE 0
END
) AS "Collected",
COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color;
Result (identical for all queries):
Color Collected Total
RED 2 2
YELLOW 1 2
PINK 3 3
Now, why, you may ask am I using the "complex" SUM(CASE...
instead of the relatively "easy" SUM(t1.data IS NOT NULL)...
of Akina's answer here. If you try and run his fiddle using PostgreSQL (see here), it fails. The SUM(... IS NOT NULL)
construct is MySQL specific whereas the other works with PostgreSQL, SQLServer and MySQL! If in doubt - test on PostgreSQL - if it doesn't work there, there's usually something fishy going on!
If you want to (or are obliged to) persist with the schema as it currently is, then run this:
SELECT
t2.data AS "Color",
SUM(CASE
WHEN t1.data IS NOT NULL THEN 1
ELSE 0
END
) AS "Collected",
COUNT(*) Total,
CAST(SUM(CASE
WHEN t1.data IS NOT NULL THEN 1
ELSE 0
END
) AS FLOAT) * 100 / COUNT(*) AS "% collected"
FROM demo t1
JOIN demo t2 ON t1.user_id = t2.user_id
WHERE t1.title = 'DATE'
AND t2.title = 'COLOR'
GROUP BY t2.data
ORDER BY 1;
Result - added a percentage collected field to it!
Color Collected Total % collected
PINK 3 3 100
RED 2 2 100
YELLOW 1 2 50
I did a performance analysis - see fiddle - using profiling - check the manual:
SET PROFILING = 1;
Run your queries
SHOW PROFILES;
Result (typical one shown - ran several times - may vary! Queries may be truncated):
Query_ID Duration Query
1 0.00056325 SELECT
dt.color AS "Color",
COUNT(dt.c_date) AS "Collected",
COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
ORDER BY dt.color
Query_ID Duration Query
2 0.00064050 SELECT
dt.color AS "Color",
SUM(CASE
WHEN dt.c_date IS NOT NULL THEN 1
ELSE 0
END
) AS "Collected",
COUNT(*) AS "Total"
FROM demo_tab dt
GROUP BY dt.color
3 0.00074950 SELECT
t2.data AS "Color",
SUM(CASE
WHEN t1.data IS NOT NULL THEN 1
ELSE 0
END
) AS "Collected",
COUNT(*) Total,
CAST(SUM(CASE
WHEN t1.data IS NOT NULL THEN 1
ELSE 0
END
) AS FLOAT) * 100 / COUNT(*) AS "% collected"
FROM demo t1
JOIN demo t2 ON t1.user_id =
4 0.00075000 SELECT
v.color AS "Color",
SUM(CASE
WHEN v.c_date IS NOT NULL THEN 1
ELSE 0
END
) AS "Collected",
COUNT(*) AS "Total"
FROM demo_view v
GROUP BY v.color
SET PROFILING = 0;
The table-based query very frequently has the smallest duration (~ 90% of the time - we don't know what's going on with the rest of the dbfiddle server). The query plan for the table-based query is much shorter (EXPLAIN ANALYZE
- see fiddle) than the others - good rule of thumb, the shorter the plan, the faster the query!
If you want to use standard SQL (and you should!), you can use this command (see bottom of fiddle) - you can issue the SET sql_mode='ANSI';
command - from the aforementioned Bill Karwin's answer here.
user_id title data w x
1 DATE 01-01-2021 2021-01-01 2021-01-01
2 DATE NULL NULL NULL
3 DATE 04-01-2021 2021-01-04 2021-01-04
4 DATE 04-01-2021 2021-01-04 2021-01-04
5 DATE 02-01-2021 2021-01-02 2021-01-02
6 DATE 08-01-2021 2021-01-08 2021-01-08
7 DATE 07-07-2027 2027-07-07 2027-07-07
I leave it as an exercise for the reader to put what are now valid dates into the SQL above - it should work the exact same as the MySQL specific construct - you can see this in the PostgreSQL fiddle here.
I also leave it up to you to put in indexes on the necessary fields for large datasets - if you're having problems, get back to us here with a new question containing a link to the this one! EXPLAIN ANALYZE
should give a few hints in this regard!
Otros consejos
SELECT t2.data Color, SUM(t1.data IS NOT NULL) PickedUp, COUNT(*) Total
FROM demo t1
JOIN demo t2 USING (userID)
WHERE t1.title = 'DATE'
AND t2.title = 'COLOR'
GROUP BY t2.data;